What do you think of my subset function?

For a given list (which may be very big), I want to sample a subset of a given size.

import scala.util._

object Randomset {

  def subset[A](x: List[A], y: Int): List[A] = {

    val size = x.size
    if (size < y) {
      throw new Exception
    }

    val hash = scala.collection.mutable.Map[A, Int]()

    // keep sampling random elements until we have y distinct keys
    while (hash.keys.size < y) {
      val idx = Random.between(0, size)
      hash(x(idx)) = 1
    }

    hash.keys.toList
  }

  def main(args: Array[String]): Unit = {
    val li = List(3, 2, 1, 4, 9, 8, 7)
    val subli = subset(li, 3)
    println(subli)

    val li2 = List("aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh", "ii", "jj")
    val subli2 = subset(li2, 3)
    println(subli2)
  }
}

Do you have any suggestions for my implementation?

Thanks

BTW, in my case the input list has no duplicate data. If there were duplicate items, this program would have bugs; for example, this code will never terminate:

  def main(args: Array[String]): Unit = {
    val li = List(3, 2, 1, 3, 3, 3)  // only 3 distinct values
    val subli = subset(li, 5)        // asks for 5, so the loop never ends
    println(subli)
  }

Thanks

For the issue above, I have improved the function as follows. Suggestions are welcome. :slight_smile:

  def subset[A](x: List[A], y: Int): List[A] = {

    // count the distinct elements up front so the loop below can terminate
    if (x.toSet.size < y) {
      throw new Exception
    }

    val hash = scala.collection.mutable.LinkedHashMap[A, Int]()

    while (hash.keys.size < y) {
      val idx = Random.between(0, x.size)
      hash(x(idx)) = 1
    }

    hash.keys.toList
  }

Yes, I know ‘toSet’ is very bad for performance here, but I couldn't come up with another way to count the elements without duplicates.
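
One way to make that check cheaper, as a sketch (hasAtLeastDistinct is an illustrative name, not from this thread): scan the list once and stop as soon as y distinct elements have been seen, instead of building a Set of the whole list.

  def hasAtLeastDistinct[A](xs: List[A], y: Int): Boolean = {
    // stop as soon as y distinct elements have been found
    val seen = scala.collection.mutable.Set[A]()
    val it = xs.iterator
    while (it.hasNext && seen.size < y) seen += it.next()
    seen.size >= y
  }

The guard if (x.toSet.size < y) would then become if (!hasAtLeastDistinct(x, y)), which exits early whenever y is small relative to the list.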

Accessing an element by index on a List is O(n), so this is extremely inefficient. Also, what is the point of a Map whose values are constants? Just use a mutable.Set or a mutable.ListBuffer directly.

BTW, what was wrong with the suggestion I shared in the other thread?

I have not checked the third-party implementation; I just wanted to share my idea for making one. :slight_smile:

Yes, thanks for your suggestion.
But I have tested many times, and using a Set is even slower than the HashMap version.
This is the set version:

import scala.util._

object Randomset {

  def subset[A](x: List[A], y: Int): List[A] = {

    if (x.size < y) {
      throw new Exception
    }

    val set = scala.collection.mutable.Set[A]()

    while (set.size < y) {
      val idx = Random.between(0, x.size)
      set += x(idx)
    }

    set.toList
  }

  def main(args: Array[String]): Unit = {
    val li = List.range(1, 1000000)
    val subli = subset(li, 5)
    println(subli)
  }
}

I finally decided to use this version; it resolves the duplicates issue with acceptable performance.

import scala.util._

object Randomset {

  def subset[A](x: List[A], y: Int): List[A] = {

    if (x.size < y) {
      return x
    }

    // sample distinct indices rather than distinct values, so duplicate
    // elements in the input list can no longer cause an endless loop
    val set = scala.collection.mutable.Set[Int]()

    while (set.size < y) {
      val idx = Random.between(0, x.size)
      set += idx
    }

    set.toList.map(i => x(i))
  }

  def main(args: Array[String]): Unit = {
    val li = List(3, 2, 1, 4, 3, 3, 3)
    val subli = subset(li, 5)
    println(subli)
  }
}

Thank you.

Can Scala add more functions for data-industry use, such as the methods in pandas or NumPy, which are widely used?

I am pretty sure the only reason the Set version was slower is that you keep accessing the List by index, which is extremely inefficient and may vary a lot from execution to execution (plus, I am sure you are not doing a proper benchmark anyway, so any impression of performance you have is meaningless).
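
If you do want to keep the pick-random-indices approach, a simple fix would be to copy the List into a Vector once, so every lookup afterwards is effectively constant time. Just a sketch (subsetViaVector is an illustrative name, and Random.between assumes Scala 2.13+):

  def subsetViaVector[A](x: List[A], y: Int): List[A] = {
    val v = x.toVector // one O(n) copy, then effectively O(1) access per lookup
    if (v.size < y) x
    else {
      val picked = scala.collection.mutable.Set[Int]()
      while (picked.size < y) picked += scala.util.Random.between(0, v.size)
      picked.toList.map(v) // a Vector is a function from index to element
    }
  }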

Can Scala add more functions for data-industry use?

There’s a very elegant and satisfying algorithm for this that I love telling people about. I don’t know who invented it. (I suppose it’s in Knuth somewhere.)

Basically you walk down the list in order and at each step, you ask yourself: what is the chance of this item being included in the result? Then you roll the dice, select the element or not accordingly, and proceed until you have the required number of elements.

The odds at each step depend on how many items you’ve processed so far and how many items have been selected so far: if you still need k more items and there are m items left to process, the current item is selected with probability k/m.

The order of the input is preserved in the output, which you might consider desirable, or not desirable, depending on your goals. If it’s not desirable, you can shuffle the result on the way out.

The same algorithm works regardless of whether you have efficient random access to the collection (e.g. Vector) or you don’t (e.g. List). If you have efficient random access, then you iterate by incrementing an index. If you don’t, you iterate by repeatedly taking the tail.

Here’s the List version:

def randomSubset[T](n: Int, xs: List[T]): List[T] = {
  val result = collection.mutable.ListBuffer[T]()
  var i = 0
  var resultSize = 0
  val size = xs.size
  var rest = xs
  while (resultSize < n) {
    // select the current head with probability (n - resultSize) / (size - i)
    if (util.Random.nextInt(size - i) < n - resultSize) {
      result += rest.head
      resultSize += 1
    }
    i += 1
    rest = rest.tail
  }
  result.toList
}

So beautiful!

What about efficiency? If your input collection allows efficient random access and n is small relative to the input size, then you’re better off picking random indices and ignoring duplicates. But if those conditions don’t hold, this algorithm is a contender. For example, if you have 100 items and need to pick a random 50 of them, this is a good way.
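
For what it’s worth, here is a sketch of the indexed variant mentioned above (randomSubsetIndexed is an illustrative name; it assumes an IndexedSeq such as Vector): the odds at each step are identical, only the iteration differs.

def randomSubsetIndexed[T](n: Int, xs: IndexedSeq[T]): List[T] = {
  val result = collection.mutable.ListBuffer[T]()
  var i = 0
  while (result.size < n) {
    // same odds as the List version: (n - selected) / (items remaining)
    if (util.Random.nextInt(xs.size - i) < n - result.size) result += xs(i)
    i += 1
  }
  result.toList
}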


Isn’t that mostly the same as what I did?
Except with mutability and with a variable probability that ensures a single traversal.

Thank you, Seth. But I will stick with my version, which doesn’t use var, only val.

  def subset[A](x: List[A], y: Int): List[A] = {

    if (x.size < y) {
      return x
    }

    val set = scala.collection.mutable.Set[Int]()

    while (set.size < y) {
      val idx = Random.between(0, x.size)
      set += idx
    }

    set.toList.map(i => x(i))
  }

Isn’t that mostly the same as what I did?

Not the same at all. Doesn’t solve the same problem.

On input of List.range(0, 100) and n = 10, a typical answer returned by your code is:

List(26, 21, 20, 18, 17, 14, 11, 4, 3, 2)

Note that all of the numbers returned are clustered together in the initial part of the range. I believe that’s not what @pengyh wanted: all items in the input collection should have an equal chance of being included in the result.


That reminds me why I almost failed statistics :sweat_smile:

Thanks!


@SethTisue, that is a beautiful algorithm. Thanks for taking the time to post it.

@pengyh, note that the code Seth wrote could be nicely converted to a recursive function that wouldn’t need any vars.
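
Something along these lines, for instance (a sketch only; randomSubsetRec and loop are illustrative names, not from this thread):

def randomSubsetRec[T](n: Int, xs: List[T]): List[T] = {
  // remaining = items not yet considered, needed = items still to select;
  // each head is kept with probability needed / remaining, as in Seth's version
  @annotation.tailrec
  def loop(rest: List[T], remaining: Int, needed: Int, acc: List[T]): List[T] =
    if (needed == 0) acc.reverse
    else if (util.Random.nextInt(remaining) < needed)
      loop(rest.tail, remaining - 1, needed - 1, rest.head :: acc)
    else
      loop(rest.tail, remaining - 1, needed, acc)

  loop(xs, xs.size, n, Nil)
}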
