What do you think of my subset function?

For a given list (which may be very big), I want to sample a subset of a given size.

import scala.util._

object Randomset {

  def subset[A](x: List[A], y: Int): List[A] = {

    val size = x.size
    if (size < y) {
      throw new Exception
    }

    val hash = scala.collection.mutable.Map[A, Int]()

    // keep sampling random elements until we have y distinct keys
    while (hash.keys.size < y) {
      val idx = Random.between(0, size)
      hash(x(idx)) = 1
    }

    hash.keys.toList
  }

  def main(args: Array[String]): Unit = {
    val li = List(3, 2, 1, 4, 9, 8, 7)
    val subli = subset(li, 3)
    println(subli)

    val li2 = List("aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh", "ii", "jj")
    val subli2 = subset(li2, 3)
    println(subli2)
  }
}

Do you have any suggestions for my implementation?

Thanks

BTW, in my case the input list has no duplicate data. If there were duplicate items, this program would have bugs; for example, this code will never terminate:

  def main(args: Array[String]): Unit = {
    val li = List(3, 2, 1, 3, 3, 3)  // only 3 distinct values
    val subli = subset(li, 5)        // asks for 5, so the loop never ends
    println(subli)
  }

Thanks

For the issue above, I have improved the function as follows. Suggestions are welcome. :slight_smile:

  def subset[A](x: List[A], y: Int): List[A] = {

    // count the distinct elements up front so the loop below can terminate
    if (x.toSet.size < y) {
      throw new Exception
    }

    val hash = scala.collection.mutable.LinkedHashMap[A, Int]()

    while (hash.keys.size < y) {
      val idx = Random.between(0, x.size)
      hash(x(idx)) = 1
    }

    hash.keys.toList
  }

Yes, I know ‘toSet’ is very bad for performance here, but I couldn't come up with another way to count the elements without duplicates.
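
One way to make that check cheaper, as a sketch (hasAtLeastDistinct is an illustrative name, not from this thread): scan the list once and stop as soon as y distinct elements have been seen, instead of building a Set of the whole list.

  def hasAtLeastDistinct[A](xs: List[A], y: Int): Boolean = {
    // stop as soon as y distinct elements have been found
    val seen = scala.collection.mutable.Set[A]()
    val it = xs.iterator
    while (it.hasNext && seen.size < y) seen += it.next()
    seen.size >= y
  }

The guard if (x.toSet.size < y) would then become if (!hasAtLeastDistinct(x, y)), which exits early whenever y is small relative to the list.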

Accessing an element by index on a List is O(n), so this is extremely inefficient. Also, what is the point of a Map whose values are constants? Just use a mutable.Set or a mutable.ListBuffer directly.

BTW, what was wrong with the suggestion I shared in the other thread?

I have not checked the third-party implementation; I just wanted to share my idea for making one. :slight_smile:

Yes, thanks for your suggestion.
But I have tested many times, and using a Set is even slower than the HashMap version.
This is the set version:

import scala.util._

object Randomset {

  def subset[A](x: List[A], y: Int): List[A] = {

    if (x.size < y) {
      throw new Exception
    }

    val set = scala.collection.mutable.Set[A]()

    while (set.size < y) {
      val idx = Random.between(0, x.size)
      set += x(idx)
    }

    set.toList
  }

  def main(args: Array[String]): Unit = {
    val li = List.range(1, 1000000)
    val subli = subset(li, 5)
    println(subli)
  }
}

I finally decided to use this version; it resolves the duplicates issue with acceptable performance.

import scala.util._

object Randomset {

  def subset[A](x: List[A], y: Int): List[A] = {

    if (x.size < y) {
      return x
    }

    // sample distinct indices rather than distinct values, so duplicate
    // elements in the input list can no longer cause an endless loop
    val set = scala.collection.mutable.Set[Int]()

    while (set.size < y) {
      val idx = Random.between(0, x.size)
      set += idx
    }

    set.toList.map(i => x(i))
  }

  def main(args: Array[String]): Unit = {
    val li = List(3, 2, 1, 4, 3, 3, 3)
    val subli = subset(li, 5)
    println(subli)
  }
}

Thank you.

Can Scala add more functions for data-industry use, such as the methods in pandas or NumPy, which are widely used?

I am pretty sure the only reason the Set version was slower is that you keep accessing the List by index, which is extremely inefficient and may vary a lot from execution to execution (plus, I am sure you are not doing a proper benchmark anyway, so any impression of performance you have is meaningless).
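
If you do want to keep the pick-random-indices approach, a simple fix would be to copy the List into a Vector once, so every lookup afterwards is effectively constant time. Just a sketch (subsetViaVector is an illustrative name, and Random.between assumes Scala 2.13+):

  def subsetViaVector[A](x: List[A], y: Int): List[A] = {
    val v = x.toVector // one O(n) copy, then effectively O(1) access per lookup
    if (v.size < y) x
    else {
      val picked = scala.collection.mutable.Set[Int]()
      while (picked.size < y) picked += scala.util.Random.between(0, v.size)
      picked.toList.map(v) // a Vector is a function from index to element
    }
  }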

Can Scala add more functions for data-industry use?

There’s a very elegant and satisfying algorithm for this that I love telling people about. I don’t know who invented it. (I suppose it’s in Knuth somewhere.)

Basically you walk down the list in order and at each step, you ask yourself: what is the chance of this item being included in the result? Then you roll the dice, select the element or not accordingly, and proceed until you have the required number of elements.

The odds at each step depend on how many items you’ve processed so far and how many items have been selected so far: if you still need k more items and there are m items left to process, the current item is selected with probability k/m.

The order of the input is preserved in the output, which you might consider desirable, or not desirable, depending on your goals. If it’s not desirable, you can shuffle the result on the way out.

The same algorithm works regardless of whether you have efficient random access to the collection (e.g. Vector) or you don’t (e.g. List). If you have efficient random access, then you iterate by incrementing an index. If you don’t, you iterate by repeatedly taking the tail.

Here’s the List version:

def randomSubset[T](n: Int, xs: List[T]): List[T] = {
  val result = collection.mutable.ListBuffer[T]()
  var i = 0
  var resultSize = 0
  val size = xs.size
  var rest = xs
  while (resultSize < n) {
    // select the current head with probability (n - resultSize) / (size - i)
    if (util.Random.nextInt(size - i) < n - resultSize) {
      result += rest.head
      resultSize += 1
    }
    i += 1
    rest = rest.tail
  }
  result.toList
}

So beautiful!

What about efficiency? If your input collection allows efficient random access and n is small relative to the input size, then you’re better off picking random indices and ignoring duplicates. But if those conditions don’t hold, this algorithm is a contender. For example, if you have 100 items and need to pick a random 50 of them, this is a good way.
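
For what it’s worth, here is a sketch of the indexed variant mentioned above (randomSubsetIndexed is an illustrative name; it assumes an IndexedSeq such as Vector): the odds at each step are identical, only the iteration differs.

def randomSubsetIndexed[T](n: Int, xs: IndexedSeq[T]): List[T] = {
  val result = collection.mutable.ListBuffer[T]()
  var i = 0
  while (result.size < n) {
    // same odds as the List version: (n - selected) / (items remaining)
    if (util.Random.nextInt(xs.size - i) < n - result.size) result += xs(i)
    i += 1
  }
  result.toList
}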


Isn’t that mostly the same as what I did?
Except with mutability and with a variable probability that ensures a single traversal.

Thank you, Seth. But I will stick with my version, which doesn’t use var, only val.

  def subset[A](x: List[A], y: Int): List[A] = {

    if (x.size < y) {
      return x
    }

    val set = scala.collection.mutable.Set[Int]()

    while (set.size < y) {
      val idx = Random.between(0, x.size)
      set += idx
    }

    set.toList.map(i => x(i))
  }

Isn’t that mostly the same as what I did?

Not the same at all. Doesn’t solve the same problem.

On input of List.range(0, 100) and n = 10, a typical answer returned by your code is:

List(26, 21, 20, 18, 17, 14, 11, 4, 3, 2)

Note that all of the numbers returned are clustered together in the initial part of the range. I believe that’s not what @pengyh wanted: all items in the input collection should have an equal chance of being included in the result.


That reminds me why I almost failed statistics :sweat_smile:

Thanks!


@SethTisue, that is a beautiful algorithm. Thanks for taking the time to post it.

@pengyh, note that the code Seth wrote could be nicely converted to a recursive function that wouldn’t need any vars.
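
Something along these lines, for instance (a sketch only; randomSubsetRec and loop are illustrative names, not from this thread):

def randomSubsetRec[T](n: Int, xs: List[T]): List[T] = {
  // remaining = items not yet considered, needed = items still to select;
  // each head is kept with probability needed / remaining, as in Seth's version
  @annotation.tailrec
  def loop(rest: List[T], remaining: Int, needed: Int, acc: List[T]): List[T] =
    if (needed == 0) acc.reverse
    else if (util.Random.nextInt(remaining) < needed)
      loop(rest.tail, remaining - 1, needed - 1, rest.head :: acc)
    else
      loop(rest.tail, remaining - 1, needed, acc)

  loop(xs, xs.size, n, Nil)
}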
