Accessing an element by index on a List is O(n), so this is extremely inefficient. Also, what is the point of a Map whose values are constants? Just use a mutable.Set or a mutable.ListBuffer directly.
BTW, what was wrong with the suggestion I shared in the other thread?
Yes, thanks for your suggestion.
But I have tested many times, and using a Set is even slower than the HashMap version.
This is the set version:
import scala.util._

object Randomset {
  def subset[A](x: List[A], y: Int): List[A] = {
    if (x.size < y) {
      throw new Exception
    }
    val set = scala.collection.mutable.Set[A]()
    while (set.size < y) {
      val idx = Random.between(0, x.size)
      set += x(idx)
    }
    set.toList
  }

  def main(args: Array[String]): Unit = {
    val li = List.range(1, 1000000)
    val subli = subset(li, 5)
    println(subli)
  }
}
I am pretty sure the only reason the Set version was slower is that you keep accessing the List by index, which is extremely inefficient and may vary a lot from execution to execution. (Besides, I am fairly sure you are not doing a proper benchmark anyway, so any impression of performance you have is not meaningful.)
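For what it's worth, a minimal timing harness (just a sketch, not a substitute for a real harness like JMH; the name `timeIt` is mine) would at least add warmup before measuring:

```scala
// Crude timing sketch. A serious comparison should use JMH (e.g. via
// sbt-jmh) to account for JIT warmup, GC pauses, and measurement noise.
def timeIt[A](label: String)(body: => A): A = {
  (1 to 5).foreach(_ => body) // warm up the JIT a little
  val start = System.nanoTime()
  val result = body
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"$label: $elapsedMs%.2f ms")
  result
}
```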
Can Scala add more functions for data-industry use?
There’s a very elegant and satisfying algorithm for this that I love telling people about. I don’t know who invented it. (I suppose it’s in Knuth somewhere.)
Basically you walk down the list in order and at each step, you ask yourself: what is the chance of this item being included in the result? Then you roll the dice, select the element or not accordingly, and proceed until you have the required number of elements.
The odds at each step depend on how many items you’ve processed so far and how many items have been selected so far.
The order of the input is preserved in the output, which you might consider desirable, or not desirable, depending on your goals. If it’s not desirable, you can shuffle the result on the way out.
The same algorithm works regardless of whether you have efficient random access to the collection (e.g. Vector) or you don’t (e.g. List). If you have efficient random access, then you iterate by incrementing an index. If you don’t, you iterate by repeatedly taking the tail.
Here’s the List version:
def randomSubset[T](n: Int, xs: List[T]): List[T] = {
  val result = collection.mutable.ListBuffer[T]()
  var i = 0
  var resultSize = 0
  val size = xs.size
  var rest = xs
  while (resultSize < n) {
    if (util.Random.nextInt(size - i) < n - resultSize) {
      result += rest.head
      resultSize += 1
    }
    i += 1
    rest = rest.tail
  }
  result.toList
}
So beautiful!
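The same walk translates directly to the indexed case mentioned above; here’s a sketch assuming a Vector input (the name `randomSubsetIndexed` is mine):

```scala
import scala.collection.mutable.ListBuffer

// Same selection-sampling walk, but iterating by incrementing an index
// instead of repeatedly taking the tail; suits Vector or Array.
def randomSubsetIndexed[T](n: Int, xs: Vector[T]): List[T] = {
  val result = ListBuffer[T]()
  val size = xs.size
  var i = 0
  while (result.size < n) {
    // inclusion odds = (slots still needed) / (items still unprocessed)
    if (util.Random.nextInt(size - i) < n - result.size) {
      result += xs(i)
    }
    i += 1
  }
  result.toList
}
```

As with the List version, the input order is preserved in the output.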
What about efficiency? If your input collection allows efficient random access and n is small relative to the input size, then you’re better off picking random indices and ignoring duplicates. But if those conditions don’t hold, this algorithm is a contender. Like if you have 100 items and need to pick a random 50 of them, this is a good way.
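The pick-random-indices-and-ignore-duplicates approach might look like this sketch (assuming an IndexedSeq input; the name `randomSubsetByIndex` is mine):

```scala
import scala.collection.mutable

// Pick random indices, skipping duplicates; efficient when the input
// supports O(1) indexing and n is small relative to xs.size.
def randomSubsetByIndex[T](n: Int, xs: IndexedSeq[T]): List[T] = {
  require(n <= xs.size, "cannot take more elements than the input has")
  val chosen = mutable.LinkedHashSet[Int]() // preserves insertion order
  while (chosen.size < n) {
    chosen += util.Random.nextInt(xs.size) // duplicates are ignored
  }
  chosen.iterator.map(xs).toList
}
```

LinkedHashSet is used here rather than a plain HashSet so the result comes out in selection order rather than in hash order.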
Thank you Seth. But I insist on my version, which doesn’t use var, only val.
def subset[A](x: List[A], y: Int): List[A] = {
  if (x.size < y) {
    return x
  }
  val set = scala.collection.mutable.Set[Int]()
  while (set.size < y) {
    val idx = Random.between(0, x.size)
    set += idx
  }
  set.toList.map(i => x(i))
}
Not the same at all. Doesn’t solve the same problem.
On input of List.range(0, 100) and n = 10, a typical answer returned by your code is:
List(26, 21, 20, 18, 17, 14, 11, 4, 3, 2)
Note that all of the numbers returned are clustered together in the initial part of the range. I believe that’s not what @pengyh wanted: all items in the input collection should have an equal chance of being included in the result.