Find the N minimum elements

Can someone suggest the fastest way to take the `N` minimum elements from a `List`, assuming the length of the list is much greater than N. In my case N=10 and the list length is about 8500.

Currently I’m sorting the array and calling `take(10)`, but it turns out this loop is in the critical path of my program.

And by the way, I’m referring to minimum w.r.t. a measure function.

``````data.sortBy(measure).take(n)
``````

I’m tempted to copy the list into a mutable Array, and then implement in-place `selectionSort` which stops after it has swapped N elements, but I was hoping there was a more elegant way.

I assume that using a `View` is sufficient for not doing more sorting than necessary.

``````data.view.sortBy(measure).take(n).toList
``````

For some reason you need a `SeqView`, which is what you’ll get when calling `view` on a `List`. So as long as you’re actually working with a `List` or some other `Seq` you should be good to go.

I’d love to hear what that suggestion does? How does it find the minimum n elements without sorting the entire list?

I just did a small test, and I suspect that it doesn’t, which is a bit unfortunate.
I think it should be able to if it is sufficiently lazy.

What I’m currently using is the following, which speeds up my program by an order of magnitude. This sort of uses a binary search to find the n least elements. Although the code is pretty tedious, and is certain not to work in cases where n is large compared to the length of the list.

``````  def takeNsmallest[T](n:Int, data:List[T], measure:T=>Double):List[T] = {
val pivot :: tail = data
val reference = measure(pivot)
val (small, large) = tail.partition( measure(_) <= reference)
val len = small.length
if (len == n - 1)
pivot :: small
else if (len > n - 1)
takeNsmallest(n, small, measure)
else
pivot :: small ++ takeNsmallest(n - len - 1, large, measure)
}
``````

Expanding on this answer a bit. A `View` is lazy so it should in theory only compute the amount of elements that it needs at the point when you call `toList` (or something else that materializes it) on it. In the case of sorting I think there is a bit of a tradeoff between sorting efficiently (faster than O(n²)) and sorting lazily. In this implementation efficiency was chosen. Which, ironically, in your usecase is not very efficient.

yes, but even if it were perfectly lazy, you can’t can’t find the first N sorted elements without sorting. Right? I.e., the compare function is only called inside `sortBy`. Any element which the compare function didn’t get called with might be the smallest element.

In the most naive implementation, you could find the `n` smallest elements in a `m` sized `List` with `n` traversals. You don’t need to sort all `m` elements. But it would probably only be helpful when `n` is very very small in comparison to `m`.

If you do selection sort on a list of 10_000_000 elements but only need the 5 smallest, you can stop sorting after 5 iterations.

YIKES!!! my code is wrong. I’m using `span` rather than `partition`

I can never remember the names of those functions I want to partition the list into those `<= pivot` and those `> pivot`.

Good thing I wrong some test cases

Just a naive idea, untested and no idea how it compares to other approaches performance-wise.

``````def minSet[T : Ordering](data: Seq[T], count: Int): SortedSet[T] = {
import Ordering.Implicits._
def step(mins: TreeSet[T], cur: T): TreeSet[T] = {
if(mins.size < count)
mins + cur
else {
if(mins.last > cur)
mins - mins.last + cur
else
mins
}
}
data.foldLeft(TreeSet.empty[T])(step)
}
``````

This was more or less what I was thinking of:

``````import scala.collection.mutable.ArrayBuffer
import scala.collection.View

class SortedIterator[A](arr: ArrayBuffer[A], ord: Ordering[A]) extends Iterator[A] {
private[this] var pointer = 0
def next(): A = {
if (hasNext) {
var iMin = pointer
var min = arr(iMin)
var i = pointer
while (i < arr.length) {
if (ord.lt(arr(i), min)) {
iMin = i
min = arr(iMin)
}
i += 1
}
arr(iMin) = arr(pointer)
arr(pointer) = min
pointer += 1
min
}
else throw new NoSuchElementException()
}
def hasNext: Boolean = pointer < arr.length
override def knownSize: Int = arr.length - pointer
}

implicit class SortedSyntax[A](coll: collection.IterableOnce[A]) {
def sortedIterator(implicit ord: Ordering[A]): Iterator[A] = new SortedIterator(coll.iterator.to(ArrayBuffer), ord)
def sortedView(implicit ord: Ordering[A]): View[A] = new View[A] {
def iterator = sortedIterator
}
}
``````

Then `list.sortedIterator.take(5)` is noticably faster for a large enough `list`.
You obviously have to force the entire underlying collection, but the sorting itself happens lazily (and, obviously, not efficiently if you would sort the whole list like this).

@sangamon’s solution is more or less the way I would do it, provided that `mins.last` and `mins.size` are O(1). If not (I don’t know `TreeSet` well enough to be sure), I might maintain the current high-water mark and the current size as separate values, to ensure O(1) behavior for each `step` unless you want to add it to the collection.

Regardless of the details, sorting seems unnecessary and very expensive. Given a full List of length N, and seeking the M smallest, it seems like you should be able to do this in no worse than O(M*N) (and probably much better) if you do some variation of this solution, scanning the List once and maintaining separately the collection of smallest values…

At a quick glance, should be O(1) for `#size` and O(log n) for `#last`.

I’ve just thrown @jimka’s initial sort, @Jasper-M’s iterator and my fold at jmh.

``````Benchmark                  Mode  Cnt      Score     Error  Units
NMinBench.benchFold       thrpt   10  13923.095 ± 156.952  ops/s
NMinBench.benchIter       thrpt   10   5262.884 ±  34.256  ops/s
NMinBench.benchPlainSort  thrpt   10    749.731 ±   4.889  ops/s
``````

I still haven’t tested my implementation, yet, though.

`benchFold`, that must be `minSet` ?
Will `minSet` work with a `measure` function like `takeNsmallest` and `sortBy` ?
Is high score good? What is ops/s ?

How do they compare to `takeNsmallest`?

I think this could also be done with a heap data structure. `makeHeap` has complexity O(n), and `removeMin` can be done in O(log n), so that would be O(n + m log(n)) = O(n). Right?

@sangamon, are you sure these functions actually work? The minimum two elements of `List(1,2,1)` should be `List(1,1)`, not `List(1,2)`.

This looks to be an instance of the Partial Sort problem: https://en.wikipedia.org/wiki/Partial_sorting

Which, IIRC, the specialized quicksort works well.

There is an interesting discussion of how a laziness enables a (pseudo?) quicksort to do this auto-magically here: https://apfelmus.nfshost.com/articles/quicksearch.html

3 Likes

To the contrary, I’m sure I said twice that I haven’t really tested it.

Yes, using a sorted set, this approach doesn’t fly. Swapping out the set for a priority queue (as you already have suggested) works, though.

``````def minSet[T : Ordering](data: Seq[T], count: Int): Seq[T] = {
import Ordering.Implicits._
def step(mins: mut.PriorityQueue[T], cur: T): mut.PriorityQueue[T] = {
if(mins.size < count)
mins.enqueue(cur)
else {
mins.dequeue()
mins.enqueue(cur)
}
else
()
}
mins
}
data.foldLeft(new mut.PriorityQueue[T]())(step).toSeq
}
``````

Tested, this time.

``````forAll(listOf(arbitrary[Int]), posNum[Int]) { (data, count) =>
whenever(count >= 0) {
MinSet.minSet(data, count) should contain theSameElementsAs data.sorted.take(count)
}
}
``````

Yup.

Even better, it takes an `Ordering`.

Operations per second. Higher is better.

New run, including `takeNSmallest` and the fixed, priority queue based version of the “fold” approach.

``````Benchmark                  Mode  Cnt      Score     Error  Units
NMinBench.benchFold       thrpt   10  15913.725 ± 196.195  ops/s
NMinBench.benchIter       thrpt   10   4822.864 ±  43.467  ops/s
NMinBench.benchNSmallest  thrpt   10   3411.451 ± 647.105  ops/s
NMinBench.benchPlainSort  thrpt   10    740.088 ±  13.779  ops/s
``````

Code here.

This is something I’ve never really understood. How can I sort the same data using different measure functions using an `Ordering`? I.e., the order depends on the application, not the data itself. I.e., I might want to sort strings by length, or alphabetically, or alphabetically-case-independent, or alphabetically-ignoring-diacritical-marks. I understand how to do this with `sortWith`, but I don’t understand how to do with with an `Ordering`.

In my case I want to use the `takeNsmallest` function many times, each time sorting them by distance to a different center point. So with `sortWith` I provide a closure which measures two given points by comparing their distance to the enclosed center point. So the sorting measure function is determined at run-time.