Can someone suggest the fastest way to take the N minimum elements from a List, assuming the length of the list is much greater than N. In my case N=10 and the list length is about 8500.

Currently I’m sorting the array and calling take(10), but it turns out this loop is in the critical path of my program.

And by the way, I’m referring to minimum w.r.t. a measure function.

data.sortBy(measure).take(n)

I’m tempted to copy the list into a mutable Array, and then implement in-place selectionSort which stops after it has swapped N elements, but I was hoping there was a more elegant way.

I assume that using a View is sufficient for not doing more sorting than necessary.

data.view.sortBy(measure).take(n).toList

For some reason you need a SeqView, which is what you’ll get when calling view on a List. So as long as you’re actually working with a List or some other Seq you should be good to go.

What I’m currently using is the following, which speeds up my program by an order of magnitude. This sort of uses a binary search to find the n least elements. Although the code is pretty tedious, and is certain not to work in cases where n is large compared to the length of the list.

def takeNsmallest[T](n:Int, data:List[T], measure:T=>Double):List[T] = {
val pivot :: tail = data
val reference = measure(pivot)
val (small, large) = tail.partition( measure(_) <= reference)
val len = small.length
if (len == n - 1)
pivot :: small
else if (len > n - 1)
takeNsmallest(n, small, measure)
else
pivot :: small ++ takeNsmallest(n - len - 1, large, measure)
}

Expanding on this answer a bit. A View is lazy so it should in theory only compute the amount of elements that it needs at the point when you call toList (or something else that materializes it) on it. In the case of sorting I think there is a bit of a tradeoff between sorting efficiently (faster than O(n²)) and sorting lazily. In this implementation efficiency was chosen. Which, ironically, in your usecase is not very efficient.

yes, but even if it were perfectly lazy, you can’t can’t find the first N sorted elements without sorting. Right? I.e., the compare function is only called insidesortBy. Any element which the compare function didn’t get called with might be the smallest element.

In the most naive implementation, you could find the n smallest elements in a m sized List with n traversals. You don’t need to sort all m elements. But it would probably only be helpful when n is very very small in comparison to m.

If you do selection sort on a list of 10_000_000 elements but only need the 5 smallest, you can stop sorting after 5 iterations.

import scala.collection.mutable.ArrayBuffer
import scala.collection.View
class SortedIterator[A](arr: ArrayBuffer[A], ord: Ordering[A]) extends Iterator[A] {
private[this] var pointer = 0
def next(): A = {
if (hasNext) {
var iMin = pointer
var min = arr(iMin)
var i = pointer
while (i < arr.length) {
if (ord.lt(arr(i), min)) {
iMin = i
min = arr(iMin)
}
i += 1
}
arr(iMin) = arr(pointer)
arr(pointer) = min
pointer += 1
min
}
else throw new NoSuchElementException()
}
def hasNext: Boolean = pointer < arr.length
override def knownSize: Int = arr.length - pointer
}
implicit class SortedSyntax[A](coll: collection.IterableOnce[A]) {
def sortedIterator(implicit ord: Ordering[A]): Iterator[A] = new SortedIterator(coll.iterator.to(ArrayBuffer), ord)
def sortedView(implicit ord: Ordering[A]): View[A] = new View[A] {
def iterator = sortedIterator
}
}

Then list.sortedIterator.take(5) is noticably faster for a large enough list.
You obviously have to force the entire underlying collection, but the sorting itself happens lazily (and, obviously, not efficiently if you would sort the whole list like this).

@sangamon’s solution is more or less the way I would do it, provided that mins.last and mins.size are O(1). If not (I don’t know TreeSet well enough to be sure), I might maintain the current high-water mark and the current size as separate values, to ensure O(1) behavior for each step unless you want to add it to the collection.

Regardless of the details, sorting seems unnecessary and very expensive. Given a full List of length N, and seeking the M smallest, it seems like you should be able to do this in no worse than O(M*N) (and probably much better) if you do some variation of this solution, scanning the List once and maintaining separately the collection of smallest values…

I think this could also be done with a heap data structure. makeHeap has complexity O(n), and removeMin can be done in O(log n), so that would be O(n + m log(n)) = O(n). Right?

This is something I’ve never really understood. How can I sort the same data using different measure functions using an Ordering? I.e., the order depends on the application, not the data itself. I.e., I might want to sort strings by length, or alphabetically, or alphabetically-case-independent, or alphabetically-ignoring-diacritical-marks. I understand how to do this with sortWith, but I don’t understand how to do with with an Ordering.

In my case I want to use the takeNsmallest function many times, each time sorting them by distance to a different center point. So with sortWith I provide a closure which measures two given points by comparing their distance to the enclosed center point. So the sorting measure function is determined at run-time.