Find the N minimum elements

jimka · November 28, 2019, 12:35pm

Can someone suggest the fastest way to take the N minimum elements from a List, assuming the length of the list is much greater than N. In my case N=10 and the list length is about 8500.

Currently I’m sorting the array and calling take(10), but it turns out this loop is in the critical path of my program.

And by the way, I’m referring to minimum w.r.t. a measure function.

data.sortBy(measure).take(n)

I’m tempted to copy the list into a mutable Array, and then implement in-place selectionSort which stops after it has swapped N elements, but I was hoping there was a more elegant way.

Jasper-M · November 28, 2019, 12:49pm

I assume that using a View is sufficient for not doing more sorting than necessary.

data.view.sortBy(measure).take(n).toList

For some reason you need a SeqView, which is what you’ll get when calling view on a List. So as long as you’re actually working with a List or some other Seq you should be good to go.

jimka · November 28, 2019, 12:59pm

I’d love to hear what that suggestion does? How does it find the minimum n elements without sorting the entire list?

Jasper-M · November 28, 2019, 1:01pm

I just did a small test, and I suspect that it doesn’t, which is a bit unfortunate.
I think it should be able to if it is sufficiently lazy.

jimka · November 28, 2019, 1:01pm

What I’m currently using is the following, which speeds up my program by an order of magnitude. This sort of uses a binary search to find the n least elements. Although the code is pretty tedious, and is certain not to work in cases where n is large compared to the length of the list.

  def takeNsmallest[T](n:Int, data:List[T], measure:T=>Double):List[T] = {
    val pivot :: tail = data
    val reference = measure(pivot)
    val (small, large) = tail.partition( measure(_) <= reference)
    val len = small.length
    if (len == n - 1) 
      pivot :: small
    else if (len > n - 1) 
      takeNsmallest(n, small, measure)
    else 
      pivot :: small ++ takeNsmallest(n - len - 1, large, measure)
  }

Jasper-M · November 28, 2019, 1:13pm

Expanding on this answer a bit. A View is lazy so it should in theory only compute the amount of elements that it needs at the point when you call toList (or something else that materializes it) on it. In the case of sorting I think there is a bit of a tradeoff between sorting efficiently (faster than O(n²)) and sorting lazily. In this implementation efficiency was chosen. Which, ironically, in your usecase is not very efficient.

jimka · November 28, 2019, 1:18pm

yes, but even if it were perfectly lazy, you can’t can’t find the first N sorted elements without sorting. Right? I.e., the compare function is only called inside sortBy. Any element which the compare function didn’t get called with might be the smallest element.

Jasper-M · November 28, 2019, 1:26pm

In the most naive implementation, you could find the n smallest elements in a m sized List with n traversals. You don’t need to sort all m elements. But it would probably only be helpful when n is very very small in comparison to m.

If you do selection sort on a list of 10_000_000 elements but only need the 5 smallest, you can stop sorting after 5 iterations.

jimka · November 28, 2019, 1:39pm

YIKES!!! my code is wrong. I’m using span rather than partition

I can never remember the names of those functions I want to partition the list into those <= pivot and those > pivot.

Good thing I wrong some test cases

sangamon · November 28, 2019, 2:50pm

Just a naive idea, untested and no idea how it compares to other approaches performance-wise.

def minSet[T : Ordering](data: Seq[T], count: Int): SortedSet[T] = {
  import Ordering.Implicits._
  def step(mins: TreeSet[T], cur: T): TreeSet[T] = {
    if(mins.size < count)
      mins + cur
    else {
      if(mins.last > cur)
        mins - mins.last + cur
      else
        mins
    }
  }
  data.foldLeft(TreeSet.empty[T])(step)
}

Jasper-M · November 28, 2019, 4:09pm

This was more or less what I was thinking of:

import scala.collection.mutable.ArrayBuffer
import scala.collection.View

class SortedIterator[A](arr: ArrayBuffer[A], ord: Ordering[A]) extends Iterator[A] {
  private[this] var pointer = 0
  def next(): A = {
    if (hasNext) {
      var iMin = pointer
      var min = arr(iMin)
      var i = pointer
      while (i < arr.length) {
        if (ord.lt(arr(i), min)) {
          iMin = i
          min = arr(iMin)
        }
        i += 1
      }
      arr(iMin) = arr(pointer)
      arr(pointer) = min
      pointer += 1
      min
    }
    else throw new NoSuchElementException()
  }
  def hasNext: Boolean = pointer < arr.length
  override def knownSize: Int = arr.length - pointer
}

implicit class SortedSyntax[A](coll: collection.IterableOnce[A]) { 
  def sortedIterator(implicit ord: Ordering[A]): Iterator[A] = new SortedIterator(coll.iterator.to(ArrayBuffer), ord) 
  def sortedView(implicit ord: Ordering[A]): View[A] = new View[A] {
    def iterator = sortedIterator
  }
}

Then list.sortedIterator.take(5) is noticably faster for a large enough list.
You obviously have to force the entire underlying collection, but the sorting itself happens lazily (and, obviously, not efficiently if you would sort the whole list like this).

jducoeur · November 28, 2019, 6:42pm

@sangamon’s solution is more or less the way I would do it, provided that mins.last and mins.size are O(1). If not (I don’t know TreeSet well enough to be sure), I might maintain the current high-water mark and the current size as separate values, to ensure O(1) behavior for each step unless you want to add it to the collection.

Regardless of the details, sorting seems unnecessary and very expensive. Given a full List of length N, and seeking the M smallest, it seems like you should be able to do this in no worse than O(M*N) (and probably much better) if you do some variation of this solution, scanning the List once and maintaining separately the collection of smallest values…

sangamon · November 28, 2019, 6:56pm

At a quick glance, should be O(1) for #size and O(log n) for #last.

I’ve just thrown @jimka’s initial sort, @Jasper-M’s iterator and my fold at jmh.

Benchmark                  Mode  Cnt      Score     Error  Units
NMinBench.benchFold       thrpt   10  13923.095 ± 156.952  ops/s
NMinBench.benchIter       thrpt   10   5262.884 ±  34.256  ops/s
NMinBench.benchPlainSort  thrpt   10    749.731 ±   4.889  ops/s

I still haven’t tested my implementation, yet, though.

jimka · November 28, 2019, 10:05pm

benchFold, that must be minSet ?
Will minSet work with a measure function like takeNsmallest and sortBy ?
Is high score good? What is ops/s ?

How do they compare to takeNsmallest?

jimka · November 28, 2019, 10:30pm

I think this could also be done with a heap data structure. makeHeap has complexity O(n), and removeMin can be done in O(log n), so that would be O(n + m log(n)) = O(n). Right?

jimka · November 29, 2019, 7:27am

@sangamon, are you sure these functions actually work? The minimum two elements of List(1,2,1) should be List(1,1), not List(1,2).

coreyoconnor · November 29, 2019, 8:03am

This looks to be an instance of the Partial Sort problem: https://en.wikipedia.org/wiki/Partial_sorting

Which, IIRC, the specialized quicksort works well.

There is an interesting discussion of how a laziness enables a (pseudo?) quicksort to do this auto-magically here: https://apfelmus.nfshost.com/articles/quicksearch.html

sangamon · November 29, 2019, 11:14am

To the contrary, I’m sure I said twice that I haven’t really tested it.

Yes, using a sorted set, this approach doesn’t fly. Swapping out the set for a priority queue (as you already have suggested) works, though.

def minSet[T : Ordering](data: Seq[T], count: Int): Seq[T] = {
  import Ordering.Implicits._
  def step(mins: mut.PriorityQueue[T], cur: T): mut.PriorityQueue[T] = {
    if(mins.size < count)
      mins.enqueue(cur)
    else {
      if(mins.head > cur) {
        mins.dequeue()
        mins.enqueue(cur)
      }
      else
        ()
    }
    mins
  }
  data.foldLeft(new mut.PriorityQueue[T]())(step).toSeq
}

Tested, this time.

forAll(listOf(arbitrary[Int]), posNum[Int]) { (data, count) =>
  whenever(count >= 0) {
    MinSet.minSet(data, count) should contain theSameElementsAs data.sorted.take(count)
  }
}

sangamon · November 29, 2019, 11:26am

Yup.

Even better, it takes an Ordering.

Operations per second. Higher is better.

New run, including takeNSmallest and the fixed, priority queue based version of the “fold” approach.

Benchmark                  Mode  Cnt      Score     Error  Units
NMinBench.benchFold       thrpt   10  15913.725 ± 196.195  ops/s
NMinBench.benchIter       thrpt   10   4822.864 ±  43.467  ops/s
NMinBench.benchNSmallest  thrpt   10   3411.451 ± 647.105  ops/s
NMinBench.benchPlainSort  thrpt   10    740.088 ±  13.779  ops/s

Code here.

jimka · December 1, 2019, 7:37am

This is something I’ve never really understood. How can I sort the same data using different measure functions using an Ordering? I.e., the order depends on the application, not the data itself. I.e., I might want to sort strings by length, or alphabetically, or alphabetically-case-independent, or alphabetically-ignoring-diacritical-marks. I understand how to do this with sortWith, but I don’t understand how to do with with an Ordering.

In my case I want to use the takeNsmallest function many times, each time sorting them by distance to a different center point. So with sortWith I provide a closure which measures two given points by comparing their distance to the enclosed center point. So the sorting measure function is determined at run-time.