Sorting performance

Russ · February 2, 2024, 9:40pm

Suppose I have a Vector of tuples of the form

Vector[(Double, Double)]

and I need them sorted by the first “Double” as would be done with

vector.sortBy(_._1)

This operation will be done many times, and each Vector could be fairly long (hundreds or thousands of elements), so I am concerned about performance.

I am wondering about performance for the following cases:

The Vector was constructed in sorted order. Is the cost of sorting more than the cost of checking whether it is already in sorted order?
The Vector is in reverse order.
The Vector is in random order.

tarsa · February 2, 2024, 10:20pm

i think Timsort - Wikipedia does well on (nearly) sorted or reversed inputs and also isn’t slow on random inputs. it’s used in java: jdk/src/java.base/share/classes/java/util/TimSort.java at master · openjdk/jdk · GitHub

therefore you can convert scala vector to java array, sort the array using timsort from java and in the end convert java array back to scala vector.

Ichoran · February 2, 2024, 10:43pm

And this is in fact what Vector does under the hood: converts to an array, uses Java’s array sort (which is TimSort on object arrays), and packs back into a Vector. This makes every sort (at least) O(n), even the ones where it’s already perfectly sorted. (So, yes, you should check first before sorting if there’s a sizable chance that it’s already sorted.)

If you need to do very many updates, then it is probably better to keep it in permanently sorted order using a TreeSet[(Double, Double)] where every single-element update has log(N) cost.

(Note: missed the “at least” originally, which was completely incorrect since sorting takes at least O(n log n) in general.)

Russ · February 2, 2024, 10:53pm

Thanks. You said every sort is O(n), but I should check before sorting if there is a good chance it is already sorted. Are you implying that the cost of checking is less than O(n)? If so, what is it? And does Vector have a method for checking?

tarsa · February 3, 2024, 12:09am

how?
timsort has O(n lg n) worst and average case and O(n) best (nearly sorted, reverse sorted, etc) case.

Ichoran · February 3, 2024, 12:51am

Ack, what I meant was that the minimum time is O(n) even when there’s no work to do. Of course the typical time is O(n log n), and with TimSort it’s also the worst time as far as I know.

Ichoran · February 3, 2024, 12:56am

Checking is O(n). If you want it fast, get an iterator and do it manually.

if xs.length > 1 then:
  val i = xs.iterator
  var x = i.next
  var nondecreasing = true
  while nondecreasing && i.hasNext:
    val y = i.next
    nondecreasing = x <= y
    x = y
  nondecreasing
else
  true

There are functional ways to do it, but they’re slower.

Russ · February 3, 2024, 2:13am

Yes, I should have realized that checking is O(n) since I wrote one myself! It’s just a matter of stepping through and checking each element against the previous one. I guess it could be parallelized by breaking it up into groups, which might make sense for a very long Vector or List.

Russ · February 3, 2024, 5:55am

Let’s say one needs a sorted list as a starting point for a particular algorithm. That could be a requirement on input to the algorithm, or it could be done as the first step of the algorithm. The choice comes down to a trafeoff between safety and efficiency. Sorting it as the first step of the algorithm is safer, but it is less efficient if the input was already sorted.

One could check the input to determine if it is sorted, then sort it if it isn’t, but it seems that that would be less efficient that just sorting it. In other words, just sorting it would take no longer than checking first, then sorting only if necessary. Is that right, or am I missing something?

spamegg1 · February 3, 2024, 7:08am

Just sorting is O(n log(n)).
Checking is O(n).
Checking first, then sorting if necessary is:

O(n) if it’s already sorted,
O(n) + O(n log(n)) = O(n log(n)) if it’s not already sorted.

And… O(n) is smaller than O(n log(n)). But not by much for “small” input.
Would it make a noticeable difference for a few thousand elements? Not sure, probably not? For hundreds of thousands of elements? Yes. You might want to time some typical runs.

In order to “save significant time by checking first” you either have to be clairvoyant, or you need some statistics on how often your lists are “already sorted” so that “taking a chance and checking it but not sorting it because it’s already sorted” is likely to pay off big.

So… gather some statistics I guess! How often is it already sorted?

But… probably don’t do any of this. Ichoran’s suggestion of TreeSet is much better!

devlaam · February 3, 2024, 10:49am

Adding to all sensible remarks already made, the following. Such a data structure does not magically appear out of nowhere in RAM. If you have control over the location where it enters, usually that is place to do the sorting, because you have to take your punishment of O(n) for excepting each element anyway, sorting takes at most an extra O(log(n)). So at the end you still have O(n.log(n)), but with the least possible prefactor.

tarsa · February 3, 2024, 12:39pm

read about timsort. its complexity gets lower the more presorted the input is. if the input is already sorted (or reverse sorted) then the number of comparisons in timsort is O(n), so if you’re checking by yourself then you’ll gain no complexity order win (except maybe a complexity factor, because you could avoid converting vector to array and back).

timsort is a mergesort with uneven partitions. it detects sorted (and reverse sorted) subsequences in input and treats them as already sorted partitions. the more presortedness you have in the input, the less merge steps (and thus comparisons) are made. you can check that yourself by having a comparator that counts its invocations. for random input you should have n * lg n * small_factor comparisons, while for mostly sorted input (e.g. whole input is sorted except extra 100 random elements at the end) you should have n * small_factor comparisons.

tarsa · February 3, 2024, 12:45pm

tuples are (identity) objects (i.e. they are independently allocated on heap), so you pay performance penalty (memory access latency) for indirection when accessing them. you could theoretically make your own specialized sorting algorithm e.g. introsort for sorting your data, one that would accept a flat Array[Double] and treat pairs of numbers as one item (from sorting perspective). but that’s probably too big maintenance cost.

i’m not sure how parallelized (in scala parallel collections) are the conversions between vector and array. from quick look it seems that converting vector to array is parallelized, while converting array to vector isn’t. i may be wrong though.

java, however, has parallel sort already:
https://docs.oracle.com/javase/8/docs/api/java/util/Arrays.html#parallelSort-T:A-java.util.Comparator-

if your collections are of size of just hundreds of elements or few thousands then perhaps parallelization wouldn’t bring much benefit. inter thread synchronization has its cost and when a partition gets small enough it’s sorted sequentially anyway.

Ichoran · February 3, 2024, 7:48pm

TimSort is O(n) with sorted or nearly-sorted inputs.

But sorting in the best case is O(n) with a higher constant factor than checking, because the Vector sort algorithm is: copy everything to an Array, use TimSort (via java.util.Arrays.sort), build a new Vector with the answer.

So if you expect that checking will often reveal that the list is sorted, do your own check first.

If checking will only rarely reveal an already-sorted list, don’t bother. Just sort. It will be a more expensive O(n) in the rare case where it’s sorted, but it will still only be O(n).

Russ · February 3, 2024, 8:03pm

Thanks for that clarification. The input Vector is normally in sorted order when it is created. However, I recently called the algorithm with an input that was in reverse order. Not realizing that that would not work, I ended up spending a lot of time tracking down the source of the problem. I just want to be sure it doesn’t happen again in the future – long after I’ve forgotten about the issue. More importantly, I want to eliminate the chance of the error going undetected.

Based on what you are telling me, I will check before sorting (or perhaps throw an exception to alert the caller to sort before calling). After the code is tested, the check could be disabled for long production runs I suppose.