Merging sorted lists

jimka · February 18, 2019, 2:50pm

When I look at the sequence and list documentation, I don’t see any method for inserting an element into a list assuming the list is already sorted (which should have linear complexity), or given two sorted lists (e.g, one of length 1 and one of length n) and merging the two lists (which again should have linear complexity). Am I missing something?

Of course I can implement such a function for myself, but if one already exists in the standard library, I’d rather not re-invent the wheel.

PeterE · February 23, 2019, 1:40pm

Concerning List, you can insert elements at a given index by patch. There exists no method to insert an element based on an ordering so you need to call indexWhere first.

However, if you need to keep your collection sorted, you better choose not List but a sorted collection like SortedSet unless your collection is small. You cannot insert duplicates into a set, though.

Jasper-M · February 23, 2019, 4:47pm

You should take a look at the Searching scaladoc.

jducoeur · February 23, 2019, 10:17pm

That’s really not obvious – it depends on the usage. Merging two already-sorted Lists is a nice cheap O(n) operation, so if random inserts (which are also a relatively expensive O(n)) are rare, List might be a good way to go.

jimka · February 24, 2019, 9:50am

Thanks for the suggestion. I also thought about that. But I’m not sure about the complexity of SortedSet. Is really kept sorted, or does it only sort when I try to look at it. For example, If I insert 10000 times, and read once, it should only sort once, not 10000 times. This is why I’m using List, so I can control when the sorting occurs.

Inserting one element into a sorted list has linear complexity. But sorting N elements has quadratic complexity, while sorting the list will have n.log(n) complexity.

Come to think about it, if the SortedSet has the property that insertion has log(n) complexity, then that might indeed be better for my needs, because inserting many elements would be no slower than re-sorting the list afterwards.

This distinction is actually important for my application, because the lists might be HUGE in reality, even if the test cases are small.

tarsa · February 24, 2019, 10:37am

Collections’ performance is summarized here: https://docs.scala-lang.org/overviews/collections/performance-characteristics.html (although it’s not written to which Scala version this table applies)

I would say the constant complexity factor in adding elements one by one to SortedSet is much higher than constructing a whole List and then sorting it in one go. But that’s only one very specific scenario. If there are many more sorting phases then keeping the collection always sorted should be more efficient. If performance is crucial for you - benchmark it! Use e.g.: https://github.com/ktoso/sbt-jmh for benchmarking.

LannyRipple · February 26, 2019, 2:37pm

If you just need to dump all your data into a box and get it out in order (i.e., you don’t need to look in the box until the end) be sure and check out heap solutions. There’s also scala.util.Sorting if you follow @tarsa’s advice and just sort once at the end.

jimka · February 26, 2019, 2:47pm

@LannyRipple, thanks for the suggestion. However, in my case that interesting simplification does not apply. In my case I need to do some additions, but not all in one place, rather at various points during a long computations, but from time to time, I need to read back out the items, and to assure they are in sorted order.

If you are interested, the actual algorithm is the Quine-McCluskey Boolean simplification algorithm, for reducing n-variable Boolean equations efficiently. The QM method makes certain classes of Boolean functions reducible linearly to quadratic, by breaking up a large list of Boolean clauses in a many small groups, each of which can then be operated on in quadratic or n.log time. After the small groupings are done, “adjacent” groups are compared quadratically, and elements are removed, reduced, and added to other groups. The iteration (exchange between groups) continues until a fixed point is reached.

szeiger · February 26, 2019, 3:03pm

BTW, any kind of explicit sorting that is provided by the standard library is done with arrays. Sorting an array or a mutable array-based collection in place has the lowest overhead. Sorting an immutale.ArraySeq is still quite good because it’s only one additional block copy for the array. Sorting any other data structure like a list is done by copying all elements to an array, sorting the array, and then building a new list.

PeterE · March 2, 2019, 9:35am

immutable.SortedSet is implemented by https://www.scala-lang.org/api/current/scala/collection/immutable/TreeSet.html while
mutable.SortedSet by https://www.scala-lang.org/api/current/scala/collection/mutable/TreeSet.html

Both are specialized trees so inserting values is predictably fast: it just involves a tree traversal of at most depth 7 or comparably low. Sure significantly faster than inserting into a big List.