The simplest, dumbest threading model

This should be an easy one … I have three steaming lumps of glorious imperative code that populate a cache based on entries already in the cache.

These lumps are essentially an imperative sequence of C-style nested loops over integers that pull entries out of the cache for certain choices of keys and push the results back into the cache under derived keys.

This is not a simple Cartesian product of key indices, though.

So:

LOOP 1 > LOOP 1.1 > LOOP 1.2;
LOOP 2 > LOOP 2.1 > LOOP 2.2;
LOOP 3 > LOOP 3.1 > LOOP 3.2

makes one lump, and there are three lumps that work on completely distinct families of cache keys.

So no results are shared between the computations done in the lumps.

Each loop can work on circa 100 000 items. Heap and stack usage are bounded in each lump.

The cache is essentially a wrapper around two arrays. It couldn’t be more like C if it tried. There are no concurrency locks, as arrays are used and the keys don’t clash.
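To make the shape concrete, here’s a minimal sketch of the cache and the first lump, assuming Int keys and Long values; every name in it (FlatCache, lumpOne, derive, combine) and the loop bounds are made up for illustration:

```scala
// Hypothetical stand-in for the real cache: a wrapper around two plain
// arrays, with no locks, because the lumps write to disjoint key families.
final class FlatCache(size: Int) {
  private val values   = new Array[Long](size)
  private val occupied = new Array[Boolean](size)

  def apply(key: Int): Long       = values(key)
  def contains(key: Int): Boolean = occupied(key)

  def update(key: Int, value: Long): Unit = {
    values(key) = value
    occupied(key) = true
  }
}

// Placeholder key derivation and computation, standing in for the real ones.
def derive(i: Int, j: Int, k: Int): Int = i + j + k
def combine(a: Long, b: Long): Long     = a + b

// LOOP 1 > LOOP 1.1 > LOOP 1.2: one lump. The inner bounds depend on the
// outer indices, so this is not a plain Cartesian product of key indices.
def lumpOne(cache: FlatCache, n: Int): Unit = {
  var i = 0
  while (i < n) {        // LOOP 1
    var j = 0
    while (j <= i) {     // LOOP 1.1
      var k = 0
      while (k <= j) {   // LOOP 1.2
        cache(derive(i, j, k)) = combine(cache(j), cache(k))
        k += 1
      }
      j += 1
    }
    i += 1
  }
}

// The other two lumps are analogous, but read and write entirely
// disjoint key families; stubbed out here.
def lumpTwo(cache: FlatCache, n: Int): Unit   = ()
def lumpThree(cache: FlatCache, n: Int): Unit = ()
```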

It has to run as fast as possible on a single desktop computer.

So if I wanted to run the lumps in parallel with as little overhead as possible, what would be the obvious way?

I could do the Cats thing and use IO, but that seems like overkill - I don’t need the cancellable model, and as memory isn’t an issue for each lump, I’m not concerned about a computation running wild or having side effects beyond populating the cache or throwing exceptions. Think computation threads rather than asynchronous effects.

I do want to avoid spinning up threads via the Java API for each lump directly - there is an outer loop around the whole thing, and thread pooling is already in use elsewhere, so I’d like to join an existing thread pool.

I’m not terribly keen on going the Executor / ExecutorService route, so my current thinking for tomorrow is to make three Future instances and gang them together as an applicative functor to get parallel execution.
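Concretely, that would be along these lines (a sketch reusing the hypothetical lumps from above; the implicit ExecutionContext stands in for the existing pool, and mapN is Cats’ applicative syntax):

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import cats.implicits._ // mapN; Apply[Future] needs the implicit ec below

// Stand-ins: the application's existing pool, cache, and problem size.
implicit val ec: ExecutionContext = ExecutionContext.global
val cache = new FlatCache(1 << 20)
val n     = 100000

// Construct the three futures eagerly, so all three lumps start at once ...
val first  = Future(lumpOne(cache, n))
val second = Future(lumpTwo(cache, n))
val third  = Future(lumpThree(cache, n))

// ... then gang them together applicatively and block until all are done.
Await.result((first, second, third).mapN((_, _, _) => ()), Duration.Inf)
```

The detail that matters is constructing the Futures before combining them - Future is eager, so all three lumps are already running by the time mapN sequences them.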

Any suggestions as to a more direct and performant approach?


A self-answer - ForkJoinPool. I believe Future runs on one anyway (the default ExecutionContext.global is backed by a ForkJoinPool), but perhaps there’s a saving to be had in accessing it directly.
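Going at the pool directly would be something like this sketch (again with the hypothetical lumps, cache, and n from above; the common pool stands in for an application-owned one):

```scala
import java.util.concurrent.ForkJoinPool

val pool = ForkJoinPool.commonPool()

// Fire off the three lumps as plain Runnables ...
val tasks = List(
  pool.submit(new Runnable { def run(): Unit = lumpOne(cache, n) }),
  pool.submit(new Runnable { def run(): Unit = lumpTwo(cache, n) }),
  pool.submit(new Runnable { def run(): Unit = lumpThree(cache, n) })
)

// ... and block until all three have finished.
tasks.foreach(_.join())
```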

I’ll give them both a try and benchmark.


The epilogue…

I went for using Future, and the results were great.

These are from a pathologically long example merge I use to do manual testing:

The dynamic programming algorithm used by Kinetic Merge went from around 25 seconds prior to being prepared for parallelisation …

… to around 17 seconds once the aforementioned horrible C-style implementation went in …

… stayed at 17 seconds when I refactored the mess to make it tractable again (to my considerable surprise) …

… then went down to 6 seconds once Future was used to parallelise the lumps …

… getting to 5 seconds once IO was used to stitch together the Future instances (I’m too stupid to write a catamorphism for Future, and Monad.whileM_ doesn’t get on well with Future).
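For the curious, the IO stitching amounts to something like this sketch (assuming cats-effect 3 and the hypothetical lumps, cache, and n from earlier; how Kinetic Merge actually wires it up may differ):

```scala
import cats.effect.IO
import cats.effect.unsafe.implicits.global // IORuntime for unsafeRunSync
import cats.implicits._                    // parSequence
import scala.concurrent.{ExecutionContext, Future}

implicit val ec: ExecutionContext = ExecutionContext.global // stand-in pool

// Lift a lump into IO: the Future is created - and the lump starts -
// only when the IO is actually run.
def lumpIO(lump: => Unit): IO[Unit] =
  IO.fromFuture(IO(Future(lump)))

// Run all three lumps in parallel and wait for the lot.
List(
  lumpIO(lumpOne(cache, n)),
  lumpIO(lumpTwo(cache, n)),
  lumpIO(lumpThree(cache, n))
).parSequence.void.unsafeRunSync()
```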

One for @SethTisue to celebrate.

I shall close 2024 by making a release, and wish you all a Happy New Year.

Guten Rutsch! (A good slide into the new year!)
