Trouble Measuring `Vector.head` with JMH

ansvonwa · June 28, 2024, 2:20pm

Hi there!

for a project, I need to measure performance of some methods e.g. Vector.head.
There is Vectorbenchmark2.nvHead, which measures exactly that.
So, when I run this, I get:

$ sbt "bench/clean; bench/Jmh/run -i 1 -wi 0 -p size=10 scala.collection.immutable.VectorBenchmark2.nvHead"
[info] welcome to sbt 1.10.0 (Ubuntu Java 17.0.11)
...
[info] running (fork) org.openjdk.jmh.Main -i 1 -wi 0 -p size=10 scala.collection.immutable.VectorBenchmark2.nvHead
[info] # JMH version: 1.37
[info] # VM version: JDK 17.0.11, OpenJDK 64-Bit Server VM, 17.0.11+9-Ubuntu-122.04.1
[info] # VM invoker: /usr/lib/jvm/java-17-openjdk-amd64/bin/java
[info] # VM options: <none>
[info] # Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
[info] # Warmup: <none>
[info] # Measurement: 1 iterations, 10 s each
[info] # Timeout: 10 min per iteration
[info] # Threads: 1 thread, will synchronize iterations
[info] # Benchmark mode: Average time, time/op
[info] # Benchmark: scala.collection.immutable.VectorBenchmark2.nvHead
[info] # Parameters: (size = 10)
[info] # Run progress: 0.00% complete, ETA 00:00:20
[info] # Fork: 1 of 2
[info] Iteration   1: 19.205 ns/op
[info] # Run progress: 50.00% complete, ETA 00:00:10
[info] # Fork: 2 of 2
[info] Iteration   1: 19.533 ns/op
[info] Result "scala.collection.immutable.VectorBenchmark2.nvHead":
[info]   19.369 ns/op
...
[info] Benchmark                (size)  Mode  Cnt   Score   Error  Units
[info] VectorBenchmark2.nvHead      10  avgt    2  19.369          ns/op

(yeah, skipping warmup etc, but this does not matter for the point.)

So, nvHead takes supposedly ~19ns.

Also, when using other parameters for JMH, (e.g. -XX:-TieredCompilation) or adding @CompilerControl(CompilerControl.Mode.DONT_INLINE) or such, I get similar results, up to 40ns. (Even when on a completely different machine with an Intel Xeon CPU.)

So, where is the problem?

From looking at nvHead, we see that it calls coll.head 1000 times.
This is on my laptop with a Ryzen 7 PRO 4750U, of up to 1.7GHz. So the 19ns [resp. 40ns] are approximately 32 [resp. 68] CPU cycles for the 1000 method calls. This does not really make sense, so the whole call must have been optimized away, even though the result of .head is passed to the compiler blackhole.
Also, from SZeiger’s Results, we should expect something around 4-5ns per head call.

So, what am I missing?
How do I run the benchmark properly?

som-snytt · June 28, 2024, 7:55pm

I don’t know, but I’m up for my annual JMH refresher. I think last time I decided to use JMH regularly so I wouldn’t forget, and also to exercise several times a week, and both resolutions went in the same direction.

Using JDK 8 produces Zeiger-like numbers.

Consuming the loop variable, or using it in a conditional (such as, to test on even values), reduces optimization.

That’s as far as I dug today while waiting for other builds.

Worth adding that scalac optimizer probably does not matter in this case, but

setupPublishCore

is the local task that enables it. Probably it’s desirable to test the optimized library.

ansvonwa · June 29, 2024, 1:36am

Thank you, I guess I found roughly the problem and a workaround:
JMH seems to try to use a compiler-blackhole which somehow does not do what it should.
Setting -Djmh.blackhole.mode=FULL_DONTINLINE (either via set bench/javaOptions += "-Djmh.blackhole.autoDetect=false" or in an sbt file) sems to do the trick. Then, the “full” blackhole consisting of a proper class that actually consumes the values is used.

Instead of FULL_DONTINLINE, I could also use FULL, which gives sub-Zeiger results that are still plausible (at least if the inlined .head call together with the blackhole can be optimized down to 1 cycle).

Not sure which one is the more “realistic” variant.

Maybe we should set one of the variants in the build.sbt? Which?

Btw good hint with setupPublishCore, reminded me to enable optimization on my project.