Hi there!
for a project, I need to measure performance of some methods e.g. Vector.head
.
There is Vectorbenchmark2.nvHead
, which measures exactly that.
So, when I run this, I get:
$ sbt "bench/clean; bench/Jmh/run -i 1 -wi 0 -p size=10 scala.collection.immutable.VectorBenchmark2.nvHead"
[info] welcome to sbt 1.10.0 (Ubuntu Java 17.0.11)
...
[info] running (fork) org.openjdk.jmh.Main -i 1 -wi 0 -p size=10 scala.collection.immutable.VectorBenchmark2.nvHead
[info] # JMH version: 1.37
[info] # VM version: JDK 17.0.11, OpenJDK 64-Bit Server VM, 17.0.11+9-Ubuntu-122.04.1
[info] # VM invoker: /usr/lib/jvm/java-17-openjdk-amd64/bin/java
[info] # VM options: <none>
[info] # Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
[info] # Warmup: <none>
[info] # Measurement: 1 iterations, 10 s each
[info] # Timeout: 10 min per iteration
[info] # Threads: 1 thread, will synchronize iterations
[info] # Benchmark mode: Average time, time/op
[info] # Benchmark: scala.collection.immutable.VectorBenchmark2.nvHead
[info] # Parameters: (size = 10)
[info] # Run progress: 0.00% complete, ETA 00:00:20
[info] # Fork: 1 of 2
[info] Iteration 1: 19.205 ns/op
[info] # Run progress: 50.00% complete, ETA 00:00:10
[info] # Fork: 2 of 2
[info] Iteration 1: 19.533 ns/op
[info] Result "scala.collection.immutable.VectorBenchmark2.nvHead":
[info] 19.369 ns/op
...
[info] Benchmark (size) Mode Cnt Score Error Units
[info] VectorBenchmark2.nvHead 10 avgt 2 19.369 ns/op
(yeah, skipping warmup etc, but this does not matter for the point.)
So, nvHead
takes supposedly ~19ns.
Also, when using other parameters for JMH, (e.g. -XX:-TieredCompilation
) or adding @CompilerControl(CompilerControl.Mode.DONT_INLINE)
or such, I get similar results, up to 40ns. (Even when on a completely different machine with an Intel Xeon CPU.)
So, where is the problem?
From looking at nvHead
, we see that it calls coll.head
1000 times.
This is on my laptop with a Ryzen 7 PRO 4750U, of up to 1.7GHz. So the 19ns [resp. 40ns] are approximately 32 [resp. 68] CPU cycles for the 1000 method calls. This does not really make sense, so the whole call must have been optimized away, even though the result of .head
is passed to the compiler blackhole.
Also, from SZeiger’s Results, we should expect something around 4-5ns per head
call.
So, what am I missing?
How do I run the benchmark properly?