JMH measurement problem for a specific case (impossible throughput results; no respect to the iteration duration limit)

Aleksey Shipilev shade at redhat.com
Tue May 21 10:32:12 UTC 2019


On 5/21/19 5:05 AM, Valentin Kovalenko wrote:
> @Benchmark
> public final void writeLockSleep(final BenchmarkState s) throws
> InterruptedException {//BenchmarkState is Scope.Benchmark
>   final ReentrantReadWriteLock.WriteLock wLock = s.rwLock.writeLock();
>   wLock.lock();
>   try {
>     Thread.sleep(1000);
>   } finally {
>     wLock.unlock();
>   }
> }
> 
> Since the write lock is shared between all benchmark threads
> (BenchmarkState is annotated with @Scope.Benchmark) the throughput of this
> benchmark can't be greater than 1 ops/s. However, not according to JMH:

Let's simplify it a whole lot and add an equivalent logging patch to the JMH infra, which shows
the JMH perspective on this workload: http://cr.openjdk.java.net/~shade/jmh/lock-perf.patch

If you run it...

$ java -jar jmh-samples/target/benchmarks.jar TmpTest.writeLock -t 2 -f 1 -w 1s -r 10s -si false
# JMH version: 1.22-SNAPSHOT
# VM version: JDK 1.8.0_212, OpenJDK 64-Bit Server VM, 25.212-b03
# VM invoker: /home/shade/.sdkman/candidates/java/8.0.212.hs-adpt/jre/bin/java
# VM options: <none>
# Warmup: 5 iterations, 1 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 2 threads, ***WARNING: Synchronize iterations are disabled!***
# Benchmark mode: Throughput, ops/time
# Benchmark: org.openjdk.jmh.samples.TmpTest.writeLockSleep

# Run progress: 0.00% complete, ETA 00:00:55
# Fork: 1 of 1
# Warmup Iteration   1:
Thread[org.openjdk.jmh.samples.TmpTest.writeLockSleep-jmh-worker-1,5,main] Raw Results: allOps=1,
measuredOps=1, time=2000
Thread[org.openjdk.jmh.samples.TmpTest.writeLockSleep-jmh-worker-2,5,main] Raw Results: allOps=1,
measuredOps=1, time=1000
1.500 ops/s

See what happened here? The second thread did 1 measured invocation, spent 1000 ms there, got the
termination signal, and exited the measurement. Once it did, the first thread was able to acquire
the lock and do its own operation (sleep), which took 2000 ms total (both waiting for the lock
acquisition and doing its own op). The overall throughput is (1 op / 2000 ms) = 0.5 ops/s from the
first thread, plus (1 op / 1000 ms) = 1 ops/s from the second thread, yielding the reported total.
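The arithmetic behind that 1.500 ops/s figure can be checked by hand; a trivial standalone sketch (my own, not part of the patch above) that sums per-thread throughputs computed over each thread's own measured interval:

```java
public class ThroughputMath {
    // Per-thread throughput is measuredOps / time; the reported total is their sum.
    static double totalThroughput(double opsA, double secondsA,
                                  double opsB, double secondsB) {
        return opsA / secondsA + opsB / secondsB;
    }

    public static void main(String[] args) {
        // worker-1: 1 op over 2000 ms; worker-2: 1 op over 1000 ms
        System.out.println(totalThroughput(1, 2.0, 1, 1.0)); // prints 1.5
    }
}
```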

The pitfall in this experiment is the relativity of the termination protocol: you might think the
throughput should be 1 ops/s, because the iteration "stops" momentarily in all threads, but that
premise is incorrect. Threads react to the termination signal at different times.

In this workload, threads cannot stop while blocked mid-wait on writeLock acquisition. And, in
fact, the time they spend waiting for that acquisition is part of the overall performance. All
stalled workers eventually have to acquire the lock and complete at least one operation to get that
waiting time accounted for. In other words, their measurement phase is inevitably prolonged, and
whatever minuscule effective throughput they get from that adds up to the total throughput.
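A stripped-down sketch of that mechanic (my own standalone demo, not the JMH harness): the stop flag is observed only *between* invocations, so a worker blocked on the lock must still acquire it and complete one op before its measured interval can end.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class TerminationDemo {
    static final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    static volatile boolean done = false;

    // One "benchmark invocation": hold the shared write lock for ~200 ms.
    static void op() throws InterruptedException {
        ReentrantReadWriteLock.WriteLock w = rwLock.writeLock();
        w.lock();
        try {
            Thread.sleep(200);
        } finally {
            w.unlock();
        }
    }

    // Two contending workers; returns each worker's measured wall time, in ms.
    static long[] runWorkers() throws InterruptedException {
        long[] times = new long[2];
        Thread[] ts = new Thread[2];
        for (int i = 0; i < 2; i++) {
            final int id = i;
            ts[i] = new Thread(() -> {
                long start = System.nanoTime();
                try {
                    do {
                        op();
                    } while (!done); // flag checked only between invocations
                } catch (InterruptedException ignored) {
                }
                times[id] = (System.nanoTime() - start) / 1_000_000;
            }, "worker-" + i);
            ts[i].start();
        }
        Thread.sleep(100); // the "iteration" nominally ends after 100 ms...
        done = true;
        for (Thread t : ts) {
            t.join();      // ...but the blocked worker must still acquire the
        }                  // lock and finish one op, prolonging its interval
        return times;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] t = runWorkers();
        // Typically one worker measures ~200 ms and the other ~400 ms,
        // mirroring the 1000 ms / 2000 ms split in the JMH log above.
        System.out.println("worker-0: " + t[0] + " ms, worker-1: " + t[1] + " ms");
    }
}
```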

There are additional effects due to "synchronized iterations": when one of the threads exits the
measurement but still does invocations, it effectively starves the other thread from acquiring the
lock -- that explains the longer iterations. Making the lock fair would prevent the gross
imbalance too, which would smooth it out.
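For reference, fairness is opt-in via the constructor argument; a fair lock hands the write lock over in roughly FIFO order, so a still-running worker cannot barge back in ahead of a stalled one (a sketch, not part of the patch above):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FairLockSketch {
    // Passing true requests the fair ordering policy: under contention,
    // waiting threads acquire the lock in approximately FIFO order.
    static final ReentrantReadWriteLock fairLock = new ReentrantReadWriteLock(true);

    public static void main(String[] args) {
        System.out.println(fairLock.isFair()); // prints true
    }
}
```

Note that fair locks generally have lower overall throughput than the default non-fair policy, which would itself perturb a benchmark like this one.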

Seems like you stepped on another landmine in coarse-grained blocking benchmarks. I see no clear
way out of this yet, except for making the sleep granularity lower or increasing the iteration
time, so that termination effects are smoothed out. Maybe we can come up with a smarter adjustment
scheme that sums up individual throughputs while accounting for the different measured intervals,
without throwing asymmetric and non-blocking benchmarks under the bus.
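Back-of-the-envelope (my own arithmetic, not from the thread): a stalled worker may start an op just before the stop signal and then wait for another worker's in-flight op, so its measured interval can stretch to roughly the iteration time plus twice the per-op time. The relative distortion is therefore on the order of opTime / iterationTime, which is why shorter ops or longer iterations smooth it out:

```java
public class TailDistortion {
    // Worst-case relative overshoot of a worker's measured interval,
    // under the rough model above: ~2 * opTime / iterationTime.
    static double worstCaseOvershoot(double opMs, double iterMs) {
        return 2 * opMs / iterMs;
    }

    public static void main(String[] args) {
        // 1000 ms ops in a 1000 ms iteration: up to ~200% overshoot,
        // matching the 2000 ms interval seen in the log above.
        System.out.println(worstCaseOvershoot(1000, 1000));
        // 10 ms ops in a 10 s iteration: ~0.2%, lost in the noise.
        System.out.println(worstCaseOvershoot(10, 10_000));
    }
}
```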

Lemme think about it.

-Aleksey



More information about the jmh-dev mailing list