Any option to use something other than "time" to measure benchmarks?

Mon Jul 25 21:40:19 UTC 2016

On 07/25/2016 03:07 AM, Travis Downs wrote:
> When I reduce most of the variability by running at "max frequency" and
> disabling most OS power management, I still see a turbo-boost related
> effect where the first iterations of a benchmark run at a higher (turbo)
> speed than later ones (which throttle down) - which also depends on factors
> like the ambient air temperature, and the type of pants I am wearing
> (believe it or not - when I wear fleece pants, it causes higher temps and
> turbo throttles down much more quickly).

To be blunt, it is wishful thinking to expect that you can throw a
benchmark against the wall of unprepared hardware and get meaningful
data.

This is a problem with the benchmarking environment, not with the metric
a harness uses. Software cannot fix misbehaving hardware or a
non-cooperating software stack. Watching the CPU frequency policy and
thermals is a well-known part of doing benchmarking.
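
As one illustration of "watching the CPU frequency policy", here is a
minimal sketch that reads the Linux cpufreq governor for each CPU before
a run. It is not part of JMH; the class and method names are
hypothetical, and it assumes the standard Linux sysfs layout, which is
absent on other OSes and in some VMs and containers.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper, not part of JMH: reports the Linux cpufreq governor per CPU.
class GovernorCheck {
    // Assumption: standard Linux cpufreq sysfs location.
    static final String SYSFS_TEMPLATE =
            "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor";

    static Path governorPath(int cpu) {
        return Path.of(String.format(SYSFS_TEMPLATE, cpu));
    }

    // Returns the governor name, or empty if the file cannot be read.
    static Optional<String> readGovernor(Path file) {
        try {
            return Optional.of(Files.readString(file).trim());
        } catch (IOException e) {
            return Optional.empty();
        }
    }

    // Assumption for this sketch: only "performance" counts as a fixed policy.
    static boolean looksStable(String governor) {
        return "performance".equals(governor);
    }

    public static void main(String[] args) {
        // CPU indices are assumed to be 0..n-1; unreadable entries are reported as such.
        int cpus = Runtime.getRuntime().availableProcessors();
        for (int cpu = 0; cpu < cpus; cpu++) {
            Optional<String> gov = readGovernor(governorPath(cpu));
            if (gov.isEmpty()) {
                System.out.println("cpu" + cpu + ": governor not readable");
            } else {
                String verdict = looksStable(gov.get()) ? "ok" : "may vary frequency";
                System.out.println("cpu" + cpu + ": " + gov.get() + " (" + verdict + ")");
            }
        }
    }
}
```

A check like this only covers the frequency policy; it says nothing
about turbo boost or thermal throttling, which need to be handled
separately.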

JMH even has a "JMH Core Benchmarks" package that assesses the
environment health; see e.g. the runnable JAR:

http://central.maven.org/maven2/org/openjdk/jmh/jmh-core-benchmarks/1.13/jmh-core-benchmarks-1.13-full.jar

> There is a near-panacea for all this: have an option to use a measurement
> which directly measures CPU cycles, rather than time. For example,
> something like "unhalted cpu cycles" offered by the performance counters on
> x86 chips. This solves most of the problems above since this counter
> "ticks" at the same speed as the CPU. Some variability remains because some
> parts of the system (notably, the latency to RAM) may not scale in the same
> way, but in general it is much better.
> 
> I don't know how practical this is to do from JMH, so for now I'm just
> throwing the idea out there.

There _is_ support for measuring cycles taken per benchmark op: the
perfnorm profiler, which normalizes the perf counters per @Benchmark
call. This metric is not primary, but it is available.

I don't believe measuring hardware cycles is a panacea.

First, performance counter support (esp. from userland) differs across
hardware and OS variants. The known reliable combination is Linux/x86,
with Linux/ARM and Linux/POWER catching up. Time measurement, on the
other hand, is ubiquitously supported. It seems odd to require
performance counters to be accessible on a platform in order to do
performance measurements.

Second, the not-so-painfully-available PMU sampling is not as accurate
as timing measurement over many samples. For example, take the simple
HelloWorld test from the JMH samples:

$ java -jar jmh-samples/target/benchmarks.jar Hello -f 5 -wi 5 -i 5 -tu ns -bm avgt -prof perfnorm

Benchmark                              Mode  Cnt   Score    Error  Units
JMHSample_01_...wellHelloThere         avgt   25   0.252 ±  0.002  ns/op
JMHSample_01_...wellHelloThere:·cycles avgt    5   1.073 ±  0.043   #/op

So in absolute values, the timing measurement is about 4x lower, and you
would expect its error margin to be about 4x lower too, but instead it
is about 20x lower!
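
To make that comparison concrete, here is a small sketch that recomputes
the ratios from the table above. Only the four numbers come from the JMH
output; the class and method names are made up for illustration.

```java
// Illustrative arithmetic only; the four constants are copied from the table above.
class ErrorMargins {
    static final double TIME_SCORE = 0.252;   // ns/op
    static final double TIME_ERROR = 0.002;   // ns/op
    static final double CYCLES_SCORE = 1.073; // cycles/op
    static final double CYCLES_ERROR = 0.043; // cycles/op

    // Error margin as a fraction of the score.
    static double relativeError(double score, double error) {
        return error / score;
    }

    public static void main(String[] args) {
        System.out.printf("score ratio (cycles/time): %.2f%n", CYCLES_SCORE / TIME_SCORE);
        System.out.printf("error ratio (cycles/time): %.2f%n", CYCLES_ERROR / TIME_ERROR);
        System.out.printf("relative error, time:   %.2f%%%n",
                100 * relativeError(TIME_SCORE, TIME_ERROR));
        System.out.printf("relative error, cycles: %.2f%%%n",
                100 * relativeError(CYCLES_SCORE, CYCLES_ERROR));
    }
}
```

Put differently, the timing score carries a relative error of roughly
0.8%, while the cycles score carries roughly 4%, so the cycles metric is
about 5x noisier relative to its own magnitude.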

This happens for several reasons:
 a) "cycles" in perf is sampled, so the absolute value is an estimate;
its accuracy is proportional to the sampling frequency, and the overhead
is proportional to that too!
 b) (harder) correlating external PMU counters that say "system did X
cycles from time1 to time2" with internal @Benchmark invocation counters
that say "we did Y ops from time3 to time4" (extend this to the number
of running worker threads!) is an art too.
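
Point (b) can be sketched with a toy model. Everything here is
hypothetical: the numbers are invented, and the model only assumes that
the counter window is wider than the window in which ops were counted,
so some cycles that do not belong to @Benchmark calls get attributed to
them anyway.

```java
// Toy model of normalizing externally counted cycles by internally counted ops.
class WindowMismatch {
    // Cycles seen by the external counter, divided by ops seen by the harness.
    static double normalizedCycles(long cyclesInCounterWindow, long opsInBenchWindow) {
        return (double) cyclesInCounterWindow / opsInBenchWindow;
    }

    public static void main(String[] args) {
        long ops = 1_000_000_000L;            // invented: ops counted from time3 to time4
        long benchmarkCycles = 1_000_000_000L; // invented: cycles truly spent in those ops
        long strayCycles = 50_000_000L;        // invented: cycles from time1..time3 and time4..time2

        System.out.println("aligned windows:    "
                + normalizedCycles(benchmarkCycles, ops) + " cycles/op");
        System.out.println("misaligned windows: "
                + normalizedCycles(benchmarkCycles + strayCycles, ops) + " cycles/op");
    }
}
```

In this model a 5% window mismatch shows up directly as a 5% error in
cycles/op, which is already larger than the error margin of the timing
measurement above.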

Synchronous per-thread timestamping works better. Granted, you may want
something like RDPMC for reading the counters directly, and use that
instead of timestamps. Assuming you somehow manhandled Java into doing
this, you now have to solve the thread affinity problem, because:
 a) threads can migrate, making the cycle difference bogus;
 b) several threads may have to be scheduled on the same core (e.g. when
there are more threads than cores), at which point the PMC difference
has to be shared between all participating threads;
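
Both pitfalls can be illustrated with a toy model of free-running
per-core counters. This is not how to read a real PMC from Java; the
class, the methods, and all numbers are hypothetical, and the model
assumes nothing saves or restores the counters when threads are
scheduled.

```java
// Toy model: each core has its own free-running cycle counter.
class PmcPitfalls {
    // (a) The thread reads the counter on core 0, works there, migrates,
    // works on core 1, and reads the counter on core 1.
    static long migrationDelta(long core0Start, long core1Start,
                               long workOnCore0, long workOnCore1) {
        long startReading = core0Start;
        // workOnCore0 advances core 0's counter, which the end reading never sees.
        long endReading = core1Start + workOnCore1;
        return endReading - startReading;
    }

    // (b) Two threads share one core; the counter advances for both, so the
    // delta seen by one thread includes the other thread's cycles.
    static long sharedCoreDelta(long ownWork, long otherThreadWork) {
        return ownWork + otherThreadWork;
    }

    public static void main(String[] args) {
        long trueCycles = 300 + 200;
        System.out.println("true cycles:              " + trueCycles);
        System.out.println("delta after migration:    "
                + migrationDelta(1_000_000L, 5_000L, 300, 200));
        System.out.println("delta on a shared core:   " + sharedCoreDelta(100, 100)
                + " (own work was 100)");
    }
}
```

In the migration case the difference is not just imprecise but
meaningless, because the two readings come from unrelated counters.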

etc, etc, etc.

Thanks,
-Aleksey