Fwd: JMH score stability

Laurent Bourgès bourges.laurent at gmail.com
Sat Apr 30 12:52:22 UTC 2022


FYI,
I am starting a discussion on:
- JMH score variability
- sampled timings in micro-benchmarking follow a log-normal rather than a
normal distribution.

I think this is very important for cpu/mem benchmarks where latency is not
the relevant metric.

Cheers,
Laurent

---------- Forwarded message ---------
From: Laurent Bourgès <bourges.laurent at gmail.com>
Date: Sat, Apr 30, 2022, 14:46
Subject: JMH score stability
To: <aleksey at shipilev.net>, Vladimir Yaroslavskiy <vlv.spb.ru at mail.ru>


Dear Aleksey,

I am currently working on JMH benchmarks to estimate performance changes
among DPQS variants.
I noticed a few issues with JMH 1.31:
- average time mode gives a score +/- error (99.9% confidence interval), but
this estimator is quite unstable, as the distribution is log-normal or
long-tailed... so GC or JIT latency makes the score vary quite a lot when
the iteration count, duration or number of forks vary
- sample time mode is better, as it gives the min and the percentiles at 50,
90, 99.9..., but its overall aggregated score and error are biased too, in
the sense that the score can be quite far from the median value (see the
sketch after this list).
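For reference, here is a minimal sketch of such a sample-time setup (class,
method and parameter names are placeholders, not my actual DPQS harness):

    import java.util.Arrays;
    import java.util.Random;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.SampleTime)         // collect a histogram of sample times
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @State(Scope.Thread)
    @Warmup(iterations = 5, time = 1)
    @Measurement(iterations = 10, time = 1)
    @Fork(3)
    public class SortBench {

        @Param({"10000", "100000"})
        int size;

        int[] data;

        @Setup(Level.Invocation)            // regenerate before each call: sorting mutates its input
        public void setup() {
            data = new Random(42).ints(size).toArray();
        }

        @Benchmark
        public int[] sort() {
            Arrays.sort(data);              // stand-in for the DPQS variant under test
            return data;
        }
    }

JMH then reports the min and the p50/p90/p99.9... percentiles per benchmark.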

I managed to extract the benchmark's raw sampled values [score : count], so
I am computing the global score and its standard deviation myself, using
only the times below the 95th percentile threshold.
Such a statistic discards the outliers, and the resulting score is a lot
more stable (a sketch of that trimming follows).
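As an illustration, a minimal sketch of the trimming, assuming the
[score : count] pairs have already been expanded into a plain array of
sample times (the helper is mine, not JMH API):

    import java.util.Arrays;

    // Trimmed statistics over raw JMH sample times (illustrative, not part of JMH).
    final class TrimmedStats {

        // Mean and standard deviation of the samples below the given
        // percentile threshold (e.g. 0.95 for the 95th percentile).
        static double[] meanAndStdBelow(double[] samples, double percentile) {
            double[] sorted = samples.clone();
            Arrays.sort(sorted);
            int cut = (int) Math.ceil(percentile * sorted.length);  // cut-off index
            double[] kept = Arrays.copyOfRange(sorted, 0, cut);     // drop the long tail
            double mean = Arrays.stream(kept).average().orElse(Double.NaN);
            double var = Arrays.stream(kept)
                               .map(v -> (v - mean) * (v - mean))
                               .sum() / Math.max(1, kept.length - 1);
            return new double[] { mean, Math.sqrt(var) };
        }
    }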

Aleksey, what do you think about this approach? Maybe having a new cut-off
threshold like 0.9 or 0.99 could make JMH's aggregated score better for
cpu/mem benchmarks where the 99.999 percentile latency is not the important
metric.

Finally, I wonder if I could fit a log-normal distribution (using the
Apache Commons Math GA fitter), as the samples are absolutely NOT Gaussian
in my sorting contest.
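For what it's worth, the maximum-likelihood fit of a log-normal reduces to
the mean and standard deviation of the log-transformed samples, so a simple
sketch like the one below might already be enough to check the fit, without
needing a GA:

    import java.util.Arrays;

    // MLE fit of a log-normal: mu and sigma of ln(x); assumes all samples are > 0.
    final class LogNormalFit {
        static double[] fit(double[] samples) {
            double[] logs = Arrays.stream(samples).map(Math::log).toArray();
            double mu = Arrays.stream(logs).average().orElse(Double.NaN);
            double var = Arrays.stream(logs)
                               .map(v -> (v - mu) * (v - mu))
                               .sum() / Math.max(1, logs.length - 1);
            return new double[] { mu, Math.sqrt(var) };  // {mu, sigma}
        }
    }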

Cheers,
Laurent

