RFD: Subsampled profile counters in HotSpot
Andrew Haley
aph-open at littlepinkcloud.com
Fri Feb 14 18:01:23 UTC 2025
This is JDK-8134940: TieredCompilation profiling can exhibit poor scalability.
(Thanks to Igor Veresov for the inspiration and advice!)
The poster child for awful profile scaling is SOR,
https://cr.openjdk.org/~redestad/scratch/sor-jmh/
Here is the TieredStopAtLevel=3 performance with just one thread:
Benchmark Mode Cnt Score Error Units
JGSOR.test avgt 3 9.144 ± 1.044 ms/op
and with 16 hardware threads:
JGSOR.test avgt 3 1177.982 ± 5.108 ms/op
So that's roughly a 100-fold performance drop. This is a real problem
I've seen in production deployments.
I've been looking at the idea of incrementing profile counters less
frequently, recording events in a pseudo-randomly subsampled way. For
example, you could set ProfileCaptureRatio=16 and then, at random,
only 1/16th of counter updates would be recorded.
In theory, those 1/16th of updates should be representative, but
theory is not necessarily the same as practice. Here's where
Statistics comes to our rescue, though. I am not a statistician, but I
think the Glivenko-Cantelli theorem (sometimes called the fundamental
theorem of statistics) applies. It "describes the
asymptotic behaviour of the empirical distribution function as the
number of independent and identically distributed observations grows.
Specifically, the empirical distribution function converges uniformly
to the true distribution function almost surely." (It doesn't say how
long that'll take, though.)
So, as long as the random-number generator we use is completely
uncorrelated with the process we're profiling, even a substantially
undersampled set of profile counters should converge to the same
ratios we'd have if not undersampling.
I have done a very rough proof-of-concept implementation of
subsampling. It's at
https://github.com/openjdk/jdk/pull/23643/files
It's not fit for anything much except to demonstrate the feasibility
of this approach. While the code isn't great, I think it fairly
represents the performance we could expect if we decided to go this
way.
These are the JMH results for SOR, 16 threads with various subsampling
ratios, controlled by -XX:ProfileCaptureRatio=n:
n Benchmark Mode Cnt Score Error Units
1 JGSOR.test avgt 3 1177.982 ± 5.108 ms/op
2 JGSOR.test avgt 3 622.435 ± 101.466 ms/op
4 JGSOR.test avgt 3 310.496 ± 17.681 ms/op
8 JGSOR.test avgt 3 170.867 ± 0.911 ms/op
16 JGSOR.test avgt 3 98.210 ± 9.236 ms/op
32 JGSOR.test avgt 3 58.137 ± 3.501 ms/op
64 JGSOR.test avgt 3 35.384 ± 0.922 ms/op
128 JGSOR.test avgt 3 22.076 ± 0.197 ms/op
256 JGSOR.test avgt 3 15.459 ± 2.312 ms/op
1024 JGSOR.test avgt 3 10.180 ± 0.426 ms/op
With n=1, there is no undersampling at all, and we see the
catastrophic slowdown which is the subject of this bug report. The
performance improves rapidly, but not quite linearly, with increasing
subsampling ratios, as you'd expect.
For reference, this is the JMH command line:
/build/linux-x86_64-server-release/jdk/bin/java -jar ./build/linux-x86_64-server-release/images/test/micro/benchmarks.jar SOR -t 16 -f 1 -wi 3 -i 3 -r 1 -w 1 -jvmArgs ' -XX:TieredStopAtLevel=3 -XX:ProfileCaptureRatio=16'
Surprisingly, the overhead of randomized subsampling isn't so great.
Here's the speed of the same JGSOR.test with only 1 thread, with
various subsampling ratios:
n Benchmark Mode Cnt Score Error Units
1 JGSOR.test avgt 3 9.087 ± 0.041 ms/op (not undersampled)
2 JGSOR.test avgt 3 22.431 ± 0.079 ms/op
4 JGSOR.test avgt 3 14.291 ± 0.048 ms/op
8 JGSOR.test avgt 3 10.316 ± 0.021 ms/op
16 JGSOR.test avgt 3 9.360 ± 0.022 ms/op
32 JGSOR.test avgt 3 9.196 ± 0.042 ms/op
We can see that with 16-fold undersampling, the single-threaded
overhead of profile counting is essentially no worse than with no
undersampling at all. We could, at least in theory, ship HotSpot with
32-fold undersampling as a default, and no one would ever notice,
except that the poor scaling behaviour would be very much reduced.
However, there is a cost. C1 code size does increase because we have
to step the random-number generator at every profiling site. It's not
too bad, because the added code is just something like
mov ebx,0x41c64e6d
imul r14d,ebx
add r14d,0x3039
cmp r14d,0x1000000 // ProfileCaptureRatio
jae 0x00007fffe099c197
... profiling code
but it's definitely bigger.
The POC is C1-only, x86-only, and I haven't done anything about
profiling in the interpreter. I'm sure it has bugs, and it'll probably
crash if you push it too hard. But it is, I think, very representative
of how well a finished implementation would perform.
--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671