RFD: Subsampled profile counters in HotSpot
Andrew Haley
aph-open at littlepinkcloud.com
Fri Feb 14 18:01:23 UTC 2025
This is JDK-8134940: TieredCompilation profiling can exhibit poor scalability.
(Thanks to Igor Veresov for the inspiration and advice!)
The poster child for awful profile scaling is SOR,
https://cr.openjdk.org/~redestad/scratch/sor-jmh/
Here is the TieredStopAtLevel=3 performance with just one thread:
Benchmark Mode Cnt Score Error Units
JGSOR.test avgt 3 9.144 ± 1.044 ms/op
and with 16 hardware threads:
JGSOR.test avgt 3 1177.982 ± 5.108 ms/op
So that's roughly a 100-fold performance drop. This is a real problem
I've seen in production deployments.
I've been looking at the idea of incrementing profile counters less
frequently, recording events in a pseudo-randomly subsampled way. For
example, you could set ProfileCaptureRatio=16 and then, at random,
only 1/16th of counter updates would be recorded.
In theory, those 1/16th of updates should be representative, but
theory is not necessarily the same as practice. Here's where
Statistics comes to our rescue, though. I am not a statistician, but I
think the Glivenko-Cantelli theorem (sometimes called the fundamental
theorem of statistics) applies. It "describes the
asymptotic behaviour of the empirical distribution function as the
number of independent and identically distributed observations grows.
Specifically, the empirical distribution function converges uniformly
to the true distribution function almost surely." (It doesn't say how
long that'll take, though.)
So, as long as the random-number generator we use is completely
uncorrelated with the process we're profiling, even a substantially
undersampled set of profile counters should converge to the same
ratios we'd have if not undersampling.
I have done a very rough proof-of-concept implementation of
subsampling. It's at
https://github.com/openjdk/jdk/pull/23643/files
It's not fit for anything much except to demonstrate the feasibility
of this approach. While the code isn't great, I think it fairly
represents the performance we could expect if we decided to go this
way.
These are the JMH results for SOR, 16 threads with various subsampling
ratios, controlled by -XX:ProfileCaptureRatio=n:
n Benchmark Mode Cnt Score Error Units
1 JGSOR.test avgt 3 1177.982 ± 5.108 ms/op
2 JGSOR.test avgt 3 622.435 ± 101.466 ms/op
4 JGSOR.test avgt 3 310.496 ± 17.681 ms/op
8 JGSOR.test avgt 3 170.867 ± 0.911 ms/op
16 JGSOR.test avgt 3 98.210 ± 9.236 ms/op
32 JGSOR.test avgt 3 58.137 ± 3.501 ms/op
64 JGSOR.test avgt 3 35.384 ± 0.922 ms/op
128 JGSOR.test avgt 3 22.076 ± 0.197 ms/op
256 JGSOR.test avgt 3 15.459 ± 2.312 ms/op
1024 JGSOR.test avgt 3 10.180 ± 0.426 ms/op
With n=1, there is no undersampling at all, and we see the
catastrophic slowdown which is the subject of this bug report. The
performance improves rapidly, but not quite linearly, with increasing
subsampling ratios, as you'd expect.
For reference, this is the JMH command line:
/build/linux-x86_64-server-release/jdk/bin/java -jar ./build/linux-x86_64-server-release/images/test/micro/benchmarks.jar SOR -t 16 -f 1 -wi 3 -i 3 -r 1 -w 1 -jvmArgs ' -XX:TieredStopAtLevel=3 -XX:ProfileCaptureRatio=16'
Surprisingly, the overhead of randomized subsampling isn't so great.
Here's the speed of the same JGSOR.test with only 1 thread, with
various subsampling ratios:
n Benchmark Mode Cnt Score Error Units
1 JGSOR.test avgt 3 9.087 ± 0.041 ms/op (not undersampled)
2 JGSOR.test avgt 3 22.431 ± 0.079 ms/op
4 JGSOR.test avgt 3 14.291 ± 0.048 ms/op
8 JGSOR.test avgt 3 10.316 ± 0.021 ms/op
16 JGSOR.test avgt 3 9.360 ± 0.022 ms/op
32 JGSOR.test avgt 3 9.196 ± 0.042 ms/op
We can see that with 16-fold undersampling, the single-threaded
overhead of profile counting is essentially no worse than with no
undersampling at all. We could, at least in theory, ship HotSpot with
32-fold undersampling as a default, and no one would ever notice,
except that the poor scaling behaviour would be very much reduced.
However, there is a cost. C1 code size does increase because we have
to step the random-number generator at every profiling site. It's not
too bad, because the added code is just something like
mov ebx,0x41c64e6d
imul r14d,ebx
add r14d,0x3039
cmp r14d,0x1000000 // ProfileCaptureRatio
jae 0x00007fffe099c197
... profiling code
but it's definitely bigger.
The POC is C1-only, x86-only, and I haven't done anything about
profiling in the interpreter. I'm sure it has bugs, and it'll probably
crash if you push it too hard. But it is, I think, very representative
of how well a finished implementation would perform.
--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671