RFD: Subsampled profile counters in HotSpot
Vladimir Kozlov
vladimir.kozlov at oracle.com
Fri Feb 14 18:14:07 UTC 2025
Thank you, Andrew, for sharing this data from your experiment.
This is indeed very promising.
Thanks,
Vladimir K
On 2/14/25 10:01 AM, Andrew Haley wrote:
> This is JDK-8134940: TieredCompilation profiling can exhibit poor
> scalability.
>
> (Thanks to Igor Veresov for the inspiration and advice!)
>
> The poster child for awful profile scaling is SOR,
> https://cr.openjdk.org/~redestad/scratch/sor-jmh/
>
> Here is the TieredStopAtLevel=3 performance with just one thread:
>
> Benchmark Mode Cnt Score Error Units
> JGSOR.test avgt 3 9.144 ± 1.044 ms/op
>
> and with 16 hardware threads:
>
> JGSOR.test avgt 3 1177.982 ± 5.108 ms/op
>
> So it's a 100-fold performance drop. This is a real problem I've
> seen in production deployments.
>
> I've been looking at the idea of incrementing profile counters less
> frequently, recording events in a pseudo-random subsampled way. For
> example, you could set the ProfileCaptureRatio=16 and then, at random,
> only 1/16th of counter updates would be recorded.
>
> In theory, those 1/16th of updates should be representative, but
> theory is not necessarily the same as practice. Here's where
> Statistics comes to our rescue, though. I am not a statistician, but I
> think the Glivenko–Cantelli theorem (sometimes called the fundamental
> theorem of statistics) applies. It "describes the asymptotic behaviour
> of the empirical distribution function as the number of independent and
> identically distributed observations grows. Specifically, the empirical
> distribution function converges uniformly to the true distribution
> function almost surely." (It doesn't say how long that'll take, though.)
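For reference, the theorem's formal statement (the standard textbook form, with F the true distribution function and F_n the empirical distribution function of n i.i.d. observations) is:

```latex
\sup_{x \in \mathbb{R}} \bigl| F_n(x) - F(x) \bigr|
\;\xrightarrow{\ \mathrm{a.s.}\ }\; 0
\qquad \text{as } n \to \infty .
```

In profiling terms: the distribution built from the subsampled counter updates converges to the one the full counters would have produced, provided the sampling decisions are independent of the profiled events.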
>
> So, as long as the random-number generator we use is completely
> uncorrelated with the process we're profiling, even a substantially
> undersampled set of profile counters should converge to the same
> ratios we'd have if not undersampling.
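As a quick sanity check of that claim, here is a standalone simulation (my own illustration, not code from the POC; the 75% branch bias and the ratio of 16 are arbitrary choices) showing that a subsampled counter pair tracks the full counters' taken/not-taken ratio closely:

```java
import java.util.Random;

// Simulate a branch that is taken 75% of the time, profiled two ways:
// a full counter pair, and a pair updated only ~1/RATIO of the time.
public class SubsampleSim {
    public static void main(String[] args) {
        final int RATIO = 16;         // cf. -XX:ProfileCaptureRatio=16
        Random rng = new Random(42);  // stand-in for the POC's LCG
        long fullTaken = 0, fullTotal = 0;
        long subTaken = 0, subTotal = 0;
        for (int i = 0; i < 10_000_000; i++) {
            boolean taken = rng.nextDouble() < 0.75;
            fullTotal++;
            if (taken) fullTaken++;
            // record this event only 1/RATIO of the time, at random
            if (rng.nextInt(RATIO) == 0) {
                subTotal++;
                if (taken) subTaken++;
            }
        }
        double fullRatio = (double) fullTaken / fullTotal;
        double subRatio = (double) subTaken / subTotal;
        System.out.printf("full: %.4f  subsampled: %.4f%n",
                          fullRatio, subRatio);
    }
}
```

With ten million events, the subsampled pair sees roughly 625,000 of them, and its taken ratio lands within a fraction of a percent of the full counters' ratio.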
>
> I have done a very rough proof-of-concept implementation of
> subsampling. It's at
> https://github.com/openjdk/jdk/pull/23643/files
> It's not fit for anything much except to demonstrate the feasibility
> of using this approach. While the code isn't great, I think that it
> does fairly represent the performance we could expect if we decided to
> go with this approach.
>
> These are the JMH results for SOR, 16 threads with various subsampling
> ratios, controlled by -XX:ProfileCaptureRatio=n:
>
> n Benchmark Mode Cnt Score Error Units
>
> 1 JGSOR.test avgt 3 1177.982 ± 5.108 ms/op
> 2 JGSOR.test avgt 3 622.435 ± 101.466 ms/op
> 4 JGSOR.test avgt 3 310.496 ± 17.681 ms/op
> 8 JGSOR.test avgt 3 170.867 ± 0.911 ms/op
> 16 JGSOR.test avgt 3 98.210 ± 9.236 ms/op
> 32 JGSOR.test avgt 3 58.137 ± 3.501 ms/op
> 64 JGSOR.test avgt 3 35.384 ± 0.922 ms/op
> 128 JGSOR.test avgt 3 22.076 ± 0.197 ms/op
> 256 JGSOR.test avgt 3 15.459 ± 2.312 ms/op
> 1024 JGSOR.test avgt 3 10.180 ± 0.426 ms/op
>
> With n=1, there is no undersampling at all, and we see the
> catastrophic slowdown which is the subject of this bug report. The
> performance improves rapidly, but not quite linearly, with increasing
> subsampling ratios, as you'd expect.
>
> /build/linux-x86_64-server-release/jdk/bin/java \
>     -jar ./build/linux-x86_64-server-release/images/test/micro/benchmarks.jar \
>     SOR -t 16 -f 1 -wi 3 -i 3 -r 1 -w 1 \
>     -jvmArgs '-XX:TieredStopAtLevel=3 -XX:ProfileCaptureRatio=16'
>
> Surprisingly, the overhead for randomized subsampling isn't so great.
> Here's the speed of the same JGSOR.test with only 1 thread, with
> various subsampling ratios:
>
> 1 JGSOR.test avgt 3 9.087 ± 0.041 ms/op (not undersampled)
> 2 JGSOR.test avgt 3 22.431 ± 0.079 ms/op
> 4 JGSOR.test avgt 3 14.291 ± 0.048 ms/op
> 8 JGSOR.test avgt 3 10.316 ± 0.021 ms/op
> 16 JGSOR.test avgt 3 9.360 ± 0.022 ms/op
> 32 JGSOR.test avgt 3 9.196 ± 0.042 ms/op
>
> We can see that if we undersample 16-fold, then the single-threaded
> overhead for profile counting is no worse than it is with no
> undersampling at all. We could, at least in theory, ship HotSpot with
> 32-fold undersampling as a default, and no one would ever notice,
> except that the poor scaling behaviour would be very much reduced.
>
> However, there is a cost. C1 code size does increase because we have
> to step the random-number generator at every profiling site. It's not
> too bad, because the added code is just something like
>
> mov ebx,0x41c64e6d
> imul r14d,ebx
> add r14d,0x3039
> cmp r14d,0x1000000 // ProfileCaptureRatio
> jae 0x00007fffe099c197
> ... profiling code
>
> but it's definitely bigger.
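Expressed as plain Java, the gate above computes something like the following (a sketch, not the POC's actual code: the LCG constants match the disassembly, but the way the compare threshold is derived from ProfileCaptureRatio is my assumption):

```java
// Sketch of the per-site sampling gate in the C1 snippet above.
// In the real POC the LCG state would live in per-thread storage;
// a static field is used here only to keep the example self-contained.
public class SampleGate {
    static int state = 1;

    static boolean shouldProfile(int captureRatio) {
        state = state * 0x41c64e6d + 0x3039;  // step the LCG (imul/add)
        // assumed derivation of the threshold from the capture ratio
        int threshold = (int) (0x1_0000_0000L / captureRatio);
        // unsigned compare, as in the cmp/jae pair: profile only when the
        // state falls below the threshold, i.e. ~1/captureRatio of the time
        return Integer.compareUnsigned(state, threshold) < 0;
    }

    public static void main(String[] args) {
        int hits = 0, n = 1_000_000;
        for (int i = 0; i < n; i++) {
            if (shouldProfile(16)) hits++;
        }
        System.out.println(hits);  // roughly n / 16
    }
}
```

The jae in the disassembly skips the profiling code on the common path, so the per-event cost is just the three LCG instructions plus a predictable branch.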
>
> The POC is C1 only, x86 only, and I haven't done anything about
> profiling the interpreter. I'm sure it has bugs. It'll probably crash
> if you push it too hard. But it is, I think, very representative of
> how well a finished implementation would perform.
>
More information about the hotspot-compiler-dev mailing list