RFD: Subsampled profile counters in HotSpot
Vladimir Kozlov
vladimir.kozlov at oracle.com
Fri Feb 14 18:14:07 UTC 2025
Thank you, Andrew, for sharing this data from your experiment.
This is indeed very promising.
Thanks,
Vladimir K
On 2/14/25 10:01 AM, Andrew Haley wrote:
> This is JDK-8134940: TieredCompilation profiling can exhibit poor
> scalability.
>
> (Thanks to Igor Veresov for the inspiration and advice!)
>
> The poster child for awful profile scaling is SOR,
> https://cr.openjdk.org/~redestad/scratch/sor-jmh/
>
> Here is the TieredStopAtLevel=3 performance with just one thread:
>
> Benchmark Mode Cnt Score Error Units
> JGSOR.test avgt 3 9.144 ± 1.044 ms/op
>
> and with 16 hardware threads:
>
> JGSOR.test avgt 3 1177.982 ± 5.108 ms/op
>
> So it's a 100-fold performance drop. This is a real problem I've
> seen in production deployments.
>
> I've been looking at the idea of incrementing profile counters less
> frequently, recording events in a pseudo-random subsampled way. For
> example, you could set the ProfileCaptureRatio=16 and then, at random,
> only 1/16th of counter updates would be recorded.
>
> In theory, those 1/16th of updates should be representative, but
> theory is not necessarily the same as practice. Here's where
> Statistics comes to our rescue, though. I am not a statistician, but I
> think the Glivenko–Cantelli theorem (sometimes called the fundamental
> theorem of statistics) applies. It "describes the asymptotic behaviour
> of the empirical distribution function as the number of independent and
> identically distributed observations grows. Specifically, the empirical
> distribution function converges uniformly to the true distribution
> function almost surely." (It doesn't say how long that'll take, though.)
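For reference, the theorem's formal statement (the standard textbook form, with F the true distribution function and F_n the empirical distribution function of n i.i.d. observations) is:

```latex
\sup_{x \in \mathbb{R}} \bigl| F_n(x) - F(x) \bigr|
\;\xrightarrow{\ \mathrm{a.s.}\ }\; 0
\qquad \text{as } n \to \infty .
```

In profiling terms: the distribution built from the subsampled counter updates converges to the one the full counters would have produced, provided the sampling decisions are independent of the profiled events.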
>
> So, as long as the random-number generator we use is completely
> uncorrelated with the process we're profiling, even a substantially
> undersampled set of profile counters should converge to the same
> ratios we'd have if not undersampling.
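As a quick sanity check of that claim, here is a standalone simulation (my own illustration, not code from the POC; the 75% branch bias and the ratio of 16 are arbitrary choices) showing that a subsampled counter pair tracks the full counters' taken/not-taken ratio closely:

```java
import java.util.Random;

// Simulate a branch that is taken 75% of the time, profiled two ways:
// a full counter pair, and a pair updated only ~1/RATIO of the time.
public class SubsampleSim {
    public static void main(String[] args) {
        final int RATIO = 16;         // cf. -XX:ProfileCaptureRatio=16
        Random rng = new Random(42);  // stand-in for the POC's LCG
        long fullTaken = 0, fullTotal = 0;
        long subTaken = 0, subTotal = 0;
        for (int i = 0; i < 10_000_000; i++) {
            boolean taken = rng.nextDouble() < 0.75;
            fullTotal++;
            if (taken) fullTaken++;
            // record this event only 1/RATIO of the time, at random
            if (rng.nextInt(RATIO) == 0) {
                subTotal++;
                if (taken) subTaken++;
            }
        }
        double fullRatio = (double) fullTaken / fullTotal;
        double subRatio = (double) subTaken / subTotal;
        System.out.printf("full: %.4f  subsampled: %.4f%n",
                          fullRatio, subRatio);
    }
}
```

With ten million events, the subsampled pair sees roughly 625,000 of them, and its taken ratio lands within a fraction of a percent of the full counters' ratio.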
>
> I have done a very rough proof-of-concept implementation of
> subsampling. It's at
> https://github.com/openjdk/jdk/pull/23643/files
> It's not fit for anything much except to demonstrate the feasibility
> of using this approach. While the code isn't great, I think that it
> does fairly represent the performance we could expect if we decided to
> go with this approach.
>
> These are the JMH results for SOR, 16 threads with various subsampling
> ratios, controlled by -XX:ProfileCaptureRatio=n:
>
> n Benchmark Mode Cnt Score Error Units
>
> 1 JGSOR.test avgt 3 1177.982 ± 5.108 ms/op
> 2 JGSOR.test avgt 3 622.435 ± 101.466 ms/op
> 4 JGSOR.test avgt 3 310.496 ± 17.681 ms/op
> 8 JGSOR.test avgt 3 170.867 ± 0.911 ms/op
> 16 JGSOR.test avgt 3 98.210 ± 9.236 ms/op
> 32 JGSOR.test avgt 3 58.137 ± 3.501 ms/op
> 64 JGSOR.test avgt 3 35.384 ± 0.922 ms/op
> 128 JGSOR.test avgt 3 22.076 ± 0.197 ms/op
> 256 JGSOR.test avgt 3 15.459 ± 2.312 ms/op
> 1024 JGSOR.test avgt 3 10.180 ± 0.426 ms/op
>
> With n=1, there is no undersampling at all, and we see the
> catastrophic slowdown which is the subject of this bug report. The
> performance improves rapidly, but not quite linearly, with increasing
> subsampling ratios, as you'd expect.
>
> /build/linux-x86_64-server-release/jdk/bin/java \
>     -jar ./build/linux-x86_64-server-release/images/test/micro/benchmarks.jar \
>     SOR -t 16 -f 1 -wi 3 -i 3 -r 1 -w 1 \
>     -jvmArgs '-XX:TieredStopAtLevel=3 -XX:ProfileCaptureRatio=16'
>
> Surprisingly, the overhead for randomized subsampling isn't so great.
> Here's the speed of the same JGSOR.test with only 1 thread, with
> various subsampling ratios:
>
> 1 JGSOR.test avgt 3 9.087 ± 0.041 ms/op (not undersampled)
> 2 JGSOR.test avgt 3 22.431 ± 0.079 ms/op
> 4 JGSOR.test avgt 3 14.291 ± 0.048 ms/op
> 8 JGSOR.test avgt 3 10.316 ± 0.021 ms/op
> 16 JGSOR.test avgt 3 9.360 ± 0.022 ms/op
> 32 JGSOR.test avgt 3 9.196 ± 0.042 ms/op
>
> We can see that if we undersample 16-fold, then the single-threaded
> overhead for profile counting is no worse than it is with no
> undersampling at all. We could, at least in theory, ship HotSpot with
> 32-fold undersampling as a default, and no one would ever notice,
> except that the poor scaling behaviour would be very much reduced.
>
> However, there is a cost. C1 code size does increase because we have
> to step the random-number generator at every profiling site. It's not
> too bad, because the added code is just something like
>
> mov ebx,0x41c64e6d
> imul r14d,ebx
> add r14d,0x3039
> cmp r14d,0x1000000 // ProfileCaptureRatio
> jae 0x00007fffe099c197
> ... profiling code
>
> but it's definitely bigger.
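Expressed as plain Java, the gate above computes something like the following (a sketch, not the POC's actual code: the LCG constants match the disassembly, but the way the compare threshold is derived from ProfileCaptureRatio is my assumption):

```java
// Sketch of the per-site sampling gate in the C1 snippet above.
// In the real POC the LCG state would live in per-thread storage;
// a static field is used here only to keep the example self-contained.
public class SampleGate {
    static int state = 1;

    static boolean shouldProfile(int captureRatio) {
        state = state * 0x41c64e6d + 0x3039;  // step the LCG (imul/add)
        // assumed derivation of the threshold from the capture ratio
        int threshold = (int) (0x1_0000_0000L / captureRatio);
        // unsigned compare, as in the cmp/jae pair: profile only when the
        // state falls below the threshold, i.e. ~1/captureRatio of the time
        return Integer.compareUnsigned(state, threshold) < 0;
    }

    public static void main(String[] args) {
        int hits = 0, n = 1_000_000;
        for (int i = 0; i < n; i++) {
            if (shouldProfile(16)) hits++;
        }
        System.out.println(hits);  // roughly n / 16
    }
}
```

The jae in the disassembly skips the profiling code on the common path, so the per-event cost is just the three LCG instructions plus a predictable branch.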
>
> The POC is C1 only, x86 only, and I haven't done anything about
> profiling the interpreter. I'm sure it has bugs. It'll probably crash
> if you push it too hard. But it is, I think, very representative of
> how well a finished implementation would perform.
>
More information about the hotspot-compiler-dev mailing list