RFR: 8359947: GenShen: use smaller TLABs by default [v2]

Wed Jun 18 18:46:40 UTC 2025

On Wed, 18 Jun 2025 16:38:45 GMT, Kelvin Nilsen <kdnilsen at openjdk.org> wrote:

>> We have found with certain workloads that the initial and maximum TLAB sizes result in very high latencies for the first few invocations of particular methods for certain threads.  The root cause is that TLABs are too large.  This is causing allocatable memory to be depleted too quickly.  When large numbers of threads are trying to start up at the same time, some of the threads end up with no TLABs or very small TLABs and their efforts run hundreds of times slower than the threads that were able to grab very large TLABs.
>> 
>> This PR reduces the maximum TLAB size and adjusts the initial TLAB size in order to reduce the impact of this problem.
>> 
>> This PR also changes the value of TLABAllocationWeight from 90 to 35 when we are running in generational mode.  35 is the default value used for G1 GC, which is also generational.  The default value of 90 was established years ago for non-generational Shenandoah because it tends to have less frequent GC cycles than generational collectors.
>> 
>> We have exercised this PR with three different workloads, which we identify as small, medium, and huge.  We have also exercised in two different configurations: with and without 30s warmup before latency measurements are taken.  Finally, we have applied this PR to both tip and to a development branch identified as adaptive-evac-with-surge.
>> 
>> The initial motivation for this PR was identified during testing of the adaptive-evac-with-surge branch.  That branch runs more aggressive GCs (larger evacuation workloads, with delayed (slightly more risky) triggers).  The objectives of this branch are to make GCs more efficient and to reduce CPU consumption.
>> 
>> We report 6 results for each experiment.  We sort these according to P100 latencies, and average results from the bottom four (best performing) samples, tossing out the two high outliers from the averages.  Workload results are subject to noise from elastic computing and operating system interference.
>> 
>> The benefits of this PR are most notable with the p99.999 and p100 small configuration of adaptive-evac-with-surge and the huge configuration of tip: 
>> 
>> ![image](https://github.com/user-attachments/assets/def49a3c-4142-48f7-a946-33527e6985d0)
>> 
>> ![image](https://github.com/user-attachments/assets/b0df27b3-f7b0-4fd2-82c3-ac84b0ad380e)
>> 
>> ![image](https://github.com/user-attachments/assets/471c1292-96dc-46c1-9bcc-b851be07867d)
>> 
>> Note also the degradation in p50 and other lower percentile latencies.  The effect ...
>
> Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Remove debug instrumentation

Looks good to me! 
Smaller TLABs will result in more contention on the heap lock in the slow path that allocates a TLAB; the task I am working on, which uses CAS for that allocation, may help to mitigate this.
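
As a rough illustration of the CAS idea, here is a minimal sketch in Java of a lock-free bump-pointer allocator that hands out TLAB-sized chunks from a region. Everything in it (the `BumpRegion` class, `allocateTlab`, the use of plain offsets) is hypothetical and chosen for illustration; it is not the Shenandoah code nor the pending change, which lives in HotSpot's C++ sources and may look quite different.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration only: a region that hands out chunks via CAS. */
final class BumpRegion {
    private final long end;        // exclusive upper bound of the region
    private final AtomicLong top;  // offset of the next free byte

    BumpRegion(long start, long end) {
        this.top = new AtomicLong(start);
        this.end = end;
    }

    /**
     * Tries to carve a chunk of the given size out of the region without
     * taking a lock. Returns the start offset of the chunk, or -1 if the
     * remaining space is too small.
     */
    long allocateTlab(long size) {
        while (true) {
            long oldTop = top.get();
            long newTop = oldTop + size;
            if (newTop > end) {
                return -1; // not enough room; a caller would fall back to another path
            }
            if (top.compareAndSet(oldTop, newTop)) {
                return oldTop; // this thread won the race and owns [oldTop, newTop)
            }
            // Another thread advanced top first; reload and retry.
        }
    }
}
```

With a scheme like this, threads that race for a chunk retry a cheap compare-and-set instead of queuing on a shared lock, which is why it could offset some of the extra slow-path traffic that smaller TLABs generate.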

-------------

Marked as reviewed by xpeng (Committer).

PR Review: https://git.openjdk.org/jdk/pull/25423#pullrequestreview-2940310161