RFR: 8359947: GenShen: use smaller TLABs by default
Kelvin Nilsen
kdnilsen at openjdk.org
Wed Jun 18 16:32:08 UTC 2025
We have found with certain workloads that the default initial and maximum TLAB sizes result in very high latencies for the first few invocations of particular methods on certain threads. The root cause is that TLABs are too large, which depletes allocatable memory too quickly. When large numbers of threads try to start up at the same time, some threads end up with no TLAB or a very small one, and their efforts run hundreds of times slower than those of the threads that were able to grab very large TLABs.
This PR reduces the maximum TLAB size and adjusts the initial TLAB size in order to reduce the impact of this problem.
This PR also changes the value of TLABAllocationWeight from 90 to 35 when we are running in generational mode. 35 is the default value used for G1 GC, which is also generational. The default value of 90 was established years ago for non-generational Shenandoah because it tends to have less frequent GC cycles than generational collectors.
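For readers unfamiliar with this knob: TLABAllocationWeight feeds HotSpot's adaptive weighted averaging of allocation samples, where a higher weight gives more influence to the most recent sample. The sketch below is our own simplified model of that averaging, not the HotSpot code; it illustrates why weight 90 converges on a new allocation rate within very few samples (suiting the less frequent cycles of non-generational Shenandoah), while weight 35 smooths over many samples:

```java
// Simplified model of HotSpot-style adaptive weighted averaging, shown only
// to illustrate the effect of TLABAllocationWeight. Class and method names
// here are ours, purely for illustration.
public class TlabWeightSketch {
    // new_avg = (100 - weight)% of old average + weight% of the new sample
    static double sample(double avg, double newSample, int weight) {
        return avg * (100 - weight) / 100.0 + newSample * weight / 100.0;
    }

    public static void main(String[] args) {
        double w90 = 0.0, w35 = 0.0;
        // Feed a constant allocation-rate sample and watch convergence speed.
        for (int i = 0; i < 5; i++) {
            w90 = sample(w90, 100.0, 90);
            w35 = sample(w35, 100.0, 35);
        }
        // weight=90 has essentially converged; weight=35 is still catching up.
        System.out.printf("after 5 samples: weight90=%.1f weight35=%.1f%n", w90, w35);
    }
}
```

With frequent generational GC cycles there are many samples per unit time, so the smoother weight of 35 (G1's default) still tracks the workload adequately while being less jumpy.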
We have exercised this PR with three different workloads, which we identify as small, medium, and huge. We have also exercised it in two different configurations: with and without a 30 s warmup before latency measurements are taken. Finally, we have applied this PR both to tip and to a development branch identified as adaptive-evac-with-surge.
The initial motivation for this PR was identified during testing of the adaptive-evac-with-surge branch. That branch runs more aggressive GCs (larger evacuation workloads) with delayed (slightly more risky) triggers. The objectives of that branch are to make GCs more efficient and to reduce CPU consumption.
We report six results for each experiment. We sort these according to P100 latencies and average the results from the bottom four (best-performing) samples, tossing out the two high outliers. Workload results are subject to noise from elastic computing and operating-system interference.
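The averaging scheme just described (sort the six runs by P100 latency, discard the two high outliers, average the remaining four) can be sketched as follows; the sample values are hypothetical:

```java
import java.util.Arrays;

public class TrimmedAverage {
    // Sort the six per-run P100 latencies, drop the two high outliers,
    // and average the best-performing four, as described above.
    static double bestFourAverage(double[] p100s) {
        double[] sorted = p100s.clone();
        Arrays.sort(sorted);
        double sum = 0;
        for (int i = 0; i < 4; i++) sum += sorted[i];
        return sum / 4.0;
    }

    public static void main(String[] args) {
        // Hypothetical P100 samples (ms); 55.0 and 90.0 are the discarded outliers.
        double[] runs = {12.0, 14.0, 13.0, 55.0, 11.0, 90.0};
        System.out.println(bestFourAverage(runs)); // prints 12.5
    }
}
```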
The benefits of this PR are most notable in the p99.999 and p100 latencies of the small configuration of adaptive-evac-with-surge and of the huge configuration of tip:



Note also the degradation in p50 and other lower-percentile latencies. The effect of this PR is to require each mutator thread to allocate smaller TLABs more frequently, which results in higher contention for Shenandoah's global heap lock. A separate development effort is refactoring the slow allocation path so that the large majority of TLAB allocations complete with a single CAS, rather than the current mechanism, in which each TLAB allocation contends for exclusive access to the global heap lock and then holds that lock during the search for memory to allocate. We expect this other mechanism will improve all latencies, and should reduce or eliminate the performance regressions introduced at certain percentiles by this PR.
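That single-CAS refactoring is not part of this PR, but the idea can be illustrated with a minimal sketch: a shared top pointer bumped atomically, with the heap-lock slow path taken only when the fast path fails. All names here are ours, purely for illustration; this is not the Shenandoah code:

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of CAS-based bump-pointer TLAB carving from a shared arena.
public class CasTlabArena {
    private final AtomicLong top = new AtomicLong(0); // next free offset
    private final long end;                           // arena capacity

    CasTlabArena(long capacityBytes) { this.end = capacityBytes; }

    // Try to carve a TLAB of tlabSize bytes with a single successful CAS on
    // the shared top pointer. Returns the start offset, or -1 when the arena
    // is exhausted (a real collector would then fall back to its heap-lock
    // slow path to search for more memory).
    long allocateTlab(long tlabSize) {
        while (true) {
            long cur = top.get();
            long next = cur + tlabSize;
            if (next > end) return -1;          // exhausted: take slow path
            if (top.compareAndSet(cur, next)) { // one CAS on the common path
                return cur;
            }
            // CAS lost a race with another thread; reread top and retry.
        }
    }

    public static void main(String[] args) {
        CasTlabArena arena = new CasTlabArena(1 << 20); // 1 MiB arena
        System.out.println(arena.allocateTlab(64 * 1024)); // prints 0
    }
}
```

Because threads only retry when another thread won the race (and that thread made progress), this fast path avoids holding any lock across the allocation.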
For full context, other performance comparisons are also provided here. In the following reports, the effect of this PR is usually to reduce the total number of GCs and to reduce CPU utilization. Depending on the experiment, low-percentile or high-percentile response times improve; we attribute these differences to contention for the global heap lock.

We also observe that warming up an experiment for 30 seconds before beginning to measure latency sometimes has an effect similar to improving the initial TLAB size: the pre-existing shared heuristics for adaptive TLAB sizing seem to converge fairly quickly on improved behavior, at least for this workload, which is fairly consistent. A more variable workload may struggle more to find the optimal TLAB size. For such workloads, having smaller default TLAB sizes seems to reduce the likelihood of occasional serious performance degradation.



While both versions of the code are clearly overwhelmed by this medium workload, it is interesting to observe that the PR produces two results that are much better than all six of the control runs. Better performance and more meaningful comparisons are provided by the adaptive-evac-with-surge experiments, shown below.





The small workload is represented by the following execution script. Note that we override the default region size in order to avoid out-of-memory-during-evacuation failures that occurred in the default configuration due to a large number of objects of approximately 781 KB each.
~/github/jdk.adjust-initial-tlab-size/build/linux-x86_64-server-release/images/jdk/bin/java \
-XX:ActiveProcessorCount=2 \
-XX:+UnlockExperimentalVMOptions \
-XX:-ShenandoahPacing \
-XX:+AlwaysPreTouch -XX:+DisableExplicitGC -Xms4g -Xmx4g \
-XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational \
-XX:ShenandoahFullGCThreshold=1024 \
-XX:ShenandoahMinRegionSize=4M \
-Xlog:"gc*=info,ergo" \
-Xlog:safepoint=trace -Xlog:safepoint=debug -Xlog:safepoint=info \
-XX:+UnlockDiagnosticVMOptions \
-jar ~/github/heapothesys.fix-two-bugs/Extremem/src/main/java/extremem.jar \
-dDictionarySize=3000000 \
-dNumCustomers=30000 \
-dNumProducts=30000 \
-dCustomerThreads=500 \
-dCustomerPeriod=5s \
-dCustomerThinkTime=1s \
-dKeywordSearchCount=1 \
-dSelectionCriteriaCount=3 \
-dProductReviewLength=12 \
-dServerThreads=5 \
-dServerPeriod=10s \
-dProductNameLength=10 \
-dBrowsingHistoryQueueCount=5 \
-dSalesTransactionQueueCount=5 \
-dProductDescriptionLength=40 \
-dProductReplacementPeriod=60s \
-dProductReplacementCount=25 \
-dCustomerReplacementPeriod=60s \
-dCustomerReplacementCount=1500 \
-dBrowsingExpiration=1m \
-dPhasedUpdates=true \
-dPhasedUpdateInterval=180s \
-dSimulationDuration=25m \
-dResponseTimeMeasurements=100000 \
>$t.genshen.MaxRSWby8-TLABisRSBby128.small.overrides.out 2>$t.genshen.MaxRSWby8-TLABisRSBby128.small.overrides.err &
job_pid=$!
sleep 1500
cpu_percent=$(ps -o cputime -o etime -p $job_pid)
rss_kb=$(ps -o rss= -p $job_pid)
rss_mb=$((rss_kb / 1024))
wait $job_pid
echo "RSS: $rss_mb MB" >>$t.genshen.MaxRSWby8-TLABisRSBby128.small.overrides.out
echo "$cpu_percent" >>$t.genshen.MaxRSWby8-TLABisRSBby128.small.overrides.out
gzip $t.genshen.MaxRSWby8-TLABisRSBby128.small.overrides.out $t.genshen.MaxRSWby8-TLABisRSBby128.small.overrides.err
The medium workload is represented by this execution script:
~/github/jdk.adaptive-evac-with-surge/build/linux-x86_64-server-release/images/jdk/bin/java \
-XX:+UnlockExperimentalVMOptions \
-XX:-ShenandoahPacing \
-XX:+AlwaysPreTouch -XX:+DisableExplicitGC -Xms31g -Xmx31g \
-XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational \
-XX:ShenandoahFullGCThreshold=1024 \
-XX:ShenandoahGuaranteedOldGCInterval=0 \
-XX:ShenandoahGuaranteedYoungGCInterval=0 \
-Xlog:"gc*=info,ergo" \
-Xlog:safepoint=trace -Xlog:safepoint=debug -Xlog:safepoint=info \
-XX:+UnlockDiagnosticVMOptions \
-jar ~/github/heapothesys/Extremem/src/main/java/extremem.jar \
-dDictionarySize=3000000 \
-dNumCustomers=3600000 \
-dNumProducts=320000 \
-dCustomerThreads=500 \
-dAllowAnyMatch=false \
-dCustomerPeriod=5s \
-dCustomerThinkTime=1s \
-dKeywordSearchCount=2 \
-dSelectionCriteriaCount=2 \
-dProductReviewLength=12 \
-dServerThreads=5 \
-dServerPeriod=10s \
-dProductNameLength=10 \
-dBrowsingHistoryQueueCount=5 \
-dSalesTransactionQueueCount=5 \
-dProductDescriptionLength=512 \
-dProductReplacementPeriod=60s \
-dProductReplacementCount=25 \
-dCustomerReplacementPeriod=60s \
-dCustomerReplacementCount=1500 \
-dBrowsingExpiration=1m \
-dPhasedUpdates=true \
-dPhasedUpdateInterval=180s \
-dSimulationDuration=25m \
-dResponseTimeMeasurements=100000 \
>$t.retry.genshen.medium.adaptive-evac-with-surge.control.out 2>$t.retry.genshen.medium.adaptive-evac-with-surge.control.err &
job_pid=$!
sleep 1500
cpu_percent=$(ps -o cputime -o etime -p $job_pid)
rss_kb=$(ps -o rss= -p $job_pid)
rss_mb=$((rss_kb / 1024))
wait $job_pid
echo "RSS: $rss_mb MB" >>$t.retry.genshen.medium.adaptive-evac-with-surge.control.out
echo "$cpu_percent" >>$t.retry.genshen.medium.adaptive-evac-with-surge.control.out
gzip $t.retry.genshen.medium.adaptive-evac-with-surge.control.out $t.retry.genshen.medium.adaptive-evac-with-surge.control.err
An additional parameter was added to the adaptive-evac-with-surge configurations, to make the adaptive old GC trigger slightly more sensitive:
-XX:ShenandoahMinOldGenGrowthHeapPercent=2 \
The huge workload is represented by this execution script:
~/github/jdk.adjust-initial-tlab-size/build/linux-x86_64-server-release/images/jdk/bin/java \
-XX:ActiveProcessorCount=16 \
-XX:+UnlockExperimentalVMOptions \
-XX:-ShenandoahPacing \
-XX:+AlwaysPreTouch -XX:+DisableExplicitGC -Xms512g -Xmx512g \
-XX:+UseShenandoahGC -XX:ShenandoahGCMode=generational \
-XX:ShenandoahFullGCThreshold=1024 \
-XX:ShenandoahGuaranteedGCInterval=0 \
-XX:ShenandoahGuaranteedOldGCInterval=0 \
-XX:ShenandoahGuaranteedYoungGCInterval=0 \
-Xlog:"gc*=info,ergo" \
-Xlog:safepoint=trace -Xlog:safepoint=debug -Xlog:safepoint=info \
-XX:+UnlockDiagnosticVMOptions \
-jar ~/github/heapothesys/Extremem/src/main/java/extremem.jar \
-dDictionarySize=3000000 \
-dNumCustomers=210000000 \
-dNumProducts=18000000 \
-dCustomerThreads=2000 \
-dCustomerPeriod=2000ms \
-dCustomerThinkTime=300ms \
-dKeywordSearchCount=2 \
-dAllowAnyMatch=false \
-dSelectionCriteriaCount=3 \
-dProductReviewLength=96 \
-dBuyThreshold=0.5 \
-dSaveForLaterThreshold=0.15 \
-dBrowsingExpiration=5m \
-dServerThreads=20 \
-dServerPeriod=10s \
-dProductNameLength=6 \
-dProductDescriptionLength=70 \
-dBrowsingHistoryQueueCount=1 \
-dSalesTransactionQueueCount=1 \
-dProductReplacementPeriod=60s \
-dProductReplacementCount=25 \
-dCustomerReplacementPeriod=60s \
-dCustomerReplacementCount=150 \
-dBrowsingExpiration=1m \
-dSimulationDuration=25m \
-dResponseTimeMeasurements=100000 \
-dPhasedUpdates=true \
-dPhasedUpdateInterval=180s \
>$t.genshen.huge.MaxTLABisRSWby32-TLABisRSBisRSBby256.out 2>$t.genshen.huge.MaxTLABisRSWby32-TLABisRSBisRSBby256.err &
job_pid=$!
sleep 3000
cpu_percent=$(ps -o cputime -o etime -p $job_pid)
rss_kb=$(ps -o rss= -p $job_pid)
rss_mb=$((rss_kb / 1024))
wait $job_pid
echo "RSS: $rss_mb MB" >>$t.genshen.huge.MaxTLABisRSWby32-TLABisRSBisRSBby256.out
echo "$cpu_percent" >>$t.genshen.huge.MaxTLABisRSWby32-TLABisRSBisRSBby256.out
gzip $t.genshen.huge.MaxTLABisRSWby32-TLABisRSBisRSBby256.out $t.genshen.huge.MaxTLABisRSWby32-TLABisRSBisRSBby256.err
-------------
Commit messages:
- Fix white space
- Shrink default TLABSize in half
- MaxRSWby32-TLABisRSBby128
- Tidy up for review
- Merge branch 'adjust-initial-tlab-size' of https://github.com/kdnilsen/jdk into adjust-initial-tlab-size
- MaxTLABisRSWby1 TLABSisRSBby128 TLABAllocationWeight=35
- Add constraints for very large heap sizes with MaxTLABisRSWby8 TLABisRSBby128
- MaxTLABisRSWby1 TLABSisDefault TLABAllocationWeight=90
- MaxTLABisRSWby1-TLABisDefault
- MaxTLABisRSWby2 TLABisDefault
- ... and 40 more: https://git.openjdk.org/jdk/compare/92730945...e8b35937
Changes: https://git.openjdk.org/jdk/pull/25423/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25423&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8359947
Stats: 18 lines in 3 files changed: 12 ins; 0 del; 6 mod
Patch: https://git.openjdk.org/jdk/pull/25423.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/25423/head:pull/25423
PR: https://git.openjdk.org/jdk/pull/25423
More information about the shenandoah-dev
mailing list