RFR: 8361099: Shenandoah: Improve heap lock contention by using CAS for memory allocation
Xiaolong Peng
xpeng at openjdk.org
Mon Sep 22 08:46:32 UTC 2025
Shenandoah always allocates memory under the heap lock; we have observed heavy heap lock contention on the memory allocation path while doing performance analysis of some services in which we tried to adopt Shenandoah. This change proposes an optimization of the mutator memory allocation path to reduce heap lock contention. At a very high level, here is how it works:
* ShenandoahFreeSet holds N (default 13) ShenandoahHeapRegion* entries that are used by mutator threads for regular object allocations. These are called shared regions, or directly allocatable regions, and are stored in a PaddedEnd data structure (a padded array).
* Each mutator thread is assigned one of the directly allocatable regions. The thread tries to allocate from that region with a CAS atomic operation; if that fails, it tries up to 2 more consecutive entries in the padded array of directly allocatable regions.
* If a mutator thread fails after trying 3 directly allocatable regions, it will:
* Take the heap lock.
* Retire the directly allocatable regions that are ready to retire.
* Iterate the mutator partition, allocate new directly allocatable regions, and store them into the padded array to replace any that were retired.
* Satisfy the mutator allocation request if possible.
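The fast path above can be sketched as a CAS bump-pointer allocator over a padded array of shared regions. This is an illustrative standalone model, not the actual HotSpot code: `Region`, `g_shared`, `fast_alloc`, `kNumShared`, `kProbe`, and the fixed region size are all hypothetical names and simplifications (the real implementation works on `ShenandoahHeapRegion*` stored in `PaddedEnd`, and the slow path under the heap lock is only indicated in a comment here):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr size_t kRegionSize = 1 << 20;  // illustrative region capacity (bytes)
constexpr int    kNumShared  = 13;       // default number of shared regions
constexpr int    kProbe      = 3;        // slots a thread tries before locking

// alignas(64) pads each region to its own cache line, playing the role of
// PaddedEnd<> in the real code: it keeps concurrent CAS traffic on different
// regions from false-sharing a cache line.
struct alignas(64) Region {
  std::atomic<size_t> top{0};            // bump pointer (offset into region)
  size_t end{kRegionSize};

  // CAS bump-pointer allocation: returns the start offset of the claimed
  // chunk, or SIZE_MAX when the region cannot satisfy the request.
  size_t allocate(size_t bytes) {
    size_t cur = top.load(std::memory_order_relaxed);
    while (cur + bytes <= end) {
      // On failure, compare_exchange_weak reloads `cur` and we retry.
      if (top.compare_exchange_weak(cur, cur + bytes,
                                    std::memory_order_relaxed)) {
        return cur;                      // success: we own [cur, cur + bytes)
      }
    }
    return SIZE_MAX;                     // region (effectively) full
  }
};

Region g_shared[kNumShared];

// Each thread starts at a slot derived from its id and probes up to kProbe
// consecutive slots, mirroring the "try 3 directly allocatable regions" rule.
size_t fast_alloc(size_t thread_id, size_t bytes) {
  for (int i = 0; i < kProbe; i++) {
    size_t off = g_shared[(thread_id + i) % kNumShared].allocate(bytes);
    if (off != SIZE_MAX) return off;
  }
  // Slow path (not modeled): take the heap lock, retire full regions,
  // refill the padded array from the mutator partition, then retry.
  return SIZE_MAX;
}
```

Only the lock-free fast path is modeled; the point of the sketch is that contended threads race on per-region atomics spread across cache lines instead of serializing on one global lock.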
I'm not expecting a significant performance impact in most cases, since the contention on the heap lock is usually not high enough to cause performance issues. I have run many tests; here are some of them:
1. Dacapo lusearch test on EC2 host with 96 CPU cores:
OpenJDK tip:
[ec2-user at ip-172-31-42-91 jdk]$ ./master-jdk/bin/java -XX:-TieredCompilation -XX:+AlwaysPreTouch -Xms4G -Xmx4G -XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions -XX:+UnlockDiagnosticVMOptions -XX:-ShenandoahUncommit -XX:ShenandoahGCMode=generational -XX:+UseTLAB -jar ~/tools/dacapo/dacapo-23.11-MR2-chopin.jar -n 10 lusearch | grep "metered full smoothing"
===== DaCapo tail latency, metered full smoothing: 50% 131684 usec, 90% 200192 usec, 99% 211369 usec, 99.9% 212517 usec, 99.99% 213043 usec, max 235289 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 1568 usec, 90% 36101 usec, 99% 42172 usec, 99.9% 42928 usec, 99.99% 43100 usec, max 43305 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 52644 usec, 90% 124393 usec, 99% 137711 usec, 99.9% 139355 usec, 99.99% 139749 usec, max 146722 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 28623 usec, 90% 67126 usec, 99% 73916 usec, 99.9% 75175 usec, 99.99% 75462 usec, max 79066 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 41 usec, 90% 9428 usec, 99% 21770 usec, 99.9% 23106 usec, 99.99% 23443 usec, max 24179 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 44215 usec, 90% 89790 usec, 99% 96317 usec, 99.9% 97160 usec, 99.99% 97447 usec, max 101699 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 5760 usec, 90% 55041 usec, 99% 66487 usec, 99.9% 67538 usec, 99.99% 67993 usec, max 75581 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 44 usec, 90% 10890 usec, 99% 23198 usec, 99.9% 24500 usec, 99.99% 24880 usec, max 25787 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 173 usec, 90% 28241 usec, 99% 44742 usec, 99.9% 46328 usec, 99.99% 46815 usec, max 51731 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 3386 usec, 90% 52216 usec, 99% 65358 usec, 99.9% 67119 usec, 99.99% 67612 usec, max 75257 usec, measured over 524288 events =====
With CAS allocator:
[ec2-user at ip-172-31-42-91 jdk]$ ./cas-alloc-jdk/bin/java -XX:-TieredCompilation -XX:+AlwaysPreTouch -Xms4G -Xmx4G -XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions -XX:+UnlockDiagnosticVMOptions -XX:-ShenandoahUncommit -XX:ShenandoahGCMode=generational -XX:+UseTLAB -jar ~/tools/dacapo/dacapo-23.11-MR2-chopin.jar -n 10 lusearch | grep "metered full smoothing"
===== DaCapo tail latency, metered full smoothing: 50% 116096 usec, 90% 181392 usec, 99% 194708 usec, 99.9% 197017 usec, 99.99% 198255 usec, max 209449 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 26 usec, 90% 205 usec, 99% 2192 usec, 99.9% 5521 usec, 99.99% 9909 usec, max 20341 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 30 usec, 90% 206 usec, 99% 819 usec, 99.9% 6330 usec, 99.99% 18174 usec, max 26832 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 26 usec, 90% 188 usec, 99% 770 usec, 99.9% 9462 usec, 99.99% 14628 usec, max 20693 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 26 usec, 90% 191 usec, 99% 793 usec, 99.9% 9579 usec, 99.99% 15123 usec, max 18807 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 30 usec, 90% 211 usec, 99% 654 usec, 99.9% 5401 usec, 99.99% 14597 usec, max 30266 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 27 usec, 90% 196 usec, 99% 794 usec, 99.9% 4499 usec, 99.99% 21784 usec, max 32097 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 32 usec, 90% 236 usec, 99% 546 usec, 99.9% 4874 usec, 99.99% 17115 usec, max 24615 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 30 usec, 90% 202 usec, 99% 698 usec, 99.9% 7301 usec, 99.99% 12966 usec, max 14096 usec, measured over 524288 events =====
===== DaCapo tail latency, metered full smoothing: 50% 26 usec, 90% 193 usec, 99% 740 usec, 99.9% 5625 usec, 99.99% 16836 usec, max 22915 usec, measured over 524288 events =====
If I run the same tests on a smaller instance with fewer CPU cores, the improvement is much smaller.
2. SPECjbb2015 on EC2 with 2 cores, 4G heap:
OpenJDK tip:
jbb2015.result.category = SPECjbb2015-Composite
jbb2015.result.group.count = 1
jbb2015.result.metric.max-jOPS = 2801
jbb2015.result.metric.critical-jOPS = 1463
jbb2015.result.SLA-10000-jOPS = 967
jbb2015.result.SLA-25000-jOPS = 1260
jbb2015.result.SLA-50000-jOPS = 1532
jbb2015.result.SLA-75000-jOPS = 1861
jbb2015.result.SLA-100000-jOPS = 1927
With CAS allocator:
jbb2015.result.category = SPECjbb2015-Composite
jbb2015.result.group.count = 1
jbb2015.result.metric.max-jOPS = 2965
jbb2015.result.metric.critical-jOPS = 1487
jbb2015.result.SLA-10000-jOPS = 906
jbb2015.result.SLA-25000-jOPS = 1197
jbb2015.result.SLA-50000-jOPS = 1692
jbb2015.result.SLA-75000-jOPS = 1867
jbb2015.result.SLA-100000-jOPS = 2125
3. SPECjbb2015 on EC2 with 16 cores, 31G heap:
OpenJDK tip:
jbb2015.result.category = SPECjbb2015-Composite
jbb2015.result.group.count = 1
jbb2015.result.metric.max-jOPS = 26740
jbb2015.result.metric.critical-jOPS = 22588
jbb2015.result.SLA-10000-jOPS = 15998
jbb2015.result.SLA-25000-jOPS = 23140
jbb2015.result.SLA-50000-jOPS = 24454
jbb2015.result.SLA-75000-jOPS = 25094
jbb2015.result.SLA-100000-jOPS = 25882
With CAS allocator:
jbb2015.result.category = SPECjbb2015-Composite
jbb2015.result.group.count = 1
jbb2015.result.metric.max-jOPS = 28454
jbb2015.result.metric.critical-jOPS = 23763
jbb2015.result.SLA-10000-jOPS = 16284
jbb2015.result.SLA-25000-jOPS = 24511
jbb2015.result.SLA-50000-jOPS = 25882
jbb2015.result.SLA-75000-jOPS = 26911
jbb2015.result.SLA-100000-jOPS = 27254
### Other tests:
- [x] All Shenandoah jtreg tests
-------------
Commit messages:
- Merge branch 'master' into cas-alloc-1
- Move ShenandoahHeapRegionIterationClosure to shenandoahFreeSet.hpp
- Merge branch 'openjdk:master' into cas-alloc-1
- Fix errors caused by renaming ofAtomic to AtomicAccess
- Merge branch 'openjdk:master' into cas-alloc-1
- Remove unused flag
- Merge branch 'openjdk:master' into cas-alloc-1
- Merge branch 'cas-alloc-1' into cas-alloc
- Merge branch 'cas-alloc-1' of https://github.com/pengxiaolong/jdk into cas-alloc-1
- Merge branch 'master' into cas-alloc-1
- ... and 120 more: https://git.openjdk.org/jdk/compare/44454633...087b54fb
Changes: https://git.openjdk.org/jdk/pull/26171/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=26171&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8361099
Stats: 735 lines in 16 files changed: 674 ins; 7 del; 54 mod
Patch: https://git.openjdk.org/jdk/pull/26171.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/26171/head:pull/26171
PR: https://git.openjdk.org/jdk/pull/26171
More information about the hotspot-gc-dev mailing list