RFR: 8361099: Shenandoah: Improve heap lock contention by using CAS for memory allocation
William Kemper
wkemper at openjdk.org
Mon Sep 22 08:46:41 UTC 2025
On Mon, 7 Jul 2025 19:56:30 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:
> Shenandoah always allocates memory under the heap lock, and we have observed heavy heap lock contention on the memory allocation path in performance analysis of some services in which we tried to adopt Shenandoah. This change proposes an optimization of the mutator memory allocation code path to reduce heap lock contention. At a very high level, here is how it works:
> * ShenandoahFreeSet holds N (default 13) ShenandoahHeapRegion* pointers used by mutator threads for regular object allocations. These are called shared regions, or directly allocatable regions, and are stored in a PaddedEnd data structure (a padded array).
> * Each mutator thread is assigned one of the directly allocatable regions. The thread tries to allocate in that region with a CAS atomic operation; if that fails, it tries up to 2 more consecutive regions in the padded array.
> * If a mutator thread fails after trying 3 directly allocatable regions, it will:
> * Take heap lock
> * Retire any directly allocatable regions that are ready to be retired.
> * Iterate over the mutator partition, select new directly allocatable regions, and store them in the padded array to replace any that were retired.
> * Satisfy mutator allocation request if possible.
>
>
> I'm not expecting a significant performance impact in most cases, since the contention on the heap lock is usually not high enough to cause performance issues. I have run many tests; here are some of them:
>
> 1. Dacapo lusearch test on EC2 host with 96 CPU cores:
> Openjdk TIP:
>
> [ec2-user at ip-172-31-42-91 jdk]$ ./master-jdk/bin/java -XX:-TieredCompilation -XX:+AlwaysPreTouch -Xms4G -Xmx4G -XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions -XX:+UnlockDiagnosticVMOptions -XX:-ShenandoahUncommit -XX:ShenandoahGCMode=generational -XX:+UseTLAB -jar ~/tools/dacapo/dacapo-23.11-MR2-chopin.jar -n 10 lusearch | grep "metered full smoothing"
> ===== DaCapo tail latency, metered full smoothing: 50% 131684 usec, 90% 200192 usec, 99% 211369 usec, 99.9% 212517 usec, 99.99% 213043 usec, max 235289 usec, measured over 524288 events =====
> ===== DaCapo tail latency, metered full smoothing: 50% 1568 usec, 90% 36101 usec, 99% 42172 usec, 99.9% 42928 usec, 99.99% 43100 usec, max 43305 usec, measured over 524288 events =====
> ===== DaCapo tail latency, metered full smoothing: 50% 52644 usec, 90% 124393 usec, 99% 137711 usec, 99.9% 139355 usec, 99.99% 139749 usec, max 146722 usec, measured over 524288 events ====...
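The CAS fast path described in the quoted bullets can be sketched roughly as follows. This is a minimal standalone illustration, not the actual HotSpot code; the names (`Region`, `try_alloc`, `fast_path_alloc`) and the word-sized offsets are hypothetical simplifications.

```cpp
#include <atomic>
#include <cstddef>

// Sketch of a shared region with a CAS bump pointer (illustrative only).
struct Region {
    std::atomic<size_t> top{0};  // current allocation offset
    size_t end = 0;              // region capacity

    // Try to carve `size` units out of this region with a CAS loop.
    bool try_alloc(size_t size, size_t* result) {
        size_t old_top = top.load(std::memory_order_relaxed);
        while (old_top + size <= end) {
            if (top.compare_exchange_weak(old_top, old_top + size,
                                          std::memory_order_relaxed)) {
                *result = old_top;  // start offset of the new allocation
                return true;
            }
            // CAS failed: old_top was refreshed with the current value; retry.
        }
        return false;  // region too full for this request
    }
};

// Fast path: probe up to 3 consecutive directly allocatable regions,
// starting at the slot assigned to this thread.
bool fast_path_alloc(Region* regions, size_t num_regions,
                     size_t start_idx, size_t size, size_t* result) {
    for (int probe = 0; probe < 3; probe++) {
        Region& r = regions[(start_idx + probe) % num_regions];
        if (r.try_alloc(size, result)) return true;
    }
    return false;  // slow path: take the heap lock, retire/refill regions
}
```

Note that `compare_exchange_weak` in a loop is the idiomatic shape here: a failed CAS simply reloads the observed top and retries, so threads sharing one region contend only on that region's bump pointer rather than on a global lock.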
It would be helpful to have a more detailed overview in the change log description to help reviewers get oriented with the changes.
-------------
PR Review: https://git.openjdk.org/jdk/pull/26171#pullrequestreview-3007644661
More information about the hotspot-gc-dev mailing list