RFR: 8361099: Shenandoah: Improve heap lock contention by using CAS for memory allocation [v14]

Wed Dec 3 01:00:11 UTC 2025

On Tue, 2 Dec 2025 18:40:16 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:

>> Shenandoah always allocates memory with heap lock, we have observed heavy heap lock contention on memory allocation path in performance analysis of some service in which we tried to adopt Shenandoah. This change is to propose an optimization for the code path of memory allocation to improve heap lock contention, along with the optimization, a better OOD is also done to Shenandoah memory allocation to reuse the majority of the code:
>> 
>> * ShenandoahAllocator: base class the allocator, most of the allocation code is in the class.
>> * ShenandoahMutatorAllocator: allocator for mutator, inherit from ShenandoahAllocator, only override methods `alloc_start_index`, `verify`, `_alloc_region_count` and  `_yield_to_safepoint` to customize the allocator for mutator.
>> * ShenandoahCollectorAllocator: allocator for collector allocation in Collector partition, similar to ShenandoahMutatorAllocator, only few lines of code to customize the allocator for Collector. 
>> * ShenandoahOldCollectorAllocator:  allocator for mutator collector allocation in OldCollector partition, it doesn't inherit the logic from ShenandoahAllocator for now, the `allocate` method has been overridden to delegate to `FreeSet::allocate_for_collector` due to the special allocation considerations for `plab` in old gen. We will rewrite this part later and move the code out of `FreeSet::allocate_for_collector`
>> 
>> I'm not expecting significant performance impact for most of the cases since in most case the contention on heap lock it not high enough to cause performance issue, but in some cases it may improve the latency/performance:
>> 
>> 1. Dacapo lusearch test on EC2 host with 96 CPU cores, p90 is improved from 500+us to less than 150us, p99 from 1000+us to ~200us. 
>> 
>> java -XX:-TieredCompilation -XX:+AlwaysPreTouch -Xms31G -Xmx31G -XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions -XX:+UnlockDiagnosticVMOptions  -XX:-ShenandoahUncommit -XX:ShenandoahGCMode=generational  -XX:+UseTLAB -jar ~/tools/dacapo/dacapo-23.11-MR2-chopin.jar  -n 10 lusearch  | grep "metered full smoothing"
>> 
>> 
>> Openjdk TIP:
>> 
>> ===== DaCapo tail latency, metered full smoothing: 50% 241098 usec, 90% 402356 usec, 99% 411065 usec, 99.9% 411763 usec, 99.99% 415531 usec, max 428584 usec, measured over 524288 events =====
>> ===== DaCapo tail latency, metered full smoothing: 50% 902 usec, 90% 3713 usec, 99% 5898 usec, 99.9% 6488 usec, 99.99% 7081 usec, max 8048 usec, measured over 524288 events =====
>> ===== DaCapo tail laten...
>
> Xiaolong Peng has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 255 commits:
> 
>  - Add missing header for ShenandoahFreeSetPartitionId
>  - Declare ShenandoahFreeSetPartitionId as enum instead of enum class
>  - Fix a typo
>  - Remove unnecessary `enum class ShenandoahFreeSetPartitionId : uint8_t` in shenandoahAllocator.php
>  - Make ShenandoahAllocator as template class to make compiled code more efficient for each alloc partition
>  - Port the fix of JDK-8372566
>  - Merge branch 'master' into cas-alloc-1
>  - Merge remote-tracking branch 'origin/master' into cas-alloc-1
>  - Remove junk code
>  - Remove unnecessary change and tidy up
>  - ... and 245 more: https://git.openjdk.org/jdk/compare/79e99bb0...7980c039

I'm still reading through the code, but have these comments far...

src/hotspot/share/gc/shenandoah/shenandoahAllocator.cpp line 80:

> 78:           break;
> 79:         case ShenandoahFreeSetPartitionId::OldCollector:
> 80:           _free_set->recompute_total_used</* UsedByMutatorChanged */ true,

These parameters seem overly conservative.  Can we distinguish what needs to be recomputed?
Normally, OldCollector allocation does not change UsedByMutator or UsedByCollector.  It will only change MutatorEmpties if we did flip_to_old.  It will normally not changed OldCollectorEmpties (unless it flips multiple mutator to OldCollector.  if might flip one region from mutator, but that region will not be empty after we allocate fro it...

src/hotspot/share/gc/shenandoah/shenandoahAllocator.cpp line 100:

> 98: HeapWord* ShenandoahAllocator<ALLOC_PARTITION>::attempt_allocation(ShenandoahAllocRequest& req, bool& in_new_region) {
> 99:   if (_alloc_region_count == 0u) {
> 100:     ShenandoahHeapLocker locker(ShenandoahHeap::heap()->lock(), _yield_to_safepoint);

Looking for more comments here as well.  What does it mean that _alloc_region_count == 0?  Does this mean we have not yet initialized the directly allocatable regions (following a particular GC event)?  Or does it mean that we have depleted all of the available regions and we are out of memory? In the first case, it seems we would want to replenish our supply of directly allocatable regions while we hold the GC lock.  In the second case, it seems there's really no value in even attempting a slow allocation.  (If we were unable to refresh our directly allocatable regions, then it will not find allocatable memory even on the other side of the heap lock...)

src/hotspot/share/gc/shenandoah/shenandoahAllocator.cpp line 121:

> 119: template <ShenandoahFreeSetPartitionId ALLOC_PARTITION>
> 120: HeapWord* ShenandoahAllocator<ALLOC_PARTITION>::attempt_allocation_slow(ShenandoahAllocRequest& req, bool& in_new_region) {
> 121:   ShenandoahHeapLocker locker(ShenandoahHeap::heap()->lock(), _yield_to_safepoint);

I think this is an error.  We don't want to acquire the lock here.  We also don't want to introduce accounting_update here.  Instead, I think these belong before line 130, in case we need to refresh the alloc regions.

src/hotspot/share/gc/shenandoah/shenandoahAllocator.cpp line 155:

> 153:   size_t min_free_words = req.is_lab_alloc() ? req.min_size() : req.size();
> 154:   ShenandoahHeapRegion* r = _free_set->find_heap_region_for_allocation(ALLOC_PARTITION, min_free_words, req.is_lab_alloc(), in_new_region);
> 155:   // The region returned by find_heap_region_for_allocation must have sufficient free space for the allocation it if it is not nullptr

comment has an extra "it"

src/hotspot/share/gc/shenandoah/shenandoahAllocator.cpp line 158:

> 156:   if (r != nullptr) {
> 157:     bool ready_for_retire = false;
> 158:     obj = atomic_allocate_in(r, false, req, in_new_region, ready_for_retire);

Not sure why we use atomic_allocate_in() here.  We hold the heap lock so we don't need to use atomic operations.  
We should clarify with comments.

src/hotspot/share/gc/shenandoah/shenandoahAllocator.hpp line 69:

> 67: 
> 68:   // Attempt to allocate in shared alloc regions, the allocation attempt is done with atomic operation w/o
> 69:   // holding heap lock.

I would rewrite comment:
// Attempt to allocate in a shared alloc region using atomic operation without holding the heap lock.
// Returns nullptr and overwrites regions_ready_for_refresh with the number of shared alloc regions that are ready
// to be retired if it is unable to satisfy the allocation request from the existing shared alloc regions.

-------------

PR Review: https://git.openjdk.org/jdk/pull/26171#pullrequestreview-3532274683
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2582914041
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2582950768
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2582966259
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2582922617
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2582936150
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2582959962