RFR: 8361099: Shenandoah: Improve heap lock contention by using CAS for memory allocation [v5]
Kelvin Nilsen
kdnilsen at openjdk.org
Wed Nov 5 22:31:22 UTC 2025
On Mon, 27 Oct 2025 21:59:35 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:
>> Shenandoah always allocates memory under the heap lock. We have observed heavy heap lock contention on the memory allocation path in performance analysis of some services in which we tried to adopt Shenandoah. This change proposes an optimization of the mutator memory allocation path to reduce heap lock contention. At a very high level, here is how it works:
>> * ShenandoahFreeSet holds N (default 13) ShenandoahHeapRegion* pointers that are used by mutator threads for regular object allocations. These are called shared regions or directly allocatable regions, and they are stored in a PaddedEnd data structure (a padded array).
>> * Each mutator thread is assigned one of the directly allocatable regions. The thread tries to allocate in that region with a CAS atomic operation; if that fails, it tries up to 2 more consecutive directly allocatable regions in the padded array.
>> * If a mutator thread fails after trying 3 directly allocatable regions, it will:
>> * Take the heap lock.
>> * Try to retire the directly allocatable regions that are ready to retire.
>> * Iterate the Mutator partition and allocate new directly allocatable regions, storing them into the padded array slots whose regions were retired.
>> * Satisfy mutator allocation request if possible.
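(For context, a minimal standalone sketch of the fast path as described above. std::atomic stands in for the HotSpot primitives and all names are illustrative; this is not the PR's code.)

    #include <atomic>
    #include <cstddef>

    struct SharedRegion {
      std::atomic<char*> top{nullptr};   // current allocation pointer
      char* end = nullptr;               // one past the last allocatable byte
    };

    constexpr size_t kRegionCount = 13;  // default number of directly allocatable regions
    SharedRegion g_regions[kRegionCount];

    // Lock-free bump-pointer allocation within one shared region.
    char* cas_allocate(SharedRegion& r, size_t bytes) {
      char* old_top = r.top.load(std::memory_order_relaxed);
      while (old_top != nullptr && old_top + bytes <= r.end) {
        if (r.top.compare_exchange_weak(old_top, old_top + bytes)) {
          return old_top;                // success: we own [old_top, old_top + bytes)
        }
        // CAS failed: old_top now holds the refreshed value, so just retry.
      }
      return nullptr;                    // region exhausted or not installed
    }

    // Fast path: probe up to three consecutive shared regions before giving up.
    char* allocate_fast_path(size_t my_index, size_t bytes) {
      for (size_t probe = 0; probe < 3; probe++) {
        SharedRegion& r = g_regions[(my_index + probe) % kRegionCount];
        if (char* p = cas_allocate(r, bytes)) {
          return p;
        }
      }
      return nullptr;  // caller takes the heap lock, retires/refills regions, then retries
    }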
>>
>>
>> I'm not expecting a significant performance impact in most cases, since the contention on the heap lock is usually not high enough to cause a performance issue. I have done many tests; here are some of them:
>>
>> 1. Dacapo lusearch test on EC2 host with 96 CPU cores:
>> Openjdk TIP:
>>
>> [ec2-user at ip-172-31-42-91 jdk]$ ./master-jdk/bin/java -XX:-TieredCompilation -XX:+AlwaysPreTouch -Xms4G -Xmx4G -XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions -XX:+UnlockDiagnosticVMOptions -XX:-ShenandoahUncommit -XX:ShenandoahGCMode=generational -XX:+UseTLAB -jar ~/tools/dacapo/dacapo-23.11-MR2-chopin.jar -n 10 lusearch | grep "metered full smoothing"
>> ===== DaCapo tail latency, metered full smoothing: 50% 131684 usec, 90% 200192 usec, 99% 211369 usec, 99.9% 212517 usec, 99.99% 213043 usec, max 235289 usec, measured over 524288 events =====
>> ===== DaCapo tail latency, metered full smoothing: 50% 1568 usec, 90% 36101 usec, 99% 42172 usec, 99.9% 42928 usec, 99.99% 43100 usec, max 43305 usec, measured over 524288 events =====
>> ===== DaCapo tail latency, metered full smoothing: 50% 52644 usec, 90% 124393 usec, 99% 137711 usec, 99.9% 139355 usec, 99.99% 139749 usec, max 146722 ...
>
> Xiaolong Peng has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 135 commits:
>
> - Merge branch 'openjdk:master' into cas-alloc-1
> - Merge branch 'openjdk:master' into cas-alloc-1
> - format
> - Merge branch 'openjdk:master' into cas-alloc-1
> - Merge branch 'openjdk:master' into cas-alloc-1
> - Merge branch 'master' into cas-alloc-1
> - Move ShenandoahHeapRegionIterationClosure to shenandoahFreeSet.hpp
> - Merge branch 'openjdk:master' into cas-alloc-1
> - Fix errors caused by renaming ofAtomic to AtomicAccess
> - Merge branch 'openjdk:master' into cas-alloc-1
> - ... and 125 more: https://git.openjdk.org/jdk/compare/2f613911...e6bfef05
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 762:
> 760: #endif
> 761:
> 762: PaddedEnd<ShenandoahDirectlyAllocatableRegionAffinity::Affinity>* ShenandoahDirectlyAllocatableRegionAffinity::_affinity = nullptr;
IIUC, each of the DirectlyAllocatableRegions has an affinity to a particular thread. If some other thread randomly selects to allocate from this region, we change the region's affinity. Now, if the original thread tries another allocation, it will be redirected to a different randomly selected region.
It feels to me like this introduces more churn than necessary. It should be ok for multiple threads to have affinity to the same directly allocatable region so they can preserve "locality". The locality to be preserved is the value of region->top() and region->end(), in local caches. Sometimes, the multiple threads that are affiliated with a particular region will be running on the same core, which helps improve cache performance. If they are on different cores, that's no worse than the current implementation.
So my thinking is that each thread should maintain an affinity to a particular index within the directly allocatable region array. It should reaffiliate with a different region only if that region entry becomes nullptr, or in case its attempt to CAS allocate fails more than twice in a row (because of heavy contention on the value of top for that region).
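Roughly, the per-thread bookkeeping I have in mind looks something like the following standalone sketch. The names (t_index, note_allocation_result, etc.) are illustrative only, and std::atomic stands in for the HotSpot primitives; this is not proposed code, just the shape of the idea.

    #include <atomic>
    #include <cstdlib>

    constexpr int kRegionCount = 13;      // directly allocatable region count
    constexpr int kMaxCasFailures = 2;    // tolerated contention before moving on

    struct Slot { std::atomic<void*> region{nullptr}; };
    Slot g_slots[kRegionCount];

    thread_local int t_index = -1;        // this thread's preferred slot

    // Pick the slot to try first: keep the current affinity unless the slot
    // has been emptied (region retired and not yet replenished).
    int preferred_slot() {
      if (t_index < 0 || g_slots[t_index].region.load(std::memory_order_relaxed) == nullptr) {
        t_index = std::rand() % kRegionCount;
      }
      return t_index;
    }

    // Called after an allocation attempt: keep an uncontended slot, drop a
    // heavily contended one so the next allocation reaffiliates.
    void note_allocation_result(bool success, int cas_failures, int slot_used) {
      if (success && cas_failures <= kMaxCasFailures) {
        t_index = slot_used;
      } else if (cas_failures > kMaxCasFailures) {
        t_index = -1;
      }
    }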
Please clarify if I've misunderstood. Please explain rationale if there is good reason for preferring the design as it is currently implemented.
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 848:
> 846: HeapWord* ShenandoahFreeSet::allocate_with_affiliation(Iter& iterator, ShenandoahAffiliation affiliation, ShenandoahAllocRequest& req, bool& in_new_region) {
> 847: for (idx_t idx = iterator.current(); iterator.has_next(); idx = iterator.next()) {
> 848: ShenandoahHeapRegion* r = _heap->get_region(idx);
I wonder if we could refine this a little bit. When the region is moved into the "directly allocatable" set, wouldn't we remove it from its partition? Then, we wouldn't have to test for !r->reserved_for_direct_allocation() here because the iterator wouldn't produce it.
We could maybe replace this test with an assert that !r->reserved_for_direct_allocation().
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 1268:
> 1266: // If region is not completely free, the current [beg; end] is useless, and we may fast-forward. If we can extend
> 1267: // the existing range, we can exploit that certain regions are already known to be in the Mutator free set.
> 1268: ShenandoahHeapRegion* region = _heap->get_region(end);
Here also, if we remove the region from the partition when we make it directly allocatable, we would not need to rewrite this loop.
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 2070:
> 2068: // retired, the sum of used and capacities within regions that are still in the Mutator free partition may not match
> 2069: // my internally tracked values of used() and free().
> 2070: //TODO remove assert, it is not possible to mach since mutators may allocate on region w/o acquiring lock
It seems that if we are properly adjusting used() when we make a region directly allocatable, then this assert would still be valid. Why not?
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 2197:
> 2195: // Intentionally not using AtomicAccess::load, if a mutator see a stale region it will fail to allocate anyway.
> 2196: if ((r = shared_region._address) != nullptr && r->reserved_for_direct_allocation()) {
> 2197: obj = cas_allocate_in_for_mutator<IS_TLAB>(r, req, in_new_region);
There are multiple things that can go wrong here, and it's not clear how we distinguish between them:
1. Region r may be retirable and not have enough memory to satisfy req.
2. Region r may not be retirable and not have enough memory to satisfy req.
3. Region r may have sufficient memory to satisfy req, but we experience heavy contention (multiple CAS failures) while we attempt to allocate in r.
In my mental model, I think I might be inclined to behave as follows (see the sketch after this list):
1. Inside cas_allocate_in_for_mutator(), if we satisfy our allocation request, but only after "too many" (more than 2) CAS retries, do not update my thread-local _index variable. Leave it at its previously selected value. Otherwise, on successful allocation, update _index to represent the current slot. (We might have to pass the current slot in as an argument.)
2. If cas_allocate_in_for_mutator() returns nullptr, ask the question "is this region retirable?" If so, increment a local count of retirable_regions. If the incremented count of retirable_regions exceeds some threshold value (DirectlyAllocatableRegionCount / 4?) (and there may be more than this number of retirable regions, but we've seen at least this many), grab the heap lock to retire and replenish all retirable regions. Then restart this loop with i = 0;
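Something like the following standalone sketch is what I mean. The threshold value and the helper functions (cas_allocate_in, is_retirable, retire_and_refill_under_lock) are hypothetical stand-ins stubbed out so the sketch compiles on its own; they are not proposed signatures.

    #include <cstddef>

    constexpr size_t kRegionCount = 13;
    constexpr size_t kRetireThreshold = kRegionCount / 4;  // "enough" retirable regions

    // Hypothetical hooks, trivially stubbed so the sketch stands alone.
    void* cas_allocate_in(size_t /*slot*/, size_t /*bytes*/) { return nullptr; }
    bool  is_retirable(size_t /*slot*/)                      { return true;    }
    void  retire_and_refill_under_lock()                     {                 }

    void* allocate_with_retire_threshold(size_t start_slot, size_t bytes) {
      bool already_refilled = false;
      size_t retirable_seen = 0;
      size_t i = 0;
      while (i < kRegionCount) {
        size_t slot = (start_slot + i) % kRegionCount;
        if (void* p = cas_allocate_in(slot, bytes)) {
          return p;                                  // satisfied without the heap lock
        }
        if (is_retirable(slot) && ++retirable_seen > kRetireThreshold && !already_refilled) {
          // Enough nearly-full regions seen: take the heap lock once, retire
          // and replenish them all, then restart probing at slot 0.
          retire_and_refill_under_lock();
          already_refilled = true;
          retirable_seen = 0;
          start_slot = 0;
          i = 0;
          continue;
        }
        i++;
      }
      return nullptr;   // fall back to the regular locked slow path
    }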
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 2204:
> 2202: i++;
> 2203: }
> 2204: return obj;
I think obj always equals nullptr at this point. Seems the code would be easier to understand (and would depend less on effective compiler optimization) if we just made that explicit. Can we just say:
return nullptr?
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 2206:
> 2204: return obj;
> 2205: }
> 2206: template<bool IS_TLAB>
Add a blank line above this line?
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 2219:
> 2217: const uint max_probes = ShenandoahDirectAllocationMaxProbes;
> 2218: for (;;) {
> 2219: HeapWord* obj = nullptr;
This is written as an infinite loop, but I'm not sure it intends to iterate? It looks like there's no way for the loop body to get to a second iteration.
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 2235:
> 2233: start_idx = next_start_index;
> 2234: } else {
> 2235: // try to steal from other directly allocatable regions
Why not just let the enclosing infinite loop (for(;;)) iterate, which will call cas_allocate_single_for_mutator()? You could overwrite start_idx with a new value if you want. I'm not sure I agree with choosing start_idx + max_probes. It seems a more likely new value would be next_start_index.
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 2256:
> 2254: }
> 2255:
> 2256: // Explicit specializations
I'm sorry. I don't understand what is happening here with "explicit specializations". Maybe a comment to explain this less common C++ syntax would help here.
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 2280:
> 2278: }
> 2279:
> 2280: class DirectAllocatableRegionRefillClosure final : public ShenandoahHeapRegionIterationClosure {
I don't think we want to subclass ShenandoahHeapRegionIterationClosure here. That iterates over all 2000 regions. We only want to iterate over the 13 Directly allocatable regions. Maybe we don't even need/want a closure iterator here. We could just write a loop.
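In other words, something along these lines. This is a standalone sketch with placeholder types; acquire_fresh_region is a hypothetical helper that would pull a region from the Mutator partition under the heap lock.

    #include <cstddef>

    struct Region;                              // stand-in for ShenandoahHeapRegion
    struct DirectAllocationSlot { Region* region = nullptr; };

    constexpr size_t kRegionCount = 13;         // ShenandoahDirectlyAllocatableRegionCount
    DirectAllocationSlot g_slots[kRegionCount];

    // Hypothetical helper: pull a fresh region out of the Mutator partition.
    Region* acquire_fresh_region() { return nullptr; }

    // Refill the direct-allocation array: 13 iterations, not a walk over the
    // ~2000 regions of the whole heap.
    void refill_direct_allocation_slots() {
      for (size_t i = 0; i < kRegionCount; i++) {
        if (g_slots[i].region == nullptr) {
          g_slots[i].region = acquire_fresh_region();
        }
      }
    }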
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 2347:
> 2345: }
> 2346:
> 2347: bool heap_region_do(ShenandoahHeapRegion *r) {
Need a comment to explain what heap_region_do() is doing.
I think it is doing:
1. Returns true if there is no need to iterate over additional regions, otherwise returns false
2. If we've already found a region to hold the requested allocation and there are no more regions to retire, we return true.
3. If the region is trash or is free and we're not doing concurrent_weak_roots, we try to recycle it, setting its affiliation to young. (I'm not understanding how this works. What if the region had been placed in the Collector or OldCollector partitions?)
4. If the requested object has not yet been allocated and this region has sufficient memory to represent the object, allocate it here. (We do all this while holding the heap lock, so we can check available memory at entry to the function, and use the available memory further below within the function.)
5. If this region has more than min_tlab_size memory (after allocating _obj) and there's another region to be retired, we replace the next retirable region with this region.
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 2383:
> 2381: ShenandoahDirectAllocationRegion& shared_region = _direct_allocation_regions[_next_retire_eligible_region];
> 2382: assert(AtomicAccess::load(&shared_region._address) == nullptr, "Must have been released.");
> 2383: r->reserve_for_direct_allocation();
Here also, I wonder if we can simplify the protocol for assigning this region into the directly allocatable array.
Maybe the key idea could be to introduce a new volatile variable into ShenandoahHeapRegion known as top_for_cas. CAS allocations allocate by incrementing top_for_cas. Before removing a region from the directly allocatable set, we use a CAS to increase top_for_cas to end(). Before placing a region into the directly allocatable set (and while holding the heap lock), we copy top() to top_for_cas.
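A standalone sketch of that protocol, purely to illustrate the idea: std::atomic stands in for the volatile/HotSpot primitives, and an exchange stands in for the CAS-to-end() step; names and layout are illustrative, not proposed code.

    #include <atomic>
    #include <cstddef>

    struct Region {
      char* bottom = nullptr;
      char* end = nullptr;
      char* top = nullptr;                       // only updated under the heap lock
      std::atomic<char*> top_for_cas{nullptr};   // bumped lock-free by mutators

      // Lock-free allocation path used by mutators.
      char* allocate_atomic(size_t bytes) {
        char* cur = top_for_cas.load(std::memory_order_relaxed);
        while (cur != nullptr && cur + bytes <= end) {
          if (top_for_cas.compare_exchange_weak(cur, cur + bytes)) {
            return cur;
          }
        }
        return nullptr;
      }

      // Under the heap lock, when the region enters the directly allocatable
      // set: seed top_for_cas from top().
      void install_for_direct_allocation() {
        top_for_cas.store(top, std::memory_order_release);
      }

      // Under the heap lock, before removing the region from the set: push
      // top_for_cas up to end() so concurrent lock-free allocations fail,
      // then fold the lock-free progress back into top.
      void retire_from_direct_allocation() {
        char* final_top = top_for_cas.exchange(end);
        if (final_top != nullptr) {
          top = final_top;
        }
      }
    };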
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 2418:
> 2416: iterate_regions_for_alloc<true, false>(&cl, false);
> 2417: if (cl._next_region_with_sufficient_mem != ShenandoahDirectlyAllocatableRegionCount && obj == nullptr) {
> 2418: new_start_index = cl._next_region_with_sufficient_mem;
I think an accessor method is preferred over direct access to a "private" variable.
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.hpp line 263:
> 261: _capacity[int(which_partition)] = value;
> 262: _available[int(which_partition)] = value - _used[int(which_partition)];
> 263: AtomicAccess::store(_capacity + int(which_partition), value);
Shouldn't require AtomicAccess here, because we hold the heap lock.
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.hpp line 271:
> 269: _used[int(which_partition)] = value;
> 270: _available[int(which_partition)] = _capacity[int(which_partition)] - value;
> 271: AtomicAccess::store(_used + int(which_partition), value);
Also here, should not require AtomicAccess.
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.hpp line 464:
> 462: HeapWord* cas_allocate_in_for_mutator(ShenandoahHeapRegion* region, ShenandoahAllocRequest &req, bool &in_new_region);
> 463:
> 464: bool try_allocate_directly_allocatable_regions(uint start_index,
Can we have a block comment describing what this function does, including its effect on the by-reference (output) parameters?
src/hotspot/share/gc/shenandoah/shenandoahHeapRegion.cpp line 363:
> 361: }
> 362:
> 363: void ShenandoahHeapRegion::reset_alloc_metadata() {
Do we need to make these atomic because we now increment asynchronously from within mutator CAS allocations? Before, they were only adjusted while holding heap lock? I'm wondering if add-with-fetch() or CAS() would be more/less efficient than AtomicAccess::stores. Can we test the tradeoffs?
src/hotspot/share/gc/shenandoah/shenandoahHeapRegion.inline.hpp line 131:
> 129: }
> 130:
> 131: HeapWord* ShenandoahHeapRegion::allocate_atomic(size_t size, const ShenandoahAllocRequest& req) {
As mentioned above, this is maybe where we would want to set the thread-local _index variable, and this is also where we might want to count how many times we retry the try_allocate() request before we succeed.
The point is that if we have multiple try_allocate() CAS failures, that means this region is heavily contended, and we don't want to make this our new "starting index".
src/hotspot/share/gc/shenandoah/shenandoah_globals.hpp line 559:
> 557: range(1, 128) \
> 558: \
> 559: product(uintx, ShenandoahDirectAllocationMaxProbes, 3, EXPERIMENTAL, \
I think we found that setting DirectAllocationMaxProbes to equal ShenandoahDirectlyAllocatableRegionCount works "best". I'm inclined to remove this parameter entirely, as that somewhat simplifies the implementation. If you think we want to keep it, can you explain the rationale? Would we change the default value?
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2495713187
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2495759091
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2495768272
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2495789372
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2495974976
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2495837037
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2495793783
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2496020674
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2496020913
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2496084186
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2496124916
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2496344767
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2496359448
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2496109695
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2496045832
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2496047006
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2496050940
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2496384148
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2496395761
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2496399061