RFR: 8361099: Shenandoah: Improve heap lock contention by using CAS for memory allocation [v14]

Wed Jan 7 00:36:20 UTC 2026

On Wed, 3 Dec 2025 01:09:34 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:

>> src/hotspot/share/gc/shenandoah/shenandoahAllocator.cpp line 100:
>> 
>>> 98: HeapWord* ShenandoahAllocator<ALLOC_PARTITION>::attempt_allocation(ShenandoahAllocRequest& req, bool& in_new_region) {
>>> 99:   if (_alloc_region_count == 0u) {
>>> 100:     ShenandoahHeapLocker locker(ShenandoahHeap::heap()->lock(), _yield_to_safepoint);
>> 
>> Looking for more comments here as well.  What does it mean that _alloc_region_count == 0?  Does this mean we have not yet initialized the directly allocatable regions (following a particular GC event)?  Or does it mean that we have depleted all of the available regions and we are out of memory? In the first case, it seems we would want to replenish our supply of directly allocatable regions while we hold the GC lock.  In the second case, it seems there's really no value in even attempting a slow allocation.  (If we were unable to refresh our directly allocatable regions, then it will not find allocatable memory even on the other side of the heap lock...)
>
> I'll add comments on this, _alloc_region_count == 0 means we don't want to use any shared alloc region, it will always allocate with a heap lock, ideally the performance should be same as before, so it always simply find a region with enough space and allocate in the region.

Put the comments describing functions in the .hpp file, where they are currently.  But we need to enhance those comments.

>> src/hotspot/share/gc/shenandoah/shenandoahAllocator.cpp line 158:
>> 
>>> 156:   if (r != nullptr) {
>>> 157:     bool ready_for_retire = false;
>>> 158:     obj = atomic_allocate_in(r, false, req, in_new_region, ready_for_retire);
>> 
>> Not sure why we use atomic_allocate_in() here.  We hold the heap lock so we don't need to use atomic operations.  
>> We should clarify with comments.
>
> It is not really necessary to `atomic_allocate_in` here, but I wanted reuse some of the codes in atomic_allocate_in, we can discuss this later, I can change it back to non-atomic version.

Would prefer not to use the atomic_allocate code here.  If you want to reuse code, maybe you can refactor allocate_in with an <ATOMIC> template argument.

I notice that this PR makes lots of ShenandoahHeapRegion variables volatile: _age, _youth, _top, _tlab_allocs, _gclab_allocs, _plab_allocs.  That will cause less efficient code to be generated whenever we are accessing this data from "behind" the GC heap lock, which should be rare, I guess.

I raised some concerns/issues about the race that happens when we move a region from the directly-allocatable set into the global-heap-lock-protected ShenandoahFreeSet partition.

Here's the scenario that I'm concerned about:

1. A mutator obtains pointer to directly allocatable region R
2. A second mutator performs a refresh, moving region R out of directly allocatable set (for whatever reason)
3. Region R is now eligible to satisfy allocations from behind the global heap lock
4. Some third mutator thread acquires the heap lock and fetches top for region $
5. The first mutator performs its allocation within the same region R, not recognizing a CAS conflict
6. This third mutator allocates from region R at top, without using CAS.  So both mutators think they own the same object

I think this is not a problem if the "only" reason at step 2 above that we move region R out of directly allocatable set is because R is ready to be retired.  In that case, there will be no subsequent heap-locked allocations in regions R.  However, I anticipate the day in not-too-distant future when we will want to refresh regions even when they are not ready to be retired.  Specifically, as we move rebuild-freeset out of safepoints, we will want to refresh regions before we acquire heap-lock to do rebuild, with the goal of making sure there is sufficient directly allocatable memory available that no mutator will be stalled because it needs to allocate during the time that the heap remains locked for the rebuild operation.

So I suppose that if we always use atomic_allocate() even for allocations that happen while holding the heap lock, we won't have this problem.  If we decide to keep this architecture, there should be comments explaining why we are doing it this way.  (I am not real happy that we have to "pay the cost" of CAS in addition to paying the cost of global heap lock, but I think these allocations should be very rare.  It seems this would only come up if, for example, a mutator wanted to allocate an object that is 1/2 the heap region size, and none of the directly allocatable regions have that much available memory.)

My original proposal was to have a volatile_top which is used by CAS allocation and a nonvolatile_top that is used by heap-lock allocation.  When we make a region directly allocatable, we copy its nonvolatile_top to the volatile_top.  When we take a directly allocatable region and move it into the heap-locked freeset, we use CAS to set its volatile_top to end before we place the region into the freeset partition, assigning to nonvolatile_top the value held in volatile_top before the CAS operation.

Whatever solution is used for this needs to be documented in the code.  Feel free to copy and paste from this github comment.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2665714073
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2666236427