RFR: 8361099: Shenandoah: Improve heap lock contention by using CAS for memory allocation
Xiaolong Peng
xpeng at openjdk.org
Mon Sep 22 08:46:43 UTC 2025
On Tue, 8 Jul 2025 17:55:33 GMT, Kelvin Nilsen <kdnilsen at openjdk.org> wrote:
>> Shenandoah always allocates memory under the heap lock. We have observed heavy heap lock contention on the memory allocation path in performance analysis of some services where we tried to adopt Shenandoah. This change proposes an optimization of the mutator memory allocation code path to reduce heap lock contention. At a very high level, here is how it works:
>> * ShenandoahFreeSet holds N (default 13) ShenandoahHeapRegion* pointers which are used by mutator threads for regular object allocations. These are called shared regions or directly allocatable regions, and they are stored in a PaddedEnd data structure (a padded array).
>> * Each mutator thread is assigned one of the directly allocatable regions; the thread tries to allocate in that region with a CAS atomic operation, and if that fails it tries 2 more consecutive directly allocatable regions in the padded array.
>> * If a mutator thread still fails after trying 3 directly allocatable regions, it will:
>> * Take the heap lock.
>> * Try to retire the directly allocatable regions which are ready to be retired.
>> * Iterate the mutator partition and allocate new directly allocatable regions, storing them in the padded array, if any of the current ones had to be retired.
>> * Satisfy the mutator allocation request if possible.
>>
>>
>> I'm not expecting a significant performance impact in most cases, since the heap lock contention is usually not high enough to cause performance issues. I have run many tests; here are some of them:
>>
>> 1. Dacapo lusearch test on EC2 host with 96 CPU cores:
>> Openjdk TIP:
>>
>> [ec2-user at ip-172-31-42-91 jdk]$ ./master-jdk/bin/java -XX:-TieredCompilation -XX:+AlwaysPreTouch -Xms4G -Xmx4G -XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions -XX:+UnlockDiagnosticVMOptions -XX:-ShenandoahUncommit -XX:ShenandoahGCMode=generational -XX:+UseTLAB -jar ~/tools/dacapo/dacapo-23.11-MR2-chopin.jar -n 10 lusearch | grep "metered full smoothing"
>> ===== DaCapo tail latency, metered full smoothing: 50% 131684 usec, 90% 200192 usec, 99% 211369 usec, 99.9% 212517 usec, 99.99% 213043 usec, max 235289 usec, measured over 524288 events =====
>> ===== DaCapo tail latency, metered full smoothing: 50% 1568 usec, 90% 36101 usec, 99% 42172 usec, 99.9% 42928 usec, 99.99% 43100 usec, max 43305 usec, measured over 524288 events =====
>> ===== DaCapo tail latency, metered full smoothing: 50% 52644 usec, 90% 124393 usec, 99% 137711 usec, 99.9% 139355 usec, 99.99% 139749 usec, max 146722 ...
>
> src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 246:
>
>> 244: _leftmosts_empty[int(ShenandoahFreeSetPartitionId::Mutator)] = mutator_leftmost_empty;
>> 245: _rightmosts_empty[int(ShenandoahFreeSetPartitionId::Mutator)] = mutator_rightmost_empty;
>> 246:
>
> We have the heap lock here, so should not need to use atomic store operations. Atomic operations have a performance penalty that I think we want to avoid.
Yes, we do hold the heap lock here. But mutators may also update used/available after an allocation without the heap lock, so the updates for the mutator partition need to be done with atomic operations, though not all of them do.
I'll revisit this part and remove the atomic operations that are not needed.
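Roughly, the split I have in mind looks like this (just a sketch; `_used` and the call sites are simplified stand-ins for the actual fields in shenandoahFreeSet.cpp):

```c++
// Rebuild path: runs under the heap lock, so plain stores are sufficient here.
_leftmosts_empty[int(ShenandoahFreeSetPartitionId::Mutator)]  = mutator_leftmost_empty;
_rightmosts_empty[int(ShenandoahFreeSetPartitionId::Mutator)] = mutator_rightmost_empty;

// Mutator path: the used counter of the Mutator partition is bumped after a
// lock-free CAS allocation, so this particular update does need to be atomic.
Atomic::add(&_used[int(ShenandoahFreeSetPartitionId::Mutator)], bytes);
```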
> src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 247:
>
>> 245: _rightmosts_empty[int(ShenandoahFreeSetPartitionId::Mutator)] = mutator_rightmost_empty;
>> 246:
>> 247: Atomic::store(_region_counts + int(ShenandoahFreeSetPartitionId::Mutator), mutator_region_count);
>
> If we do need to use Atomic operations, would prefer &_region_counts[int(ShenandoahFreeSetPartitionId::Mutator)] notation for the array elements.
It makes sense, I'll update it; the `[]` operation for arrays always has a bounds check.
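Concretely, the change would look roughly like this:

```c++
// current notation
Atomic::store(_region_counts + int(ShenandoahFreeSetPartitionId::Mutator), mutator_region_count);
// preferred notation, taking the address of the array element
Atomic::store(&_region_counts[int(ShenandoahFreeSetPartitionId::Mutator)], mutator_region_count);
```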
> src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 282:
>
>> 280:
>> 281: void ShenandoahRegionPartitions::increase_used(ShenandoahFreeSetPartitionId which_partition, size_t bytes) {
>> 282: shenandoah_assert_heaplocked();
>
> Are you now calling this directly from the CAS allocators? So you want to not have to assert heap locked and that is why we make the used accounting atomic?
>
> My preference would be to avoid this need by counting the entire region as used at the time the region becomes directly allocatable.
Yes, we increase used after a CAS allocation, so it can be called without the lock.
I'll revisit it along with your comments above; if we count the entire region as used when we reserve a region for direct allocation, this change can be removed.
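In that case, the accounting would move to the point where a region is installed as directly allocatable, still under the heap lock. A rough sketch of that alternative (the helper name below is made up for illustration):

```c++
// Under the heap lock: account for the region's entire free space up front,
// so the lock-free CAS allocations into this region need no further
// used-accounting on the mutator path.
void ShenandoahFreeSet::install_directly_allocatable_region(ShenandoahHeapRegion* r, uint slot) {
  shenandoah_assert_heaplocked();
  _partitions.increase_used(ShenandoahFreeSetPartitionId::Mutator, r->free());
  // ... publish r into slot of the padded array of directly allocatable regions ...
}
```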
> src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 2151:
>
>> 2149:
>> 2150: template<bool IS_TLAB>
>> 2151: HeapWord* ShenandoahFreeSet::par_allocate_single_for_mutator(ShenandoahAllocRequest &req, bool &in_new_region) {
>
> Not clear to me what prefix par_ represents. Parallel allocate (without lock?)
Yes.
The name is a bit confusing; what about `try_allocate_single_for_mutator`?
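For reference, the lock-free part is essentially a CAS bump of the region's top pointer. A simplified sketch of the idea (not the exact PR code; the `top_addr()` accessor is assumed here for illustration):

```c++
// Simplified sketch: CAS bump-pointer allocation within one directly
// allocatable region; returns null when the region cannot fit the request,
// so the caller moves on to the next directly allocatable region.
HeapWord* par_allocate_in_region(ShenandoahHeapRegion* r, size_t word_size) {
  for (;;) {
    HeapWord* obj = r->top();
    if (pointer_delta(r->end(), obj) < word_size) {
      return nullptr;
    }
    HeapWord* new_top = obj + word_size;
    if (Atomic::cmpxchg(r->top_addr(), obj, new_top) == obj) {
      return obj;  // we won the race for [obj, new_top)
    }
    // lost the race to another thread; reload top and retry
  }
}
```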
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2198891380
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2198896336
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2198902300
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2198905623