RFR: 8361099: Shenandoah: Improve heap lock contention by using CAS for memory allocation
Kelvin Nilsen
kdnilsen at openjdk.org
Mon Sep 22 08:46:40 UTC 2025
On Mon, 7 Jul 2025 19:56:30 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:
> Shenandoah always allocates memory under the heap lock; we have observed heavy heap lock contention on the memory allocation path in performance analysis of a service in which we tried to adopt Shenandoah. This change proposes an optimization of the mutator allocation path to reduce heap lock contention. At a very high level, here is how it works:
> * ShenandoahFreeSet holds N (default 13) ShenandoahHeapRegion* entries which are used by mutator threads for regular object allocations. They are called shared regions/directly allocatable regions and are stored in a PaddedEnd data structure (padded array).
> * Each mutator thread is assigned one of the directly allocatable regions. The thread tries to allocate in that region with a CAS atomic operation; if that fails, it tries 2 more consecutive directly allocatable regions in the padded array.
> * If the mutator thread still fails after trying 3 directly allocatable regions, it will:
> * Take heap lock
> * Try to retire the directly allocatable regions which are ready to retire.
> * Iterate the Mutator partition, allocate new directly allocatable regions, and store them into the padded array for any slots that need to be retired.
> * Satisfy the mutator allocation request if possible.
>
>
> I'm not expecting a significant performance impact in most cases, since the heap lock contention is usually not high enough to cause performance issues. I have done many tests; here are some of them:
>
> 1. DaCapo lusearch test on an EC2 host with 96 CPU cores:
> OpenJDK tip:
>
> [ec2-user at ip-172-31-42-91 jdk]$ ./master-jdk/bin/java -XX:-TieredCompilation -XX:+AlwaysPreTouch -Xms4G -Xmx4G -XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions -XX:+UnlockDiagnosticVMOptions -XX:-ShenandoahUncommit -XX:ShenandoahGCMode=generational -XX:+UseTLAB -jar ~/tools/dacapo/dacapo-23.11-MR2-chopin.jar -n 10 lusearch | grep "metered full smoothing"
> ===== DaCapo tail latency, metered full smoothing: 50% 131684 usec, 90% 200192 usec, 99% 211369 usec, 99.9% 212517 usec, 99.99% 213043 usec, max 235289 usec, measured over 524288 events =====
> ===== DaCapo tail latency, metered full smoothing: 50% 1568 usec, 90% 36101 usec, 99% 42172 usec, 99.9% 42928 usec, 99.99% 43100 usec, max 43305 usec, measured over 524288 events =====
> ===== DaCapo tail latency, metered full smoothing: 50% 52644 usec, 90% 124393 usec, 99% 137711 usec, 99.9% 139355 usec, 99.99% 139749 usec, max 146722 usec, measured over 524288 events ====...
Thanks for doing this work. It is huge progress. I've left a number of comments. I didn't have time to study/comment on all of the code, as I am out of the office for the rest of this week. I'll look more when I return.
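Before the detailed comments, here is my rough mental model of the fast path described above, written as a small standalone sketch (plain C++ atomics, illustrative names and constants only, not the actual patch): each thread hashes to a home slot in a padded array of shared regions and attempts a CAS bump-pointer allocation in up to three consecutive slots before falling back to the heap lock.

    #include <atomic>
    #include <cstddef>

    // Illustrative stand-in for a heap region: a bump pointer over [bottom, end).
    struct Region {
      char* bottom;
      char* end;
      std::atomic<char*> top;   // shared bump pointer, advanced with CAS
    };

    static const unsigned kSharedRegions = 13;   // "directly allocatable" slots
    static const unsigned kMaxProbes     = 3;    // consecutive slots to probe

    // One slot per cache line to limit false sharing (stand-in for PaddedEnd<>).
    struct alignas(64) Slot { std::atomic<Region*> region; };
    static Slot shared_regions[kSharedRegions];

    // CAS bump-pointer allocation inside one region; nullptr if it cannot fit.
    static char* par_allocate_in(Region* r, size_t bytes) {
      char* cur = r->top.load(std::memory_order_relaxed);
      while (cur + bytes <= r->end) {
        if (r->top.compare_exchange_weak(cur, cur + bytes,
                                         std::memory_order_relaxed)) {
          return cur;                 // success: we own [cur, cur + bytes)
        }                             // failed CAS refreshed 'cur'; retry
      }
      return nullptr;                 // region exhausted for this request
    }

    // Fast path: probe up to kMaxProbes consecutive slots starting at the
    // thread's home slot. If every probe fails, the caller takes the heap
    // lock, retires/replaces exhausted slots, and satisfies the request there.
    char* par_allocate(size_t bytes, unsigned home_slot) {
      for (unsigned i = 0; i < kMaxProbes; i++) {
        unsigned idx = (home_slot + i) % kSharedRegions;
        Region* r = shared_regions[idx].region.load(std::memory_order_acquire);
        if (r != nullptr) {
          char* result = par_allocate_in(r, bytes);
          if (result != nullptr) {
            return result;
          }
        }
      }
      return nullptr;
    }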
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 246:
> 244: _leftmosts_empty[int(ShenandoahFreeSetPartitionId::Mutator)] = mutator_leftmost_empty;
> 245: _rightmosts_empty[int(ShenandoahFreeSetPartitionId::Mutator)] = mutator_rightmost_empty;
> 246:
We have the heap lock here, so we should not need to use atomic store operations. Atomic operations have a performance penalty that I think we want to avoid.
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 247:
> 245: _rightmosts_empty[int(ShenandoahFreeSetPartitionId::Mutator)] = mutator_rightmost_empty;
> 246:
> 247: Atomic::store(_region_counts + int(ShenandoahFreeSetPartitionId::Mutator), mutator_region_count);
If we do need to use Atomic operations, I would prefer the &_region_counts[int(ShenandoahFreeSetPartitionId::Mutator)] notation for the array elements.
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 282:
> 280:
> 281: void ShenandoahRegionPartitions::increase_used(ShenandoahFreeSetPartitionId which_partition, size_t bytes) {
> 282: shenandoah_assert_heaplocked();
Are you now calling this directly from the CAS allocators? So you want not to have to assert that the heap is locked, and that is why the used accounting is made atomic?
My preference would be to avoid this need by counting the entire region as used at the time the region becomes directly allocatable.
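A sketch of what I mean, with standalone, illustrative names rather than the actual FreeSet types: charge the whole unallocated remainder of the region to the partition once, under the heap lock, at the moment the region is published as directly allocatable, so the CAS fast path never touches the used counter at all.

    #include <cstddef>

    // Illustrative only; not the actual region/partition bookkeeping.
    struct Region { char* bottom; char* end; char* top; };

    struct Partition {
      size_t used_bytes;        // mutated only under the heap lock
      size_t capacity_bytes;
    };

    // Called under the heap lock when 'r' is installed into the directly
    // allocatable array: count the entire unallocated remainder as used now.
    void publish_as_directly_allocatable(Partition& mutator, Region* r) {
      size_t remainder = static_cast<size_t>(r->end - r->top);
      mutator.used_bytes += remainder;
      // From here on, CAS allocations inside 'r' require no accounting
      // updates; whatever is left unused when the region is retired has
      // already been counted (it becomes fill/waste, as with a retired TLAB).
    }

The cost is that used/available becomes slightly pessimistic between publication and retirement, but that seems like a fair trade for removing atomics from the allocation path.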
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 615:
> 613: size_t capacity = _free_set->alloc_capacity(i);
> 614: bool is_empty = (capacity == _region_size_bytes);
> 615: // TODO remove assert, not possible to pass when allow mutator to allocate w/o lock.
Probably the preferred approach here is to "pre-retire" regions when they are made directly allocatable. When the region is pre-retired, it is taken out of the partition, so assert_bounds no longer applies to this region.
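Roughly (standalone and illustrative, not the actual partition code): publication under the heap lock removes the region from the partition's membership, so the interval maintained for assert_bounds simply never includes a region whose top is being advanced concurrently by CAS.

    #include <cstddef>

    static const size_t kMaxRegions = 2048;   // illustrative

    struct Partition {
      bool   is_member[kMaxRegions];   // which regions the partition tracks
      size_t leftmost, rightmost;      // maintained only over members
    };

    // Under the heap lock: "pre-retire" the region from the free set's point
    // of view before exposing it for lock-free allocation.
    void make_directly_allocatable(Partition& mutator, size_t region_index) {
      mutator.is_member[region_index] = false;
      // ... shrink leftmost/rightmost if region_index was at either end ...
      // ... then install the region into the directly allocatable array ...
    }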
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 2151:
> 2149:
> 2150: template<bool IS_TLAB>
> 2151: HeapWord* ShenandoahFreeSet::par_allocate_single_for_mutator(ShenandoahAllocRequest &req, bool &in_new_region) {
It is not clear to me what the prefix par_ represents. Parallel allocation (without the lock)?
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 2168:
> 2166: for (uint i = 0u; i < max_probes; i++) {
> 2167: ShenandoahHeapRegion** shared_region_address = _directly_allocatable_regions + idx;
> 2168: ShenandoahHeapRegion* r = Atomic::load_acquire(shared_region_address);
This code has a lot more synchronization overhead than is required for CAS allocations. load_acquire() forces a memory fence: all writes performed by other threads before the matching release store must be visible to this thread upon return from load_acquire(). I would like to see some documentation that describes the coherency model that we assume/require here. Can we write this up as a comment in the header file?
Personal preference: I think there are many situations where we get better performance if we allow ourselves to see slightly old data, and we can argue that the slightly old data is "harmless". For example, if some other thread replaces directly_allocatable_region[N] while we're attempting to allocate from directly_allocatable_region[N], we might attempt to allocate from the original region and fail. That's harmless. We'll just retry at the next probe point. If multiple probes fail to allocate, we'll take the synchronization lock and everything will be resolved there. The accumulation of atomic/volatile accesses has a big impact on performance. I've measured this in previous experiments. You can do the same.
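To make this concrete, a standalone sketch (std::atomic, illustrative names; the publication side, which installs a fresh region under the heap lock, is a separate question). It assumes, as in Shenandoah, that region objects live for the lifetime of the heap, so any pointer read from a slot refers to a fully constructed region and only its bump pointer is in flux:

    #include <atomic>
    #include <cstddef>

    struct Region {
      char* end;
      std::atomic<char*> top;
    };

    std::atomic<Region*> slot;   // one entry of the directly allocatable array

    // A relaxed load is enough on the read side. If the slot was just
    // replaced, we may still see the old region: the CAS either succeeds
    // (the old region is still valid heap memory) or fails (it was retired
    // and filled), and in the failure case we simply move to the next probe
    // or fall back to the heap lock.
    char* try_allocate(size_t bytes) {
      Region* r = slot.load(std::memory_order_relaxed);
      if (r == nullptr) {
        return nullptr;
      }
      char* cur = r->top.load(std::memory_order_relaxed);
      while (cur + bytes <= r->end) {
        if (r->top.compare_exchange_weak(cur, cur + bytes,
                                         std::memory_order_relaxed)) {
          return cur;
        }
      }
      return nullptr;
    }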
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 2191:
> 2189: }
> 2190: // If any of the 3 consecutive directly allocatable regions is ready for retire and replacement,
> 2191: // grab heap lock try to retire all ready-to-retire shared regions.
It would be preferable to allow this thread to retire all ready-to-retire regions in the directly allocatable set (not just the three that it knows about) while it holds the heap lock. We do not necessarily need to keep a separate per-thread representation of ready-to-retire shared regions. This is a very rare event. Just iterate through the 13 (or however many) regions in the directly allocatable set and ask for each whether end - top is less than the minimum plab size.
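Roughly what I have in mind, as standalone illustrative code (the helper calls in comments are hypothetical): once a thread holds the heap lock anyway, sweep the whole directly allocatable array and retire/replace every slot whose remaining space is below the minimum useful size.

    #include <cstddef>

    struct Region { char* top; char* end; };

    static const unsigned kSharedRegions = 13;
    static Region* shared_regions[kSharedRegions] = {};   // the padded array
    static size_t  min_useful_bytes = 2048;               // e.g. minimum plab size (illustrative)

    // Must be called with the heap lock held. This path is rare (only after
    // the lock-free probes have failed), so sweeping 13 slots is cheap.
    void retire_and_replenish_all() {
      for (unsigned i = 0; i < kSharedRegions; i++) {
        Region* r = shared_regions[i];
        if (r != nullptr && size_t(r->end - r->top) < min_useful_bytes) {
          // retire_region(r);                      // fill and account remainder
          // shared_regions[i] = take_region_from_mutator_partition();
        }
      }
    }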
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.hpp line 81:
> 79: // are denoted in bytes. Note that some regions that had been assigned to a particular partition at rebuild time
> 80: // may have been retired following the rebuild. The tallies for these regions are still reflected in _capacity[p]
> 81: // and _used[p], even though the region may have been removed from the free set.
I would prefer not to make these volatile, as that imposes a compiler overhead.
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.hpp line 226:
> 224: inline size_t available_in(ShenandoahFreeSetPartitionId which_partition) const {
> 225: assert (which_partition < NumPartitions, "selected free set must be valid");
> 226: return capacity_of(which_partition) - used_by(which_partition);
We can't do this outside of a lock. The value fetched for capacity may be incoherent with the value fetched for used (because each is modified independently). That is why the previous version of the code used the available[] array and computed its value while holding the lock.
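A minimal illustration of the problem (standalone, illustrative counters, not the actual fields):

    #include <atomic>
    #include <cstddef>

    std::atomic<size_t> capacity{100u << 20};   // updated independently...
    std::atomic<size_t> used    { 60u << 20};   // ...of each other

    // Unsafe outside the lock: another thread can update both counters in
    // between the two loads, so 'cap' and 'u' may come from two different
    // states of the partition. The difference can be wrong, or even
    // underflow if u > cap, which is why the old code computed the
    // available value once, while holding the lock, and cached it.
    size_t available_unlocked() {
      size_t cap = capacity.load(std::memory_order_relaxed);
      size_t u   = used.load(std::memory_order_relaxed);
      return cap - u;
    }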
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.hpp line 232:
> 230: // returned without acquiring the lock. In debug builds, the global heap lock is acquired in order to
> 231: // enforce a consistency assert.
> 232: inline size_t available_in_not_locked(ShenandoahFreeSetPartitionId which_partition) const {
These changes are beyond the scope of the planned topic. I think we need to consider them more carefully, and I would prefer not to mix the two. (I personally believe the original implementation has better performance, but feel free to prove me wrong.)
src/hotspot/share/gc/shenandoah/shenandoahFreeSet.hpp line 274:
> 272: };
> 273:
> 274: #define DIRECTLY_ALLOCATABLE_REGION_UNKNOWN_AFFINITY ((Thread*)-1)
I'm out of time to dive deep into this right now. I wonder if it makes sense to randomly generate a hash for each thread and store it in a thread-local field. That might provide "randomness" and locality.
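Something along these lines, as a standalone illustration (thread_local and std::random_device stand in for whatever per-thread state and RNG the VM would actually use): each thread lazily picks a random, fixed affinity and uses it as its home slot, which spreads threads across the array while keeping each thread sticky to its slot.

    #include <cstdint>
    #include <random>

    static const unsigned kSharedRegions = 13;

    // Lazily initialized per-thread affinity; fixed for the thread's
    // lifetime, so the thread keeps returning to the same home slot.
    static unsigned home_slot() {
      static thread_local uint32_t affinity = std::random_device{}();
      return affinity % kSharedRegions;
    }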
-------------
PR Review: https://git.openjdk.org/jdk/pull/26171#pullrequestreview-2998610364
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2193123384
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2193127191
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2193133676
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2193188287
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2195596428
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2195616392
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2195638643
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2193242080
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2193250484
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2195588615
PR Review Comment: https://git.openjdk.org/jdk/pull/26171#discussion_r2195661494