RFR: 8324649: Shenandoah: refactor implementation of free set [v6]

Sat Feb 3 08:55:06 UTC 2024

On Wed, 31 Jan 2024 16:28:15 GMT, Kelvin Nilsen <kdnilsen at openjdk.org> wrote:

>> Several objectives:
>> 1. Reduce humongous allocation failures by segregating regular regions from humongous regions
>> 2. Do not retire regions just because an allocation failed within the region if the memory remaining within the region is large enough to represent a LAB
>> 3. Track range of empty regions in addition to range of available regions in order to expedite humongous allocations
>> 4. Treat collector reserves as available for Mutator allocations after evacuation completes
>> 5. Improve encapsulation so as to enable an OldCollector reserve for future integration of generational Shenandoah
>> 
>> On internal performance pipelines, this change shows:
>> 
>>  1. some Increase in page faults and rss_max with certain workloads, presumably because of "segregation" of humongous from regular regions.
>>  2. An increase in System CPU time on certain benchmarks: sunflow (+165%), scimark.sparse.large (+50%), lusearch (+43%).  This system CPU time increase appears to correlate with increased page faults and/or rss.
>>  3. An increase in trigger_failure for the hyperalloc_a2048_o4096 experiment (not yet understood)
>>  4. 2-30x improvements on multiple metrics of the Extremem phased workload latencies (most likely resulting from fewer degenerated or full GCs)
>> 
>> Shenandoah
>> -------------------------------------------------------------------------------------------------------
>> +166.55% scimark.sparse.large/minor_page_fault_count p=0.00000
>>   Control: 819938.875   (+/-5724.56  )         40
>>   Test:    2185552.625   (+/-26378.64  )         20
>> 
>> +166.16% scimark.sparse.large/rss_max p=0.00000
>>   Control: 3285226.375   (+/-22812.93  )         40
>>   Test:    8743881.500   (+/-104906.69  )         20
>> 
>> +164.78% sunflow/cpu_system p=0.00000
>>   Control:      1.280s  (+/-  0.10s )         40
>>   Test:         3.390s  (+/-  0.13s )         20
>> 
>> +149.29% hyperalloc_a2048_o4096/trigger_failure p=0.00000
>>   Control:      3.259   (+/-  1.46  )         33
>>   Test:         8.125   (+/-  2.05  )         20
>> 
>> +143.75% pmd/major_page_fault_count p=0.03622
>>   Control:      1.000   (+/-  0.00  )         40
>>   Test:         2.438   (+/-  2.59  )         20
>> 
>> +80.22% lusearch/minor_page_fault_count p=0.00000
>>   Control: 2043930.938   (+/-4777.14  )         40
>>   Test:    3683477.625   (+/-5650.29  )         20
>> 
>> +50.50% scimark.sparse.small/minor_page_fault_count p=0.00000
>>   Control: 697899.156   (+/-3457.82  )         40
>>   Test:    1050363.812   (+/-175...
>
> Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Rename and comments for _capacity_of and _used_by

I believe I am done with the last round and have for the most part completed most of the feedback. Sorry for the delay, and thanks for your patience!

Can take another look once the feedback has been addressed. Promise to be more prompt on the re-review since I have been over this once already so it should be relatively quick.

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 155:

> 153: // Remove this region from its free partition, but leave its capacity and used as part of the original free partition's totals.
> 154: // When retiring a region, add any remnant of available memory within the region to the used total for the original free partition.
> 155: void ShenandoahRegionPartition::retire_within_partition(size_t idx, size_t used_bytes) {

Why is the method called `retire_within_partition()` instead of `retire_from_partition()` ?

(i.e. why _within_ partition, since it's leaving its free partition?)

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 266:

> 264: size_t ShenandoahRegionPartition::leftmost_empty(ShenandoahFreeSetPartitionId which_partition) {
> 265:   assert (which_partition > NotFree && which_partition < NumPartitions, "selected free partition must be valid");
> 266:   for (size_t idx = _leftmosts_empty[which_partition]; idx < _max; idx++) {

See next comment. Would it simplify matters to index all these loops with `ssize_t`. I am thinking it should work more cleanly than these multiple variations to deal with underflow in a few places(see comment below).

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 279:

> 277: inline size_t ShenandoahRegionPartition::rightmost_empty(ShenandoahFreeSetPartitionId which_partition) {
> 278:   assert (which_partition > NotFree && which_partition < NumPartitions, "selected free partition must be valid");
> 279:   for (intptr_t idx = _rightmosts_empty[which_partition]; idx >= 0; idx--) {

`ssize_t` instead of `intptr_t`.
I'd change the signatures of all these to use `ssize_t` just to avoid these somewhat awkward constructs.

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 287:

> 285:   _leftmosts_empty[which_partition] = _max;
> 286:   _rightmosts_empty[which_partition] = 0;
> 287:   return 0;

To my earlier comment of using `ssize_t`, that would allow us to signal failure here by returning a -1.

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 441:

> 439:       // size_t is unsigned, need to dodge underflow when _leftmost = 0
> 440:       // Fast-path: try to allocate in the collector view first
> 441:       for (size_t c = _partitions.rightmost(Collector) + 1; c > _partitions.leftmost(Collector); c--) {

Use `ssize_t` for c.

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 442:

> 440:       // Fast-path: try to allocate in the collector view first
> 441:       for (size_t c = _partitions.rightmost(Collector) + 1; c > _partitions.leftmost(Collector); c--) {
> 442:         size_t idx = c - 1;

Here and further below (lines 457-458), you start at rightmost + 1, check if it's greater than leftmost, then reduce the index by 1 (since you had started to the right of the rightmost) to check. Couldn't you simply start at rightmost, check if it was >= leftmost, and use it?

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 457:

> 455: 
> 456:       // Try to steal an empty region from the mutator view.
> 457:       for (size_t c = _partitions.rightmost_empty(Mutator) + 1; c > _partitions.leftmost_empty(Mutator); c--) {

`ssize_t` to keep all these loops uniform.

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 467:

> 465:               log_debug(gc, free)("Flipped region " SIZE_FORMAT " to gc for request: " PTR_FORMAT, idx, p2i(&req));
> 466:               return result;
> 467:             }

It seems like this can cause potentially many (because of the loop) Mutator regions to flip to Collector (can we call the method `flip_to_collector`?) sometimes even when the request won't be satisfied. Why not flip to Collector only _after_ the allocation is successful? I assume the attempt to allocate would run afoul of assertion checks if it happened before the flip, but I worry about flipping a bunch of stuff unnecessarily and failing to allocate in them after all. Is that futile flipping cause for concern? Can it be avoided (e.g. by repositioning the assertion checks using a proxy variable to signal intent to flip following a successful allocation, then using it to ensure the post-allocation flip, or something similar)? May be such futile flips are uncommon and not a cause for concern?

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 550:

> 548:     // allocate within.  This was observed to result in large amounts of available memory being ignored
> 549:     // following a failed shared allocation request.  TLAB requests will generally downsize to absorb all
> 550:     // memory available within the region even if this is less than the desired size.

I don't understand this comment, since you are it seems to me retiring the region below at line 553. (Also see comment elsewhere on calling the method `retire_within_partition`, instead of the more natural (to me) `retire_from_partition`.

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 683:

> 681: // move some of the mutator regions into the collector partition with the intent of packing collector memory into the
> 682: //  highest (rightmost) addresses of the heap, with mutator memory consuming the lowest addresses of the heap.
> 683: void ShenandoahFreeSet::find_regions_with_alloc_capacity(size_t &cset_regions) {

This method seems to belong to a public API of `ShenandoahRegionPartitions`. See also comment at the call site of this method.

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 715:

> 713: }
> 714: 
> 715: // Move no more than max_xfer_regions from the existing Collector free partitions to the Mutator free partition.

I'd avoid the somewhat redundant "Mutator free partition" or "Collector free partition", but merely say "Mutator partition" and "Collector partition". I'd reserve the term "free partition" for a partition that is not a NotFree partition. This allows terseness and precision at the same time.

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 761:

> 759: void ShenandoahFreeSet::prepare_to_rebuild(size_t &cset_regions) {
> 760:   shenandoah_assert_heaplocked();
> 761:   // This resets all state information, removing all regions from all partitions.

I thought it makes them all unavailable, placing them all into the NotFree partition.

src/hotspot/share/gc/shenandoah/shenandoahFreeSet.cpp line 766:

> 764: 
> 765:   // This places regions that have alloc_capacity into the mutator partition.
> 766:   find_regions_with_alloc_capacity(cset_regions);

In conjunction with `clear()` above, it looks like we are doing two walks of the _membership array in the implementation as a result of this. Why not just have a single API from `ShenandoahRegionPartitions` that walks over the regions and sorts them into the NotFree or the Mutator partition in one go, rather than one walk to clear and another to then move some into Mutator?

Also the method should probably be renamed to `move_alloc_regions_to_mutator()`, which should be moved into `ShenandoahRegionPartitions` class as a public API for this class `ShenandoahFreeSet` to call.

-------------

Changes requested by ysr (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/17561#pullrequestreview-1860712682
PR Review Comment: https://git.openjdk.org/jdk/pull/17561#discussion_r1477019821
PR Review Comment: https://git.openjdk.org/jdk/pull/17561#discussion_r1477011350
PR Review Comment: https://git.openjdk.org/jdk/pull/17561#discussion_r1477011145
PR Review Comment: https://git.openjdk.org/jdk/pull/17561#discussion_r1477012364
PR Review Comment: https://git.openjdk.org/jdk/pull/17561#discussion_r1477016777
PR Review Comment: https://git.openjdk.org/jdk/pull/17561#discussion_r1477008101
PR Review Comment: https://git.openjdk.org/jdk/pull/17561#discussion_r1477016903
PR Review Comment: https://git.openjdk.org/jdk/pull/17561#discussion_r1477010479
PR Review Comment: https://git.openjdk.org/jdk/pull/17561#discussion_r1477020531
PR Review Comment: https://git.openjdk.org/jdk/pull/17561#discussion_r1476915853
PR Review Comment: https://git.openjdk.org/jdk/pull/17561#discussion_r1476917127
PR Review Comment: https://git.openjdk.org/jdk/pull/17561#discussion_r1476914673
PR Review Comment: https://git.openjdk.org/jdk/pull/17561#discussion_r1476916083