RFR: 8327000: GenShen: Integrate updated Shenandoah implementation of FreeSet into GenShen [v8]
Kelvin Nilsen
kdnilsen at openjdk.org
Wed Jun 19 20:54:30 UTC 2024
On Tue, 18 Jun 2024 00:56:30 GMT, Y. Srinivas Ramakrishna <ysr at openjdk.org> wrote:
>> Kelvin Nilsen has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Minor refinements to test programs
>>
>> TestAllocIntArrays: comments to explain behavior.
>> TestOldGrowthTriggers: reduce the number of loop iterations so this test
>> will not time out on less powerful test platforms.
>
> src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1029:
>
>> 1027: // the better remediation than repeated attempts to allocate following repeated GC cycles.
>> 1028:
>> 1029: while ((result == nullptr) && (original_count == shenandoah_policy()->full_gc_count())) {
>
> I'd go a bit farther in the direction of being conservative and declare best effort only after a full/global GC (whether concurrent or stop-world, although stop-world would be better). I realize this may be much worse latency-wise than customers would care for, but I think the notion of latency being poor for a concurrent collector should probably be separated from the notion of cpu overhead for GC and space remaining after a best effort to reclaim unused memory. I realize that in the concurrent collector case where low latency is a design requirement, this would make the notion of OOM for exceeding gc overhead unlikely to ever engage, with latency already being very poor.
>
> If we want to control that separately, I'd have some notion of frequency (or time density if you will) of latency-violating events that would trigger an OOM, although that would be new behaviour. Most JVM users typically use JMX/introspection to self-destruct or to raise an alarm about malfunction. It can of course be argued that that would cover the existing OOM upon gc overhead too, because ultimately a very slow JVM because of GC overhead would translate to worsening latency for user code.
>
> Still, I would keep the notion of gc overhead and space available separate from the _latency_ of a stop-world or concurrent event and always ensure that we look at available space only following a global/full gc to decide if an OOM due to GC overhead should be thrown. Can discuss this more offline in person.
These are subtle issues. I'm not sure how best to address them. Will go with consensus. One other observation:
Exception in thread thread_name: java.lang.OutOfMemoryError: GC Overhead limit exceeded
Cause: The detail message "GC overhead limit exceeded" indicates that the garbage collector is running all the time and the Java program is making very slow progress. After a garbage collection, if the Java process is spending more than approximately 98% of its time doing garbage collection and if it is recovering less than 2% of the heap and has been doing so for the last 5 (compile time constant) consecutive garbage collections, then a java.lang.OutOfMemoryError is thrown. This exception is typically thrown because the amount of live data barely fits into the Java heap, leaving little free space for new allocations.
from https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/memleaks002.html
This description aligns with our default behavior of ShenandoahNoProgressThreshold default of 5, which would trigger in this PR's current implementation if we do 5 back-to-back unproductive full GCs (regardless of how much CPU time is consumed by GC vs mutator threads). Unproductive is defined as NOT is_good_progress(), where is_good_progress() (see shenandoahMetrics.cpp) is:
1. free memory is at least 1% of the heap capacity, and at least one of:
a. the amount of used memory decreased by at least the size of one region, or
b. internal fragmentation decreased by at least 1%, or
c. external fragmentation decreased by at least 1%
One problem that I have observed with certain test programs is that this definition of "is_good_progress" is pretty generous. We can have many back-to-back full gcs that "make progress" according to this generous definition, while a mutator thread repeatedly retries and fails to satisfy its allocation request.
My judgement is that the mutator is much better served by getting a quick OOM exception rather than chugging away in this retry loop.
> src/hotspot/share/gc/shenandoah/shenandoahHeapRegion.cpp line 132:
>
>> 130:
>> 131:
>> 132:
>
> delete. The 3 blank lines seem to be a mismerge?
Thanks. Removed.
> src/hotspot/share/gc/shenandoah/shenandoahVerifier.cpp line 429:
>
>> 427: size_t generation_capacity = generation->max_capacity();
>> 428: guarantee(stats.span() <= generation_capacity,
>> 429: "%s: generation (%s) size spanned by regions (" SIZE_FORMAT ") * region size: " PROPERFMT
>
> For consistency, place the region size value also in parentheses `( ... )` like the other values?
Thanks. Fixed.
-------------
PR Review Comment: https://git.openjdk.org/shenandoah/pull/440#discussion_r1646673274
PR Review Comment: https://git.openjdk.org/shenandoah/pull/440#discussion_r1646673607
PR Review Comment: https://git.openjdk.org/shenandoah/pull/440#discussion_r1646673995