RFR: 8257774: G1: Trigger collect when free region count drops below threshold to prevent evacuation failures
Charlie Gracie
cgracie at openjdk.java.net
Wed Dec 16 19:19:04 UTC 2020
On Thu, 10 Dec 2020 11:58:26 GMT, Thomas Schatzl <tschatzl at openjdk.org> wrote:
>> Bursts of short-lived humongous object allocations can cause GCs to be initiated with 0 free regions. When these GCs happen they take significantly longer to complete. No objects are evacuated, so there is a large amount of time spent reversing self-forwarded pointers, and the only memory recovered is from the short-lived humongous objects. My proposal is to add a check to the slow allocation path which will force a GC to happen if the number of free regions drops below the amount that would be required to complete the GC if it happened at that moment. The threshold will be based on the survival rates from eden and survivor spaces along with the space required for tenured space evacuations.
>>
>> The goal is to resolve the issue with bursts of short-lived humongous objects without impacting other workloads negatively. I would appreciate reviews and any feedback that you might have. Thanks.
>>
>> Here are the links to the threads on the mailing list where I initially discussed the issue and my idea to resolve it:
>> https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2020-November/032189.html
>> https://mail.openjdk.java.net/pipermail/hotspot-gc-dev/2020-December/032677.html
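
To make the proposed check concrete, here is a minimal free-standing sketch of the threshold test, with the policy's predictions passed in as parameters for illustration (the parameter names and exact bookkeeping are illustrative, not the actual patch):

    #include <cstdint>

    // Sketch: the mutator may only take 'request' free regions if the regions
    // predicted to be needed to evacuate the next collection (surviving eden and
    // survivor bytes plus promotion into old regions) still fit into what remains.
    bool can_mutator_consume_free_regions(uint32_t free_regions,
                                          uint32_t predicted_eden_survivor_regions,
                                          uint32_t predicted_old_promotion_regions,
                                          uint32_t request) {
      uint32_t required_for_evacuation =
          predicted_eden_survivor_regions + predicted_old_promotion_regions;
      return free_regions >= required_for_evacuation + request;
    }

If the check fails, the slow allocation path triggers a collection instead of handing out the remaining free regions.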
>
> test/hotspot/jtreg/gc/g1/TestGCLogMessages.java line 316:
>
>> 314: private static byte[] garbage;
>> 315: private static byte[] largeObject;
>> 316: private static Object[] holder = new Object[800]; // Must be larger than G1EvacuationFailureALotCount
>
> Just curious about these changes: it is not immediately obvious to me why they are necessary as the mechanism to force evacuation failure (G1EvacuationFailureALotCount et al) should be independent of these changes.
>
> And the 17MB (for the humongous object) + 16MB of garbage should be enough for at least one gc; but maybe these changes trigger an early gc?
Yes, the GC was being triggered earlier and not getting the evacuation failure. With these adjustments it consistently gets an evacuation failure. I was seeing this with my initial prototype, so I will verify that it is still required.
> src/hotspot/share/gc/g1/g1VMOperations.hpp line 71:
>
>> 69: class VM_G1CollectForAllocation : public VM_CollectForAllocation {
>> 70: bool _gc_succeeded;
>> 71: bool _force_gc;
>
> Not completely happy about using an extra flag for these forced GCs here; also, this makes them indistinguishable from other GCs in the logs as far as I can see.
> What do you think about adding GCCause(s) instead to automatically make them stand out in the logs?
> Or at least make sure that they stand out in the logs for debugging issues.
I will remove the _force_gc flag and add a Preemptive GCCause.
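
Roughly, the idea is for these collections to carry their own cause so they stand out in the GC logs. A minimal self-contained sketch (the enum value and log string are illustrative, not the final patch):

    // Sketch: a dedicated cause makes the preventive collections distinguishable
    // from other young collections in the logs.
    enum class Cause {
      g1_inc_collection_pause,    // regular young/mixed pause
      g1_humongous_allocation,    // pause triggered by a humongous allocation
      g1_preventive_collection    // new: free regions dropped below the reserve
    };

    static const char* to_string(Cause c) {
      switch (c) {
        case Cause::g1_inc_collection_pause:  return "G1 Evacuation Pause";
        case Cause::g1_humongous_allocation:  return "G1 Humongous Allocation";
        case Cause::g1_preventive_collection: return "G1 Preventive Collection";
      }
      return "unknown";
    }

The VM operation would then be constructed with that cause instead of carrying the extra _force_gc flag.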
> src/hotspot/share/gc/g1/g1Policy.hpp line 102:
>
>> 100:
>> 101: size_t _predicted_survival_bytes_from_survivor;
>> 102: size_t _predicted_survival_bytes_from_old;
>
> As a non-English native speaker I think "survival_bytes" is strange as "survival" isn't an adjective. Maybe "surviving_bytes" sounds better?
> The code in `calculate_required_regions_for_next_collect` uses "survivor" btw. I would still prefer "surviving" in some way as it differs from the "survivor" in "survivor regions", but let's keep nomenclature at least consistent.
>
> Also please add a comment what these are used for.
I will change the names and properly comment them.
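
For reference, the renamed and documented fields could look something like this (final names and comment wording may differ):

    // Bytes predicted to survive the next collection out of the current survivor
    // regions; feeds into the number of free regions kept in reserve.
    size_t _predicted_surviving_bytes_from_survivor;

    // Bytes predicted to be evacuated out of old (tenured) regions in the next
    // collection; also feeds into that reserve.
    size_t _predicted_surviving_bytes_from_old;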
> src/hotspot/share/gc/g1/g1Policy.hpp line 368:
>
>> 366: uint& num_optional_regions);
>> 367:
>> 368: bool can_mutator_consume_free_regions(uint region_count);
>
> Comments missing.
I will add a comment in my next revision.
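
Something along these lines, perhaps (wording illustrative):

    // Returns true if handing 'region_count' free regions to the mutator would still
    // leave enough free regions to evacuate the predicted survivors of the next
    // collection; if it returns false, the caller should trigger a collection
    // instead of satisfying the allocation from the remaining free regions.
    bool can_mutator_consume_free_regions(uint region_count);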
> src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 423:
>
>> 421: for (uint try_count = 1, gclocker_retry_count = 0; /* we'll return */; try_count += 1) {
>> 422: bool should_try_gc;
>> 423: bool force_gc = false;
>
> `force_gc` and `should_try_gc` seem to overlap a bit here. At least the naming isn't perfect, because we may not do a gc even if `force_gc` is true, which I'd kind of expect.
>
> I do not have a good new name to suggest right now to fix this.
It will be removed as part of my next round of changes.
> src/hotspot/share/gc/g1/g1CollectedHeap.cpp line 428:
>
>> 426: {
>> 427: MutexLocker x(Heap_lock);
>> 428: if (policy()->can_mutator_consume_free_regions(1)) {
>
> I would prefer if `force_gc` (or whatever name it will have) were set here unconditionally, as the `else` is pretty far away.
>
> I.e.
> force_gc = policy()->can_mutator_consume_free_regions(1);
>
> if (force_gc) { // needing to use the name "force_gc" here shows that the name is wrong...
> ... try allocation
> ... check if we should expand young gen beyond regular size due to GCLocker
> }
> The other issue I have with using `can_mutator_consume_free_regions()` here is that there is already a very similar `G1Policy::should_allocate_mutator_region`; and anyway, the `attempt_allocation_locked` call may actually succeed without requiring a new region (actually, it is not uncommon that another thread got a free region while we were trying to take the `Heap_lock`).
>
> I think a better place for `can_mutator_consume_free_regions()` is in `G1Policy::should_allocate_mutator_region()` for this case.
>
> `attempt_allocation_locked` however does not return a reason for why allocation failed (at the moment). Maybe it is better to let it return a tuple with result and reason (or a second "out" parameter)? (I haven't tried how this would look; it seems worth trying and better than the current way of handling this).
>
> This one could be used in the following code.
For this code I am investigating moving the check into either `G1Policy::should_allocate_mutator_region()` itself or its caller, so I can distinguish the reason why the allocation failed. Hopefully I will have something ready to push this week. I am on vacation, so my responses are a little slow.
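
For reference, one shape the result-plus-reason idea could take (the types and names here are illustrative, not the actual change):

    // Sketch only: instead of returning just NULL on failure, the locked allocation
    // path could also report why it gave up, so the slow path can choose between
    // expanding the young gen, retrying, and triggering a preventive collection.
    enum class AllocFailReason {
      none,               // allocation succeeded
      out_of_regions,     // no free region was available at all
      reserve_exhausted   // free regions exist, but taking one would consume the reserve
    };

    struct AllocResult {
      void*           obj;     // allocated object, or nullptr on failure
      AllocFailReason reason;  // why the allocation failed, if it did
    };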
-------------
PR: https://git.openjdk.java.net/jdk/pull/1650