RFR: Shrink tlab to capacity

Y. Srinivas Ramakrishna ysr at openjdk.org
Mon Dec 12 21:36:36 UTC 2022


On Fri, 9 Dec 2022 23:23:43 GMT, Kelvin Nilsen <kdnilsen at openjdk.org> wrote:

> When a TLAB request exceeds the currently available memory within young-gen, the existing behavior is to reject the TLAB request outright. This is recognized as a failed allocation request, which triggers degenerated GC.
> 
> This change introduces code to reduce the likelihood that too-large TLAB requests will be issued, and when they are issued, it makes an effort to shrink the TLAB request in order to reduce the need for degenerated GC.
> 
> The impact is difficult to measure because this situation is fairly rare.  On one Extremem workload, the TLAB-shrinking code is exercised only once during a 16-minute run involving 500 concurrent GCs, a 45 GiB heap, and a 28 GiB young-gen size.  The change reduces the degenerated GCs from 6 to 5.
> 
> One reason the remaining 5 degenerated GCs are not addressed by this change is that further work is required to handle a situation in which a requested TLAB is smaller than the available young-gen memory, but that memory is set aside in the evacuation reserve and so cannot be provided to a mutator.  Future work will address this condition.
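
If I read the change correctly, the shrink-to-capacity policy amounts to something like the following sketch (the function and variable names here are illustrative only, not the actual identifiers in the change):

    // Hypothetical sketch of the TLAB-shrinking policy.  All names
    // (satisfy_tlab_request, young_available_words, min_tlab_words)
    // are made up for illustration.
    size_t satisfy_tlab_request(size_t requested_words,
                                size_t young_available_words,
                                size_t min_tlab_words) {
      if (requested_words <= young_available_words) {
        return requested_words;       // request fits as issued
      }
      if (young_available_words >= min_tlab_words) {
        // Shrink the TLAB to what young-gen can currently supply,
        // rather than failing the request outright and triggering
        // a degenerated GC.
        return young_available_words;
      }
      return 0;   // nothing usable; fall through to the failed-allocation path
    }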

Looks good, modulo a comment I left inline in the ShenandoahHeap::allocate_memory_under_lock() method.

src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1368:

> 1366:   // satisfy the allocation request.  The reality is the actual TLAB size is likely to be even smaller, because it will
> 1367:   // depend on how much memory is available within mutator regions that are not yet fully used.
> 1368:   HeapWord* result = allocate_memory_under_lock(smaller_req, in_new_region, is_promotion);

Can you help me understand the structure here?

Would it not have been simpler to keep sufficient state at the point where the attempt to allocate the larger size failed and we decided to shrink the request, and then to issue the smaller allocation right there? That smaller request would be guaranteed to succeed, because we already held the heap lock at that point. Is there a reason to give up and reattempt the smaller allocation request afresh?

I realize you explicitly added a scope so that this re-attempt happens outside the scope of the locker, via the recursive call, but I am trying to understand the rationale for doing so. Perhaps I am missing the big picture of the work being done here by the various callers of this method, but maybe you can help clarify that a bit.
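
To make the alternative concrete, I would have expected a shape roughly like the following, with the retry issued while the lock is still held (a sketch only; the helper names are made up, not the actual Shenandoah code):

    HeapWord* allocate_memory_under_lock(ShenandoahAllocRequest& req) {
      // ... the heap lock is already held at this point ...
      HeapWord* result = attempt_allocation(req);     // hypothetical helper
      if (result == nullptr && request_is_lab(req)) { // hypothetical predicate
        size_t smaller = capacity_limited_size(req);  // hypothetical helper
        if (smaller >= minimum_lab_size(req)) {       // hypothetical helper
          shrink_request_to(req, smaller);            // hypothetical
          // Retry immediately: since the lock is still held, no other
          // thread can consume the capacity we just computed, so this
          // smaller request should be guaranteed to succeed.
          result = attempt_allocation(req);
        }
      }
      return result;
    }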

-------------

Marked as reviewed by ysr (Author).

PR: https://git.openjdk.org/shenandoah/pull/180

