RFR: 8259643: ZGC can return metaspace OOM prematurely [v5]

Mon Nov 15 17:52:34 UTC 2021

On Mon, 15 Nov 2021 16:43:19 GMT, Erik Österlund <eosterlund at openjdk.org> wrote:

>> There exists a race condition for ZGC metaspace allocations, where an allocation can throw OOM due to unbounded starvation from other threads. Towards the end of the allocation dance, we conceptually do this:
>> 
>> 1. full_gc()
>> 2. final_allocation_attempt()
>> 
>> And if we still fail at 2 after doing a full GC, we conclude that there isn't enough metaspace memory. However, if the thread gets preempted between 1 and 2, then an unbounded number of metaspace allocations from other threads can fill up the entire metaspace, making the final allocation attempt fail and hence throw. This can cause a situation where almost the entire metaspace is unreachable from roots, yet we throw OOM. I managed to reproduce this with the right sleeps.
>> 
>> The way we deal with this particular issue for heap allocations is to have an allocation request queue and satisfy those allocations before others, preventing starvation. My solution to this metaspace OOM problem is to do exactly that: have a queue of "critical" allocations that get precedence over normal metaspace allocations.
>> 
>> The solution should work for other concurrent GCs (which likely have the same issue), but I only tried this with ZGC, so I am only hooking ZGC into the new API (for concurrently unloading GCs to manage critical metaspace allocations) at this point.
>> 
>> Passes ZGC tests from tier 1-5, and the particular test that failed (with the JVM sleeps that make it fail deterministically).
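[Editor's illustration, not part of the patch] The critical-allocation queue described above can be sketched roughly as follows. This is a toy model under stated assumptions, not the HotSpot implementation: the class ToyMetaspace and all of its member names are hypothetical, a byte counter stands in for metaspace, and release_all() stands in for a full GC with class unloading. The point is only that a request registered before the GC keeps later normal allocations from consuming the memory the GC frees.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical toy model of a metaspace with a critical-allocation queue.
// None of these names come from HotSpot.
class ToyMetaspace {
public:
  explicit ToyMetaspace(std::size_t capacity) : _capacity(capacity) {}

  // Normal allocation path: refuses while any critical request is pending,
  // so it cannot starve a thread that is between full_gc() and its
  // final allocation attempt.
  bool allocate(std::size_t size) {
    std::lock_guard<std::mutex> lock(_mutex);
    if (!_critical.empty()) {
      return false;
    }
    return try_allocate(size);
  }

  // Called before the full GC: enqueue the request and return a ticket.
  std::size_t register_critical(std::size_t size) {
    std::lock_guard<std::mutex> lock(_mutex);
    _critical.push_back({_next_ticket, size});
    return _next_ticket++;
  }

  // Final allocation attempt: only the request at the front of the queue
  // may proceed. The request is removed either way; a failure at this
  // point is treated as a genuine out-of-memory condition.
  bool allocate_critical(std::size_t ticket) {
    std::lock_guard<std::mutex> lock(_mutex);
    if (_critical.empty() || _critical.front().ticket != ticket) {
      return false;
    }
    const std::size_t size = _critical.front().size;
    _critical.pop_front();
    return try_allocate(size);
  }

  // Stand-in for a full GC that unloads everything.
  void release_all() {
    std::lock_guard<std::mutex> lock(_mutex);
    _used = 0;
  }

private:
  struct Request {
    std::size_t ticket;
    std::size_t size;
  };

  // Caller must hold _mutex.
  bool try_allocate(std::size_t size) {
    if (size > _capacity - _used) {
      return false;
    }
    _used += size;
    return true;
  }

  std::mutex _mutex;
  std::deque<Request> _critical;
  std::size_t _capacity;
  std::size_t _used = 0;
  std::size_t _next_ticket = 0;
};
```

In this sketch a thread would call register_critical() before triggering the GC and allocate_critical() afterwards; any allocate() call made by another thread in between is refused instead of filling the freed space.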
>
> Erik Österlund has updated the pull request incrementally with one additional commit since the last revision:
> 
>   style polish in ZGC code

Nice to have you back. The change still looks good.

Thanks, Thomas

-------------

PR: https://git.openjdk.java.net/jdk/pull/2289