RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Thu Jan 28 16:26:40 UTC 2021

On Thu, 28 Jan 2021 13:37:16 GMT, Erik Österlund <eosterlund at openjdk.org> wrote:

>> Marked as reviewed by pliden (Reviewer).
>
> Thanks for the reviews @pliden and @stefank.

Hi Erik,

let's see if I understand the problem:

1 n threads allocate metaspace
2 thread A gets an allocation error (not HWM but a hard limit)
3 .. returns, (eventually) schedules a synchronous GC.
4 .. gc runs, the CLDG is at some point purged, releases metaspace pressure
5 other threads load classes, allocating metaspace concurrently, benefitting from the released pressure
6 thread A reattempts allocation, fails again.

This is normally not a problem, no? Which thread exactly gets the OOM if the VM hovers that close to the limit does not really matter. But you write "This can cause a situation where almost the entire metaspace is unreachable from roots, yet we throw OOM." - so we could get OOMs even if most of the Metaspace were vacant? This can only happen if, between (4) and (6), other threads not only allocate metaspace, but then also lose the loaders used for those allocations, too late for the GC at (4) to collect them but before (5). Collecting them would require another GC.
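
To make the sequence concrete, here is a toy, single-threaded model of steps (1) to (6). It is only a sketch under assumptions: `ToyMetaspace`, its word-based hard limit, the loader names and `run_scenario` are all hypothetical and have nothing to do with the real HotSpot data structures. It merely shows how the retry at (6) can fail while everything in the metaspace is already unreachable.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Toy model, not HotSpot code: a fixed-size metaspace in which each loader
// owns some number of words, and a "GC" frees the words of dead loaders.
struct ToyMetaspace {
    std::size_t limit = 0;                     // hard limit in words
    std::map<std::string, std::size_t> used;   // words owned per loader
    std::set<std::string> dead;                // loaders unreachable from roots

    std::size_t total_used() const {
        std::size_t sum = 0;
        for (const auto& entry : used) sum += entry.second;
        return sum;
    }

    // Returns false when the hard limit would be exceeded.
    bool allocate(const std::string& loader, std::size_t words) {
        if (total_used() + words > limit) return false;
        used[loader] += words;
        return true;
    }

    void kill(const std::string& loader) { dead.insert(loader); }

    // Frees everything owned by loaders that are dead at the time of the call.
    void gc() {
        for (const auto& loader : dead) used.erase(loader);
        dead.clear();
    }

    // Words owned by loaders that died after the last GC.
    std::size_t unreachable_words() const {
        std::size_t sum = 0;
        for (const auto& loader : dead) {
            auto it = used.find(loader);
            if (it != used.end()) sum += it->second;
        }
        return sum;
    }
};

struct ScenarioResult {
    bool first_attempt;                // result of thread A's allocation at (2)
    bool retry;                        // result of thread A's allocation at (6)
    std::size_t unreachable_at_retry;  // dead-but-uncollected words at (6)
};

ScenarioResult run_scenario() {
    ToyMetaspace ms;
    ms.limit = 100;

    bool filled = ms.allocate("L1", 100);    // (1) other threads fill the metaspace
    assert(filled);
    ms.kill("L1");                           // their loader dies
    bool first = ms.allocate("A", 10);       // (2) thread A hits the hard limit
    ms.gc();                                 // (3)-(4) GC runs, pressure is released
    bool refilled = ms.allocate("L2", 100);  // (5) other threads take the freed space
    assert(refilled);
    ms.kill("L2");                           // those loaders die, too late for the GC at (4)
    bool retry = ms.allocate("A", 10);       // (6) thread A retries and fails again

    return ScenarioResult{first, retry, ms.unreachable_words()};
}
```

In this model both attempts of thread A fail, although at the time of the retry all 100 words belong to a dead loader and a second GC would free them.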

In other words, the contract is that we only throw an OOM if we really tried everything, but since the effects of the first GC are "stale" it does not count as a try?

Do you think this is a realistic problem?

Do I understand your patch right in that you divide allocations into two priority classes and add another lock, MetaspaceCritical_lock, which blocks normal allocations as long as critical allocations are queued?

Sorry if I am slow :)

---

One problem I see is that Metaspace::purge is not the full purge. Reclaiming metaspace happens in two stages:
1) in CLDG::purge, we delete all `ClassLoaderMetaspace` objects belonging to dead loaders. This releases all their metaspace to the freelists, optionally uncommitting portions of it (since JEP387).
2) in Metaspace::purge, we go through Metaspace and munmap any mappings which are now completely vacant.

The metaspace pressure release already happens in (1), so any concurrent thread allocating will benefit already. 

---

Why do we even need a queue? Why could we not just let the first thread attempting a synchronous GC block the metaspace allocation path for all threads, including others running into a limit, until the GC is finished and that thread has had its first-allocation right served?
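
What I have in mind is roughly the following. This is a minimal sketch in standard C++, not HotSpot code and not a claim about how the patch works: `AllocationGate`, its method names and `demo_order` are made up for illustration, and a real version would have to use the VM's own locks and deal with safepoints.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical gate, not HotSpot code: while one thread owns a synchronous
// GC, every other allocating thread waits until the owner has retried.
class AllocationGate {
public:
    // Called by a thread that hit the limit. Returns true if the caller
    // became the GC owner. Returns false if another thread already owned
    // the GC, in which case the call first waited for that owner to finish.
    bool begin_critical() {
        std::unique_lock<std::mutex> lock(mutex_);
        if (gc_owner_active_) {
            released_.wait(lock, [this] { return !gc_owner_active_; });
            return false;
        }
        gc_owner_active_ = true;
        return true;
    }

    // Called by the owner after its GC and after its allocation retry.
    void end_critical() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            gc_owner_active_ = false;
        }
        released_.notify_all();
    }

    // Called on the normal allocation path before touching metaspace.
    void wait_until_open() {
        std::unique_lock<std::mutex> lock(mutex_);
        released_.wait(lock, [this] { return !gc_owner_active_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable released_;
    bool gc_owner_active_ = false;
};

// Records the order in which the owner's retry and another thread's
// allocation happen when the owner holds the gate.
std::vector<std::string> demo_order() {
    AllocationGate gate;
    std::vector<std::string> order;

    bool owner = gate.begin_critical();   // thread A hits the limit first
    assert(owner);

    std::thread other([&] {
        gate.wait_until_open();           // cannot pass while A owns the GC
        order.push_back("other thread allocates");
    });

    order.push_back("A retries after GC");  // GC and retry happen here
    gate.end_critical();                    // only now are others let through
    other.join();
    return order;
}
```

With such a gate the owner's retry always comes before any other allocation, so nobody can consume the space released by its GC in between. The obvious cost is that all metaspace allocation stalls for the duration of that GC.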

Thanks, Thomas

-------------

PR: https://git.openjdk.java.net/jdk/pull/2289