RFR: 8292697: ZGC: Hangs when almost out of metaspace memory [v2]

Thu Aug 25 06:09:30 UTC 2022

> HotSpot performs "critical metaspace allocations" when it's running out of metaspace. The failed allocations are registered in a queue, which the GC prioritizes when cleaning up the metaspace. There's a race in the code that handles these requests.
> 
> These requests are added to the queue, and the GC will "process" each request in turn when it runs the metaspace purge phase. The queue handling has an optimization: only the first request in the queue needs to trigger the GC; all subsequent requests in the queue will wait for that GC. When the GC gets to the purge phase, it marks all requests as processed. Note that this doesn't mean that the request was satisfied; the result could be NULL (in which case the thread will trigger a last-effort GC before it throws an OOME).
> 
> The bug is in the code that determines if a request is responsible for triggering a new GC. The current code just checks if the current request is first in the queue. This doesn't work if the code is called just after the GC has run purge, but before the old requests have been removed. The new request sees that there are already elements in the queue, so it doesn't trigger the GC. And at the same time, the old requests have been processed and they won't trigger the GC either. So, now the new request is waiting for a GC that will not be triggered by anyone.
> 
> Note: The reason why there's a delay between the GC processing a request and its removal from the queue is that the Java thread that added the request is also responsible for removing the request from the queue. The reason for this is that the mentioned last-effort GC needs to be able to process the request a second time.
> 
> The proposed fix is to let threads adding new requests check if the added request is the first *non-processed* request in the queue. If it is, that request/thread is responsible for triggering the GC for itself and for any subsequently added requests (until the GC runs the next round of request processing).
> 
> However, there's a catch to this proposal. The request processing is done inside `Metaspace::purge()`, and that function is skipped if the GC didn't unload any classes. The proposed logic relies on the request processing always being run when a GC is running. So, I've also changed the code so that the GC unconditionally calls the request processing. An alternative would be to always run the `Metaspace::purge()` code. That might even help return temporarily allocated metaspace memory earlier, but I've left that exercise for a potential future improvement.
> 
> I've also tweaked the test so that we get a bit more info if this test fails again.
> 
> Testing: I could reliably reproduce the original hang on my MacBook within a few minutes. With this fix I can run the test in a loop for hours without reproducing the hang. I've tested this together with the Generational ZGC code, running tier1-tier7 on Linux x64. I've started more extensive testing on openjdk/jdk.
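
The race and the proposed check described above can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions, not the HotSpot code: `Request`, `RequestQueue`, `is_first`, `is_first_unprocessed`, `process_requests`, and `replay_race` are hypothetical names invented for illustration, and the real implementation uses locking and waiting that are omitted here. The model replays the problematic interleaving: the GC has marked an old request as processed, the owning thread has not yet removed it, and a new request is added.

```cpp
#include <deque>

// Hypothetical stand-in for a critical metaspace allocation request.
struct Request {
    int id;
    bool processed;  // set by the GC when it processes the queue
};

// Hypothetical stand-in for the request queue.
using RequestQueue = std::deque<Request>;

// Original check (as described in the text): a request triggers the GC
// only if it is the first element in the queue.
bool is_first(const RequestQueue& queue, int id) {
    return !queue.empty() && queue.front().id == id;
}

// Proposed check: a request triggers the GC if it is the first
// *non-processed* element in the queue.
bool is_first_unprocessed(const RequestQueue& queue, int id) {
    for (const Request& r : queue) {
        if (!r.processed) {
            return r.id == id;
        }
    }
    return false;  // every queued request has already been processed
}

// GC side: mark every queued request as processed. Marking does not
// remove the request; removal is left to the thread that added it.
void process_requests(RequestQueue& queue) {
    for (Request& r : queue) {
        r.processed = true;
    }
}

struct Outcome {
    bool old_check_triggers;  // does request 2 trigger a GC with is_first?
    bool new_check_triggers;  // ... and with is_first_unprocessed?
};

// Replays the interleaving from the description.
Outcome replay_race() {
    RequestQueue queue;
    queue.push_back(Request{1, false});  // old request, triggered a GC
    process_requests(queue);             // GC processed it; not yet removed
    queue.push_back(Request{2, false});  // new request arrives in the gap
    return Outcome{is_first(queue, 2), is_first_unprocessed(queue, 2)};
}
```

In this model, `replay_race()` reports that the original check does not make request 2 trigger a GC (request 1 is still at the head of the queue), while the first-non-processed check does. The sketch also shows why the proposal depends on request processing running in every GC cycle: if `process_requests` were skipped, stale unprocessed requests would stay ahead of new ones.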

Stefan Karlsson has updated the pull request incrementally with one additional commit since the last revision:

  Update test text

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/9985/files
  - new: https://git.openjdk.org/jdk/pull/9985/files/aa89d763..0186424d

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=9985&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9985&range=00-01

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/9985.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9985/head:pull/9985

PR: https://git.openjdk.org/jdk/pull/9985