RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]
Thomas Stuefe
stuefe at openjdk.java.net
Fri Jan 29 11:26:40 UTC 2021
Hi Erik,
thanks for the extensive explanations!
One issue with your patch just came to me: the block-on-allocate may be too early. `Metaspace::allocate()` is a bit hot. I wonder about the performance impact of taking and releasing a lock on each individual allocation, even if it is uncontended. Ideally I'd like to keep this path as close to a simple pointer-bump allocation as possible (which it unfortunately isn't).
It is also not necessary: the majority of callers satisfy their allocation from already-committed arena-local memory, so they are well behaved and steal from no one; blocking them would be unnecessary. I estimate only about 1:60 to 1:1000 calls would actually need that lock.
Allocation happens (roughly) in these steps:
1 try to allocate from the arena-local free block list
2 try to allocate from the arena-local current chunk without committing new memory
3 try to enlarge the chunk in place and/or commit more chunk memory, and allocate from the current chunk
4 get a new chunk from the freelist
(1) and (2) don't bother anyone; the hot path is typically (2). Only from (3) onward could concurrently released memory be used, so (1) and (2) can still happen before your block.
All that happens inside `MetaspaceArena::allocate`:
https://github.com/openjdk/jdk/blob/0675473486bc0ee321654d308b600874cf5ce41e/src/hotspot/share/memory/metaspace/metaspaceArena.cpp#L225
But code-wise (2) and (3) are a bit entangled, so the code would have to be massaged a bit to cleanly separate (2) from (3); see the rough sketch below.
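Roughly what I have in mind, just as a sketch (the helper names are invented for illustration and don't exist in the tree):

```c++
// Sketch only: try_allocate_from_local_memory() and allocate_expensively()
// are invented names, not functions in the tree.
MetaWord* MetaspaceArena::allocate(size_t word_size) {
  // Steps (1) and (2): the free block list and the already-committed part
  // of the current chunk. No new memory is committed here, so these
  // callers cannot steal memory a GC has just released for a starving thread.
  MetaWord* p = try_allocate_from_local_memory(word_size);
  if (p != NULL) {
    return p;
  }
  // Steps (3) and (4): enlarge/commit chunk memory or take a new chunk
  // from the freelist. Only here would the critical-allocation barrier
  // need to be respected (e.g. wait while critical allocations are queued).
  return allocate_expensively(word_size);
}
```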
Please find more remarks inline.
> Hi Thomas,
>
> Thanks for chiming in! I will reply inline.
>
> > Hi Erik,
> > let's see if I understand the problem:
> > 1 n threads allocate metaspace
> > 2 thread A gets an allocation error (not HWM but a hard limit)
> > 3 .. returns, (eventually) schedules a synchronous GC.
> > 4 .. gc runs, the CLDG is at some point purged, releases metaspace pressure
> > 5 other threads load classes, allocating metaspace concurrently, benefitting from the released pressure
> > 6 thread A reattempts allocation, fails again.
>
> That's the one.
>
> > This is normally not a problem, no?
>
> Indeed, and that was my gut feeling when the current handling was written. I wouldn't expect an actual application to ever hit this problem. Nevertheless, I think it's a soundness problem with completely unbounded starvation, even though it doesn't happen in real life. So I think I would still like to fix the issue.
>
> It is definitely a bit arbitrary though where we decide to draw the line of what we guarantee, and what the balance between exactness and maintainability should be. My aim is to try hard enough so we don't rely on luck (random sleeps) to decide whether a program that shouldn't be even close to OOM fails or not, even though you have to be _very_ unlucky for it to fail. But I am not interested in perfect guarantees either, down to the last allocation. My balance is that I allow throwing OOM prematurely if we are "really close" to being down to the last allocation OOM, but if you are not "really close", then no sleep in the world should cause a failure.
I think I get this now. IIUC, the problem is that memory release happens with a delay, so we carry around a baggage of "potentially free" memory which needs a collector run to materialize. Many threads jumping up and down, loading and unloading classes, drive up the metaspace use rate and increase that "potentially free" overhead, right? So the thing is to time collector runs right.
One concern I have is that if the customer runs with too tight a limit, we may bounce from full GC to full GC, always scraping the barrel just enough to keep going (maybe collecting some short-lived loaders) but never enough to get the VM into clear waters. I think this may be an issue today already. What is unclear to me is when it would be better to just give up and throw an OOM, to motivate the customer to increase the limits.
>
> > Which thread exactly gets the OOM if the VM hovers that close to the limit does not really matter. But you write "This can cause a situation where almost the entire metaspace is unreachable from roots, yet we throw OOM." - so we could get OOMs even if most of the Metaspace were vacant? This can only happen if, between (4) and (6), other threads not only allocate metaspace, but also then lose the loaders used for those allocations, too late for the GC at (4) to collect them but before (5). Collecting them would require another GC.
>
> Right, and that is indeed what the test does. It loads chunks of 1000 classes and releases them, assuming that surely after releasing them, I can allocate more classes. Unless of course starvation ruins the day.
>
> > In other words, the contract is that we only throw an OOM if we really tried everything, but since the effects of the first GC are "stale" it does not count as try?
>
> The previous contract was that we try to allocate again after a full GC, and if that fails we give up. The issue is that this guarantee allows the GC to free up 99% of metaspace, yet still fail the allocation due to races with other threads doing the same thing. So at any given point, it might be that only 1% of metadata is reachable, yet an OOM can be produced if you are "unlucky".
>
> > Do you think this is a realistic problem?
>
> It is realistic enough that one stress test has failed in the real world. Yet I don't think any application out there will run into any issue. But I prefer having a sound solution where we can know that and not rely on probability.
>
> > Do I understand your patch right in that you divide allocations in two priority classes, add another lock, MetaspaceCritical_lock, which blocks normal allocations as long as critical allocations are queued?
>
> Yes, that's exactly right.
>
> > Sorry if I am slow :)
>
> Not at all!
>
> > One problem I see is that Metaspace::purge is not the full purge. Reclaiming metaspace happens in two stages:
> > ```
> > 1. in CLDG::purge, we delete all `ClassLoaderMetaspace` objects belonging to dead loaders. This releases all their metaspace to the freelists, optionally uncommitting portions of it (since JEP387).
> >
> > 2. in Metaspace::purge, we go through Metaspace and munmap any mappings which are now completely vacant.
> > ```
> >
> >
> > The metaspace pressure release already happens in (1), so any concurrent thread allocating will benefit already.
>
> Ah. I thought it was all done in 2. I can move the Big Fat Lock to cover all of CLDG::purge instead. What do you think? It just needs to cover the entire thing basically.
Why not just cover the whole synchronous GC collect call? I'd put that barrier up as early as possible, to prevent as many threads as possible from entering the more expensive fail path. At that point we know we are near exhaustion. Any thread allocating could just as well wait inside MetaspaceArena::allocate. If the GC succeeds in releasing lots of memory, they will not have been disturbed much.
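To illustrate what I mean, here is a toy sketch of such a gate, completely outside of HotSpot and with invented names; normal allocators would only hit it on the expensive path (3)/(4) from above:

```c++
#include <condition_variable>
#include <mutex>

// Toy illustration of the blocking idea, not HotSpot code. A thread that
// failed its allocation and triggers a synchronous GC raises the gate;
// normal allocations that would have to commit new memory wait until all
// such "critical" threads have had their post-GC retry served.
class MetaspaceGate {
  std::mutex _lock;
  std::condition_variable _cv;
  int _critical_in_progress = 0;

public:
  // Called by a thread entering the failed-allocation path, before it
  // schedules the synchronous GC.
  void enter_critical() {
    std::lock_guard<std::mutex> g(_lock);
    _critical_in_progress++;
  }

  // Called by the same thread once its post-GC retry has been served
  // (successfully or with an OOM).
  void leave_critical() {
    {
      std::lock_guard<std::mutex> g(_lock);
      _critical_in_progress--;
    }
    _cv.notify_all();
  }

  // Called on the normal allocation path before committing new memory;
  // fast-path allocations (steps (1) and (2)) never get here.
  void wait_while_critical() {
    std::unique_lock<std::mutex> g(_lock);
    _cv.wait(g, [this] { return _critical_in_progress == 0; });
  }
};
```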
>
> > Why do we even need a queue? Why could we not just let the first thread attempting a synchronous gc block metaspace allocation path for all threads, including others running into a limit, until the gc is finished and it had its first-allocation-right served?
>
> Each "critical" allocation rides on one particular GC cycle, that denotes the make-or-break point of the allocation.
I feel like I should know this, but if multiple threads enter satisfy_failed_metadata_allocation around the same time and call a synchronous collect(), they would wait on the same GC, right? They won't start individual GCs for each thread?
> In order to prevent starvation, we have to satisfy all critical allocations that have their make-or-break GC cycle associated with the current purge() operation before we release the lock in purge(), letting new allocations in, or we will rely on luck again. However, the pending critical allocations may all have different make-or-break GC cycles associated with them. So in purge() some of them need to be satisfied, and others do not, yet can happily get their allocations satisfied opportunistically if possible. So we need to make sure they are ordered somehow, such that the earliest arriving pending critical allocations are satisfied first, before subsequent critical allocations (possibly waiting for a later GC), or we can get premature OOM situations again, where a thread releases a bunch of memory, expecting to be able to allocate, yet fails due to races with various threads.
> The queue basically ensures the ordering of critical allocation satisfaction is sound, so that the pending critical allocations with the associated make-or-break GC being the one running purge(), are satisfied first, before satisfying (opportunistically) other critical allocations, that are really waiting for the next GC to happen.
I still don't get why the order of the critical allocations matters. I understand that even with your barrier, multiple threads can fail the initial allocation, enter the "satisfy_failed_metadata_allocation()" path, and now their allocations count as critical, since if they fail again they will throw an OOM. But why does the order of the critical allocations among themselves matter? Why not just let the critical allocations trickle out unordered? Is the relation to the GC cycle not arbitrary?
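For what it's worth, this is roughly how I picture one entry of your queue, heavily simplified and purely hypothetical, just so we are sure we mean the same thing:

```c++
#include <cstddef>
#include <cstdint>

// Hypothetical, heavily simplified sketch of my mental model of a queue
// entry; not the code in the patch. Each pending critical allocation
// remembers the GC cycle that is its make-or-break point, and purge()
// walks the list in arrival order, satisfying the entries whose cycle has
// completed before opportunistically trying the later ones.
struct CriticalAllocation {
  size_t              word_size;  // size of the failed metaspace allocation
  uint64_t            gc_count;   // the make-or-break GC cycle it rides on
  void*               result;     // filled in during purge() if satisfied
  CriticalAllocation* next;       // singly linked, FIFO arrival order
};
```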
>
> Thanks,
> /Erik
>
We need an "erik" pr command :)
Just FYI, I have very vague plans to extend usage of the metaspace allocator to other areas, to recoup the implementation cost. E.g. one candidate may be a replacement of the current resource areas, which are just more primitive arena-based allocators. This is very vague and not a high priority, but I am a bit interested in keeping the code generic if it's not too much effort. But I don't see your patch causing any problems there.
Cheers, Thomas
-------------
PR: https://git.openjdk.java.net/jdk/pull/2289