RFR: 8259643: ZGC can return metaspace OOM prematurely [v2]

Thomas Stuefe stuefe at openjdk.java.net
Mon Mar 8 14:43:09 UTC 2021


On Mon, 8 Mar 2021 14:03:18 GMT, Erik Österlund <eosterlund at openjdk.org> wrote:

>> Hi Erik,
>> 
>> sorry for the delay, I got swamped. 
>> 
>> Thanks for your patient explanations. I think I understand most of it now, but I still have a number of questions. Also see the code remarks, though most are just more questions.
>> 
>>> > One issue with your patch just came to me: the block-on-allocate may be too early. `Metaspace::allocate()` is a bit hot. I wonder about the performance impact of pulling and releasing a lock on each atomic allocation, even if it's uncontended. Ideally I'd like to keep this path as close to a simple pointer-bump allocation as possible (which it isn't, unfortunately).
>>> 
>>> I have a global flag that denotes there being at least one critical allocation in the system. It is set by the first critical allocation, and cleared by the GC if all critical allocations could be satisfied. The global lock is only taken in Metaspace::allocate() if said flag is on. Normal apps should never see this flag being on. So the only overhead in the common case is checking the flag and seeing that no locking is required. I think that should be fast enough, right? And when you are in critical mode, you obviously have more to worry about than lock contention, in terms of how the system is performing.
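>>> To make the fast path concrete, here is a minimal standalone sketch of the idea (all names are hypothetical, and it uses std:: primitives instead of HotSpot's own atomics and locks):
>>>
>>>     #include <atomic>
>>>     #include <cstddef>
>>>     #include <mutex>
>>>
>>>     // Set by the first critical allocation; cleared by the GC once all
>>>     // pending critical allocations have been satisfied.
>>>     std::atomic<bool> has_critical_allocations{false};
>>>     std::mutex critical_lock; // guards the critical allocation queue
>>>
>>>     void* metaspace_allocate(std::size_t word_size) {
>>>       if (has_critical_allocations.load(std::memory_order_acquire)) {
>>>         // Slow path, taken only while critical allocations are pending:
>>>         // allocate under the lock so queued criticals cannot be starved.
>>>         std::lock_guard<std::mutex> guard(critical_lock);
>>>         return nullptr; // placeholder: locked allocation
>>>       }
>>>       // Common fast path: one flag load, no lock, near pointer-bump speed.
>>>       return nullptr; // placeholder: plain allocation
>>>     }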
>> 
>> I overlooked the flag check. I think this is fine then in its current form.
>> 
>>> > 
>>> > I think I get this now. IIUC, the problem is that memory release happens with a delay, and we carry around a baggage of "potentially free" memory which needs a collector run to materialize. So many threads jumping up and down, doing class loading and unloading, drive up the metaspace use rate and increase the "potentially free" overhead, right? So the trick is to time collector runs right.
>>> 
>>> Exactly.
>>> 
>>> > One concern I have is that if the customer runs with too tight a limit, we may bounce from full GC to full GC, always scraping the barrel just enough to keep going - maybe collecting some short-lived loaders - but never enough to get the VM into clear waters. I think this may be an issue today already. What is unclear to me is when it would be better to just give up and throw an OOM, to motivate the customer to increase the limits.
>>> 
>>> Right - this is a bit of a philosophical one. There is always a balance between precision, code complexity, and when to put the VM out of its misery when it is performing poorly. We deal with the same trade-off with heap allocations, which is why I am also solving the starvation problem in pretty much the same way: with a queue satisfied by the GC, and locking out starvation. Then we throw OOM under fairly similar conditions. What they have in common is that when they throw, we will have a live portion of metaspace that is "pretty much" full, and there is no point in continuing, while allocations that merely hit unfortunate timing on a less full (in terms of temporal liveness) metaspace will always succeed.
>>> 
>>> One might argue that the trade-off should be moved in some direction, or that it is okay for it to be more or less exact, but I was hoping that by doing the same dance we do for heap OOM situations, we can at least follow a trade-off that is well established and has worked well for many years. And I think heap OOM situations in the wild are far more common than metaspace OOM, so I don't think the metaspace OOM mechanism needs to do better than what the heap OOM mechanism does. If that makes sense.
>> 
>> It makes sense. If it's an established, proven pattern, let's use it here too. It's not that complex.
>> 
>> I think there are differences between heap and metaspace in elasticity. Metaspace is way less "spongy"; the chance of recovering metaspace is slimmer than with the heap, so a Full GC is more expensive in relation to its effects. I think I have seen series of Full GCs on quite empty heaps when customers set MaxMetaspaceSize too low (people seem to like doing that). I'm worried about small loaders with short lifetimes which just allow the VM to reclaim enough to keep going - reflection invocation loaders, or projects like JRuby. But this is not the norm, nor does it have anything to do with your patch. Typically we do just one futile GC and then throw OOM.
>> 
>>> > 
>>> > Why not just cover the whole synchronous GC collect call? I'd put that barrier up as early as possible, to prevent as many threads as possible from entering the more expensive fail path. At that point we know we are near exhaustion. Any thread allocating could just as well wait inside MetaspaceArena::allocate. If the GC succeeds in releasing lots of memory, they will not have been disturbed much.
>>> 
>>> Do you mean
>>> 
>>> 1. why I don't hold the MetaspaceCritical_lock across the collect() call at the mutator side of the code, or
>>> 2. why I don't hold the MetaspaceCritical_lock across the entire GC cycle instead of purge?
>>> 
>>> I think you meant 2), so I will answer that:
>>> a) Keeping the lock across the entire GC cycle is rather problematic when it comes to constraining lock ranks. It would have to be above any lock ever needed in an entire GC cycle, yet we take a bunch of other locks that mutators hold at different points, interacting with class unloading. It would be very hard to find the right rank for this.
>>> b) The GC goes in and out of safepoints, and sometimes needs to block out safepoints. Holding locks across safepoint transitions while also blocking safepoints in and out is, in my experience, rather deadlock-prone.
>>> c) There is no need to keep the lock across more than the operation that frees metaspace memory, which in a concurrent GC always happens when safepoints are blocked out. If a mutator succeeds during a concurrent GC due to finding memory in a local free list or something, while another allocation failed and needs a critical allocation, then that is absolutely fine, as the successful allocation is never what causes the failing allocation to fail.
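>>> In other words, the critical section can be as small as the purge step itself. Roughly (continuing the hypothetical standalone sketch from above):
>>>
>>>     // Only the phase that actually frees metaspace needs the lock; the
>>>     // rest of the concurrent GC cycle runs without it.
>>>     void gc_purge_metaspace() {
>>>       std::lock_guard<std::mutex> guard(critical_lock);
>>>       // ... free the metaspace of unloaded class loaders ...
>>>       // ... then satisfy pending critical allocations, oldest first ...
>>>     }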
>> 
>> Thanks, that makes sense. I was completely off track, thinking more in the direction of (1), with respect to the current coding, not your patch. I thought that if the mutator thread locks across the collect call and the subsequent allocation attempt (which would have to be some sort of priority allocation, ignoring the lock), this would be a simple solution. But that has a number of holes, never mind the allocation request ordering.
>> 
>>> 
>>> > > > Why do we even need a queue? Why could we not just let the first thread attempting a synchronous gc block metaspace allocation path for all threads, including others running into a limit, until the gc is finished and it had its first-allocation-right served?
>>> > > 
>>> > > 
>>> > > Each "critical" allocation rides on one particular GC cycle, which denotes the make-or-break point of the allocation.
>>> > 
>>> > 
>>> > I feel like I should know this, but if multiple threads enter satisfy_failed_metadata_allocation around the same time and call a synchronous collect(), they would wait on the same GC, right? They won't start individual GCs for each thread?
>>> 
>>> The rule we have in ZGC to ensure what we want in terms of OOM situations is that GCs _requested_ before an equivalent GC _starts_ can be satisfied by that same GC cycle.
>>> 
>>> Let's go through what will likely happen in practice with my solution when you have, let's say 1000 concurrent calls to satisfy a failing metaspace allocation.
>>> 
>>> 1. Thread 1 registers its allocation, sees it is the first one, and starts a metaspace GC.
>>> 2. GC starts running
>>> 3. Threads 2-1000 register their allocations, and see that there was already a critical allocation before them. This causes them to wait, opportunistically, for the GC to purge, riding on that first GC.
>>> 4. The GC satisfies allocations. For the sake of example, let's say that allocations 1-500 could be satisfied, but not the rest.
>>> 5. Threads 2-500, who were waiting for purge to finish, wake up and run off happily with their satisfied allocations.
>>> 6. Threads 501-1000 wake up, seeing that their allocations didn't get satisfied. They now stop being opportunistic, and each requests a GC before finally giving up.
>>> 7. The first GC cycle finishes.
>>> 8. Thread 1 wakes up after the entire first GC cycle is done and sees its satisfied allocation, happily running off with it.
>>> 9. The next GC cycle starts
>>> 10. The next GC cycle successfully satisfies metadata allocations for threads 501-750, but not the rest.
>>> 11. The entire next GC cycle finishes, satisfying the GC requested by threads 2-1000, as they all _requested_ a metaspace GC, _before_ it started running. Therefore, no further GC will run.
>>> 12. Threads 501-750 wake up, happily running off with their satisfied allocations.
>>> 13. Threads 751-1000 wake up, grumpy about the lack of memory after their GC. They are all gonna throw.
>>> 
>>> So basically, if they can all get satisfied with 1 GC, then 1 GC will be enough. But we won't throw until a thread has had a full GC run _after_ it was requested, and multiple threads can ride on the same GC there. In this example, threads 2-1000 all ride on the same GC.
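>>> The request/start rule can be modeled with two monotonic cycle counters (again a hypothetical standalone sketch, not the actual ZGC bookkeeping). The ticket only governs when a thread gives up and throws; as in the walkthrough above, an allocation can still be satisfied opportunistically before its ticket cycle finishes.
>>>
>>>     #include <atomic>
>>>     #include <cstdint>
>>>
>>>     std::atomic<std::uint64_t> cycles_started{0};
>>>     std::atomic<std::uint64_t> cycles_finished{0};
>>>
>>>     // The ticket is the first cycle guaranteed to start after the
>>>     // request: the next cycle if no GC is running, or the cycle after
>>>     // the current one if a GC is already in progress.
>>>     std::uint64_t request_metaspace_gc() {
>>>       return cycles_started.load() + 1;
>>>     }
>>>
>>>     // A thread throws OOM only once its ticket cycle has finished
>>>     // without its allocation having been satisfied.
>>>     bool make_or_break_done(std::uint64_t ticket) {
>>>       return cycles_finished.load() >= ticket;
>>>     }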
>> 
>> This is a nice, elegant solution. I had some trouble understanding your explanation: when you write "threads 2-1000 all ride on the same GC", this confused me, since threads 2-500 were lucky and were satisfied by the purge in GC cycle 1. So I would have expected threads 501-1000 to ride on the second GC, and threads 1-500 on the first. Unless by "ride" you mean "guaranteed to be processed"?
>> 
>>> 
>>> Note, though, that we would never allow an already running GC to satisfy a GC request that comes in after that GC has started, as we then wouldn't catch situations where a thread releases a lot of memory and then expects it to be available just after.
>>> 
>>> > > In order to prevent starvation, we have to satisfy all critical allocations whose make-or-break GC cycle is associated with the current purge() operation before we release the lock in purge() and let new allocations in, or we will rely on luck again. However, the pending critical allocations may all have different make-or-break GC cycles associated with them. So in purge(), some of them need to be satisfied, while others do not, yet can happily have their allocations satisfied opportunistically if possible. So we need to make sure they are ordered, such that the earliest arriving pending critical allocations are satisfied first, before subsequent critical allocations (possibly waiting for a later GC); otherwise we can get premature OOM situations again, where a thread releases a bunch of memory, expects to be able to allocate, yet fails due to races with other threads.
>>> > > The queue basically ensures that the ordering of critical allocation satisfaction is sound: the pending critical allocations whose make-or-break GC is the one running purge() are satisfied first, before (opportunistically) satisfying other critical allocations that are really waiting for the next GC to happen.
>>> > 
>>> > 
>>> > I still don't get why the order of the critical allocations matters. I understand that even with your barrier, multiple threads can fail the initial allocation and enter the "satisfy_failed_metadata_allocation()" path, and their allocations now count as critical, since if they fail again they will throw an OOM. But why does the order of the critical allocations among themselves matter? Why not just let the critical allocations trickle out unordered? Is the relation to the GC cycle not arbitrary?
>>> 
>>> If we don't preserve the order, we would miss situations where a thread releases a large chunk of metaspace (e.g. by releasing a class loader reference) and then expects that memory to be available. A critical allocation associated with a subsequent GC could starve a thread associated with the current GC cycle, causing a premature OOM for that thread, even though the later allocation isn't really needed until the next GC cycle; satisfying them in the right order would satisfy both allocations.
>>> 
>>> One might argue that it is okay to throw in such a scenario with an unordered solution; we are pretty close to running out of memory anyway, I guess. But I'm not really sure what such a solution would look like in more detail, and writing this little list was easy enough, and for me easier to reason about, partially because we do the same dance in the GC to deal with heap OOM, which has been a success.
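>>> To sketch the ordering argument in the same hypothetical standalone style (the real patch keeps its queue behind MetaspaceCritical_lock): the queue is kept in arrival order and purge() walks it front to back, stopping at the first allocation it cannot satisfy, so a later-arriving request can never starve an earlier one.
>>>
>>>     #include <cstddef>
>>>     #include <cstdint>
>>>     #include <deque>
>>>
>>>     void* try_allocate(std::size_t) { return nullptr; } // stub for the real allocator
>>>
>>>     struct CriticalRequest {
>>>       std::size_t   word_size;
>>>       std::uint64_t ticket;           // make-or-break GC cycle
>>>       void*         result = nullptr;
>>>     };
>>>
>>>     std::deque<CriticalRequest*> critical_queue; // in arrival order
>>>
>>>     // Called from the purge step, with the critical lock held.
>>>     void satisfy_pending_critical_allocations() {
>>>       for (CriticalRequest* req : critical_queue) {
>>>         req->result = try_allocate(req->word_size);
>>>         if (req->result == nullptr) {
>>>           break; // preserve order: later requests must not jump the queue
>>>         }
>>>       }
>>>       // Waiters are then woken: a thread whose ticket cycle has finished
>>>       // and whose result is still null throws OOM; the rest retry.
>>>     }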
>> 
>> I think I get this now. It's still a brain teaser if you are not used to the pattern, but I think I got most of it. If the pattern is well established and proven, it makes sense.
>> 
>> Thanks for the great explanations! 
>> 
>>> 
>>> Thanks,
>>> /Erik
>>> 
>>> ^--- I know this will be treated as a PR bot command, but I can't give up on my slash!
>>> 
>> 
>> Alias it to integrate :)
>> 
>>> 
>>> > Just FYI, I have very vague plans to extend the usage of the metaspace allocator to other areas, to get back the cost of the implementation. E.g. one candidate may be the replacement of the current resource areas, which are just more primitive arena-based allocators. This is very vague and not a high priority, but I am a bit interested in keeping the coding generic if it's not too much effort. But I don't see your patch causing any problems there.
>>> 
>>> That sounds very interesting. We had internal discussions about riding on the MetaspaceExpand lock, which I believe would also work, but thought this really ought to be a separate thing, so that we don't tie ourselves too tightly to the internal implementation of the allocator. Given that you might want to use the allocator for other stuff, it sounds like that is indeed the right choice.
>> 
>> Yes, I think this is definitely better. The expand lock (I renamed it to just Metaspace_lock, since it's also used for reclamation and other stuff) is used in a more fine-grained fashion. I cannot see it working in the same way as the critical lock, protecting the queue and preventing entry for non-priority allocations.
>> 
>> Thanks, Thomas
>
>> Hi Erik,
>> 
>> sorry for the delay, I got swamped.
> 
> Hi Thomas,
> 
> Sorry I also got swamped and still am swamped. Hope it is okay if I come back to this a bit later; there are higher priority items for me to deal with at the moment.
> 
> Thanks
> -- Erik

Hi Erik,

please feel free to commit this, and answer my question at your leisure, if at all. I am fine with your change as it is (now that I understand it :)

Cheers, Thomas

-------------

PR: https://git.openjdk.java.net/jdk/pull/2289

