RFC: Safe OOM during evac

Thu Oct 19 10:27:23 UTC 2017

Hi all,

I want to outline the problem that we have with OOM during evacuation,
summarize the options we have come up with so far for handling it
correctly, and describe one approach that I have in mind.

The problem appears when a Java thread gets into a write-barrier and
fails to evacuate the object because it has run out of memory (e.g. both
GCLAB and shared evac are exhausted). In this case we still need to
ensure that every thread ends up with one and the same copy of the
object, even if that copy is the one in from-space, otherwise we risk
inconsistency (a subsequent write may end up in the wrong copy).
However, there might still be another Java thread which succeeds in
evacuating that same object at about the same time because it still has
GCLAB space left.
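
To make the race concrete, here is a toy Java model of it. This is not
Shenandoah code: ToyObject, EvacRace and both methods are names I made
up for illustration, and heap objects are simply modelled as Java
objects whose forwarding pointer initially points to themselves.

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy heap object: carries a forwarding pointer that starts out
// pointing to the object itself (hypothetical model, not HotSpot code).
final class ToyObject {
    final String space; // "from" or "to"
    final AtomicReference<ToyObject> fwd = new AtomicReference<>();

    ToyObject(String space) {
        this.space = space;
        fwd.set(this);
    }
}

final class EvacRace {
    // Write-barrier slow path of a thread that still has memory:
    // make a copy, then CAS the forwarding pointer from "self" to it.
    static ToyObject evacuate(ToyObject obj) {
        ToyObject copy = new ToyObject("to");
        if (obj.fwd.compareAndSet(obj, copy)) {
            return copy;
        }
        return obj.fwd.get(); // lost the race, use the winner's copy
    }

    // Write-barrier slow path of a thread that ran out of memory:
    // it has no copy to install, so it can only return whatever the
    // forwarding pointer says right now.
    static ToyObject evacuateOom(ToyObject obj) {
        return obj.fwd.get();
    }
}
```

If evacuateOom() runs first and evacuate() on another thread runs right
after it, the first thread holds the from-space object and the second
one holds the to-space copy, which is exactly the inconsistency
described above.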

Here's what we came up with so far in IRC discussions:

- Throw OOME. This is the absolute minimum solution, and we should
probably just do that right now until we have implemented a better one
(and we might even use it as a fallback for solutions that are not 100%
bulletproof). This is better IMO than pretending we're ok and risking
heap inconsistencies.

- Make write-barrier slow-path/runtime-calls non-leaf calls. Then we
could just safepoint and do a full-GC *while we're in the barrier*. This
would be the most correct solution. Unfortunately it would make it very
hard to optimize the write-barriers in C2, and the performance impact is
likely not acceptable. The problem here is that we need debug info at
the call sites, and C2 maintains debug info only at certain points in
the ideal graph. Consequently, we could move write-barriers only to such
points and not as freely as we can now. We may try to do a prototype
(again) and see how far Roland can take it though.

- Keep an evacuation reserve that we use only for evacuations, or maybe
even only for write-barriers, or maybe even only as a fallback for
write-barriers that OOM'ed. This very likely solves it for 99.999..% of
the cases, but discussions on IRC have shown that it is very hard to
come up with a 100% safe upper bound for the reserve size, i.e. one that
lets us prove that OOM during evac can never happen. We might combine
this with solution #1 though: make it safe in all but the most extreme
pathological cases, and throw OOME if we hit a wall. I am still not very
happy with the prospect of failing in extremely rare cases, possibly in
production environments under high pressure.

- Extend the Brooks pointer protocol to prevent concurrent evacs. Let me
outline my idea here:

If a write-barrier runs OOM, we need to prevent other threads from
successfully evacuating 'our' object. We can do so by CASing an
'impossible' value into its Brooks pointer: either the CAS succeeds,
which guarantees that other threads fail to install their Brooks ptr
afterwards, *OR* it fails and gives us the other thread's copy (which
would be fine too). Problem: we need to deal with that special value
everywhere else, most importantly in read-barriers. The best thing I
could come up with so far is to use $OBJECT_ADDR | 1 as the blocker
value, i.e. CAS-set the lowest bit in the self-pointing Brooks ptr. This
can easily be decoded in read-barriers (and all other relevant code) by
masking out that lowest bit using AND ~1. Full-GC would reset the Brooks
ptr to its normal value.

I don't have a good feeling for what the performance impact would be.
Something similar happens when decoding compressed oops, and that is
commonly accepted (but is less frequent). The actual Brooks-ptr load
probably dominates the masking, so we might not even notice.

On the upside, this makes the oom_during_evac() path truly non-blocking:
we don't need to wait for GC workers, for other Java threads, for
evacuation to be turned off or any such thing (which also means it truly
complies with the non-blocking requirement for leaf calls). I believe
it's a correct solution too: no from-space copy can slip through it. I
can imagine coming up with a prototype for this and making it optional
(behind a flag) so that we can measure its impact, or even combine it
with any of the other options we have (e.g. evac-reserve).
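
A minimal sketch of the blocker-bit idea, in Java for readability. It is
not the actual barrier code: BlockerBitSketch and its methods are
hypothetical names, addresses are modelled as plain longs, the
forwarding slot is an AtomicLong, and I assume object addresses are at
least 2-byte aligned so that bit 0 is free.

```java
import java.util.concurrent.atomic.AtomicLong;

final class BlockerBitSketch {
    // Lowest bit of the forwarding word marks "evacuation blocked".
    static final long BLOCKED = 1L;

    // What a read-barrier would do: mask out the blocker bit (AND ~1).
    static long resolve(long fwdWord) {
        return fwdWord & ~BLOCKED;
    }

    // OOM path of the write-barrier: try to flip the self-pointing
    // forwarding word from objAddr to objAddr | 1. No waiting involved.
    static long oomDuringEvac(long objAddr, AtomicLong fwdSlot) {
        fwdSlot.compareAndSet(objAddr, objAddr | BLOCKED);
        // Either the slot is now blocked (resolves to objAddr), or some
        // other thread installed its copy first and we get that copy.
        return resolve(fwdSlot.get());
    }

    // Normal evacuation: try to install copyAddr. The CAS expects the
    // unblocked self pointer, so it fails once the blocker bit is set.
    // (A real implementation would also release the unused copy.)
    static long evacuate(long objAddr, long copyAddr, AtomicLong fwdSlot) {
        fwdSlot.compareAndSet(objAddr, copyAddr);
        return resolve(fwdSlot.get());
    }
}
```

With a slot that starts at 0x1000, calling oomDuringEvac(0x1000, slot)
and then evacuate(0x1000, 0x2000, slot) makes both calls return 0x1000,
so everybody agrees on the from-space object. In the opposite order both
return 0x2000, i.e. the OOM'ed thread picks up the other thread's copy.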