RFC: Safe OOM during evac
Roman Kennke
rkennke at redhat.com
Thu Oct 19 10:27:23 UTC 2017
Hi all,
I want to outline the problem that we have with OOM during evacuation,
summarize the options we have so far for handling it correctly, and
describe one approach that I have in mind.
The problem appears when a Java thread gets into a write-barrier and
fails to evacuate the object because it has run out of memory (e.g. both
GCLAB and shared evac are exhausted). In this case we still need to
ensure that all threads agree on a single copy of the object, even if
it's the from-space copy, otherwise we risk inconsistency (a subsequent
write may end up in the wrong object copy). However, another Java thread
might still succeed in evacuating that same object at about the same
time because it still has GCLAB left.
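To make the race concrete, here is a minimal C++ sketch. It is not
Shenandoah code: the Obj layout, the fwd field and write_barrier() are
hypothetical simplifications (the real Brooks pointer is a word in front
of the object), and passing nullptr simply models a thread whose
allocation failed.

```cpp
#include <atomic>

// Hypothetical, simplified object: a forwarding (Brooks) pointer plus one field.
struct Obj {
    std::atomic<Obj*> fwd;  // points to the object itself until it is evacuated
    int payload;
    explicit Obj(int p) : fwd(this), payload(p) {}
};

// Simplified write-barrier slow path. `copy` is the to-space memory this
// thread managed to allocate, or nullptr if it ran out of memory.
// Returns the object the calling thread will write to.
Obj* write_barrier(Obj* from, Obj* copy) {
    Obj* current = from->fwd.load();
    if (current != from) {
        return current;          // somebody already evacuated the object
    }
    if (copy == nullptr) {
        return from;             // OOM: naive fallback to the from-space copy
    }
    copy->payload = from->payload;   // speculative copy into to-space
    Obj* expected = from;
    if (from->fwd.compare_exchange_strong(expected, copy)) {
        return copy;             // we installed the forwarding pointer
    }
    return expected;             // lost the race: use the winner's copy
}
```

If thread A calls write_barrier(&from, nullptr) and gets the from-space
copy back, and thread B then calls write_barrier(&from, &to) and wins,
a later write through A's pointer never reaches B's copy. That is the
inconsistency described above.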
Here's what we came up with so far in IRC discussions:
- Throw OOME. This is the absolute minimum solution, and we should
probably just do that right now until we have implemented a better one
(and we might even use it as a fallback for solutions that are not 100%
foolproof). This is better IMO than pretending we're ok and risking heap
inconsistencies.
- Make write-barrier slow-path/runtime-calls non-leaf calls. Then we
could just safepoint and do a full-GC *while we're in the barrier*. This
would be the most correct solution. Unfortunately it would make it very
hard to optimize the write-barriers in C2, and the performance impact is
likely not acceptable. We may try to do a prototype (again) and see how
far Roland can take it though. The problem here is that we need debug
info at the call sites, and C2 maintains debug info only at certain
points in the ideal graph. Consequently, we could move write-barriers
only to such points and not as freely as we can now.
- Keep an evacuation reserve that we use only for evacuations, or maybe
even only for write-barriers, or maybe even only as a fallback for
write-barriers that OOM'ed. This very likely solves it for 99.999..% of
the cases, but discussions on IRC have shown that it is very hard to
come up with a 100% safe upper bound for this reserve size, one that
would let us prove that OOM during evac can never happen. We might
combine this with solution #1 though: i.e. make it safe in all but the
most extreme pathological cases, and throw OOME if we hit a wall. I am
still not very happy with the prospect of failing in extremely rare
cases, possibly in production environments under high pressure.
- Extend the Brooks pointer protocol to prevent concurrent evacs. Let me
outline my idea here:
If a write barrier runs OOM, we need to prevent other threads from
successfully evacuating 'our' object. We can do so by CASing an
'impossible' value into its Brooks pointer: this guarantees that either
our CAS succeeds and other threads then fail to install a Brooks ptr,
*OR* our CAS fails and gives us the other thread's copy (which would be
fine too). Problem: we need to deal with that special value everywhere
else, most importantly in read-barriers.
The best thing I could come up with so far is to use $OBJECT_ADDR | 1 as
the blocker value, i.e. CAS-set the lowest bit in the self-pointing
Brooks ptr. This can easily be decoded in read-barriers (and all other
relevant code) by masking out that lowest bit using AND ~1. Full-GC
would reset the Brooks ptr to its normal value. I don't have a good
feeling for what the performance impact would be. Something similar
happens for decoding compressed oops, and that is commonly accepted (but
is less frequent). The actual Brooks-ptr load probably dominates the
masking, so we might not even notice. On the upside, this makes the
oom_during_evac() path truly non-blocking: we don't need to wait for GC
workers, for other Java threads, for evacuation to be turned off, or any
such thing (which also means it truly complies with the requirement that
leaf calls be non-blocking). I believe it's a correct solution too: no
from-space copy can slip through it. I can imagine coming up with a
prototype for this and making it optional (behind a flag) so that we can
measure its impact, or even allow combining it with any of the other
options we have (e.g. evac-reserve).
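A rough C++ sketch of the blocker-bit idea, to show how the pieces would
fit together. Everything here is an assumption made for illustration:
the Obj layout, BLOCKED_BIT, resolve() and try_evacuate() are made-up
names, and oom_during_evac() only borrows the name used above, not any
real signature. It relies on objects being at least 2-byte aligned so
that the lowest address bit is always free.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical object: the Brooks pointer is kept as a raw word so that
// its lowest bit can carry the "evacuation blocked" marker.
struct Obj {
    std::atomic<std::uintptr_t> fwd;
    int payload;
    explicit Obj(int p)
        : fwd(reinterpret_cast<std::uintptr_t>(this)), payload(p) {}
};

constexpr std::uintptr_t BLOCKED_BIT = 1;

inline std::uintptr_t addr(Obj* o) {
    return reinterpret_cast<std::uintptr_t>(o);
}

// Read barrier: load the Brooks pointer and mask out the blocker bit (AND ~1).
Obj* resolve(Obj* o) {
    return reinterpret_cast<Obj*>(o->fwd.load() & ~BLOCKED_BIT);
}

// Normal evacuation: CAS the self-pointer to the new copy. The CAS fails if
// another thread already evacuated the object or blocked its evacuation.
Obj* try_evacuate(Obj* from, Obj* copy) {
    copy->payload = from->payload;   // speculative copy into to-space
    std::uintptr_t expected = addr(from);
    if (from->fwd.compare_exchange_strong(expected, addr(copy))) {
        return copy;
    }
    // `expected` now holds either the winner's copy or from|1.
    return reinterpret_cast<Obj*>(expected & ~BLOCKED_BIT);
}

// OOM path: CAS the self-pointer to self|1 so that no later evacuation of
// this object can succeed. No waiting on other threads is involved.
Obj* oom_during_evac(Obj* from) {
    std::uintptr_t expected = addr(from);
    from->fwd.compare_exchange_strong(expected, addr(from) | BLOCKED_BIT);
    // Success: `expected` is still the from-space address.
    // Failure: `expected` is the other thread's copy, or from|1 if some
    // other thread blocked the object first.
    return reinterpret_cast<Obj*>(expected & ~BLOCKED_BIT);
}
```

Whichever CAS wins, every thread ends up with the same object: either
all of them see the from-space copy (blocked), or all of them see the
single to-space copy. The cost is the extra mask in resolve(), which is
the part whose performance impact would need to be measured.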