RFR/RFC: Make OOM-during-evacuation race-free
Roman Kennke
rkennke at redhat.com
Mon Oct 23 12:44:26 UTC 2017
We have that long-standing single-ugliest spot in Shenandoah: the
failure path for when write-barriers run out of memory. We
added a band-aid for it in the form of spin-waiting until all worker
threads have settled, and then returning with a read-barrier. This
'solves' it for almost all cases, but it's still racy: two Java threads
could compete to evacuate the same object; one fails with OOM and thus
potentially returns the from-space copy, while the other may succeed
(because it still has room left in its GCLAB) and makes/returns a to-space
copy, thus potentially causing inconsistencies.
We cannot trigger a safepoint (and do a full-gc) while we're in the
write-barrier because we're inside a no-leaf call and don't have any
debug info. (In fact, per contract, we should not even block until the
workers have settled.) We cannot make the write-barriers regular leaf calls
because this would very seriously affect our ability to optimize
barriers in C2. We cannot even throw an OOME because we have no debug info.
We still have this idea floating around to allocate a reserve. But we
still haven't reached a conclusion on how much we actually need to be 100%
safe. (This seems to be a similarly hard problem to waiting for all
other threads to settle down. We can solve it for 99.9999...% of cases,
but so far we have failed to come up with a 100% solution due to the use of
GCLABs, related waste, use of multiple threads, the required ability to
roll back, or else account for non-rollback waste, etc etc).
I therefore propose a different method to make this failure path 100%
safe: when a thread runs out of memory during evacuation, we install an
'evac blocker' word in the object's Brooks pointer. I am using the
self-reference ORed with just 1; in other words, I am flipping the
lowest (otherwise unused) bit of the Brooks pointer. Since we never
ask to CAS a fwd pointer with an expected old-value that has this
bit set, all future requests to evacuate that object
must fail until we clear that bit again. OR, another thread may succeed
in evacuating the object in question, in which case we'd get that evacuated
object and we're fine. This is safe, it's atomic, it doesn't block, it
finally makes the no-leaf write-barrier slow-path call valid, and we can
sleep at night because we no longer have this minuscule chance of failing
with OOM-during-evac in production, etc ;-)
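To illustrate the protocol described above, here is a minimal toy model in
Java. It is a sketch only and not the code in the webrev: the class name,
the method names, and the use of plain long values as addresses are all
assumptions made for illustration. The real implementation lives in HotSpot
C++ and also has to roll back the speculative copy when a CAS is lost,
which this model leaves out.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy model (hypothetical names): one object with a tagged forwarding word. */
final class EvacBlockerSketch {
    // Lowest bit of the forwarding word; assumed unused because addresses are aligned.
    static final long BLOCKER_BIT = 1L;

    final long self;      // pretend from-space address of this object
    final AtomicLong fwd; // forwarding word, initially the self-reference

    EvacBlockerSketch(long self) {
        this.self = self;
        this.fwd = new AtomicLong(self);
    }

    /** Strip the blocker bit so callers never see a tagged address. */
    static long mask(long word) {
        return word & ~BLOCKER_BIT;
    }

    /** Try to publish a to-space copy; returns the address all threads agree on. */
    long evacuate(long toSpaceCopy) {
        // The expected value is always the untagged self-reference,
        // so a word carrying the blocker bit can never match.
        if (fwd.compareAndSet(self, toSpaceCopy)) {
            return toSpaceCopy;
        }
        return mask(fwd.get());
    }

    /** Called when allocating the copy failed: try to block further evacuation. */
    long failEvacuation() {
        // If this CAS loses, another thread either evacuated the object
        // or installed the blocker first; both outcomes are consistent.
        fwd.compareAndSet(self, self | BLOCKER_BIT);
        return mask(fwd.get());
    }

    /** Clear the blocker again, for example once the cycle has been cancelled. */
    void clearBlocker() {
        fwd.compareAndSet(self | BLOCKER_BIT, self);
    }
}
```

In this model, whichever CAS wins decides the outcome for everybody: either
all threads see the to-space copy, or all threads see the from-space object
until the blocker is cleared.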
This change requires that we mask away any lowest bit that could
possibly be set in read- and write-barriers, such that we *never* hand
out that lowest bit to code that doesn't expect it. This potentially
impacts performance of read-barriers. I have run and posted comparison
benchmarks [0], which don't show any measurable performance impact.
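As a rough illustration of what the masking amounts to, the extra work in a
barrier is a single AND on the loaded forwarding word. The following is a
hypothetical Java model, not the generated barrier code; the names are made
up and addresses are modelled as aligned long values.

```java
/** Toy model (hypothetical names) of a read barrier that masks the lowest bit. */
final class MaskedReadBarrierSketch {
    // Assumed to be the bit used as the evac blocker.
    static final long BLOCKER_BIT = 1L;

    /** Resolve a loaded forwarding word to an address that is safe to hand out. */
    static long resolve(long fwdWord) {
        // Untagged words pass through unchanged; tagged ones lose the blocker bit.
        return fwdWord & ~BLOCKER_BIT;
    }
}
```

Because aligned addresses have a zero lowest bit anyway, the mask is a no-op
for every word that does not carry the blocker.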
All this stuff is enabled by -XX:+ShenandoahSafeOOMDuringEvac which
defaults to false for now.
I also added a flag -XX:+ShenandoahOOMDuringEvacALot which simulates
frequent alloc failures in the write-barrier. This, in conjunction with
aggressive heuristics, helped me tremendously to iron out bugs in my
implementation. I have added such test runs to GCBasher, GCOld and
GCLocker tests. Notice that this wasn't enough to provoke an actual race
in the old code, at least not in my experiments, but it should make it
much more likely.
I also added a solution to the original problem that triggered all this
discussion: when our pinning code gets an evac failure, it goes on and
attempts to pin a from-space region. Using this new safe way out, we may
actually do that, but only on the cancelled path. Later when we're
unpinning the object, we flip the region back to regular. As far as I
can tell, this is ok. Please let me know if you find a counter-example.
I would like to get your comments on the patch, and approvals from
everybody (Christine, Aleksey, Zhengyu and Roland) before pushing. In
particular, I'd be happy if Aleksey could run some read-barrier heavy
gc-benchmarks to measure the (worst-case) impact of masking in the
read-barriers.
http://cr.openjdk.java.net/~rkennke/safe-oom-during-evac/webrev.02/
Cheers, Roman
[0]
http://mail.openjdk.java.net/pipermail/shenandoah-dev/2017-October/004118.html