RFR/RFC: Make OOM-during-evacuation race-free

Roman Kennke rkennke at redhat.com
Mon Oct 23 12:44:26 UTC 2017


We have a long-standing, single ugliest spot in Shenandoah: the failure 
path for when write-barriers run out of memory. We added a band-aid for 
it in the form of spin-waiting until all worker threads have settled, 
and then returning with a read-barrier. This 'solves' it for almost all 
cases, but it's still racy: two Java threads could compete to evacuate 
the same object; one fails with OOM and thus potentially returns the 
from-space copy, while the other may succeed (because it still has room 
left in its GCLAB) and makes/returns a to-space copy, thus potentially 
causing inconsistencies.

We cannot trigger a safepoint (and do a full-gc) while we're in the 
write-barrier because we're inside a no-leaf call and don't have any 
debug info. (In fact, per contract, we should not even block until the 
workers have settled.) We cannot make the write-barrier regular leaf calls 
because this would very seriously affect our ability to optimize 
barriers in C2. We cannot even throw an OOME because we have no debug info.

We still have the idea floating around of allocating a reserve. But we 
still haven't reached a conclusion on how much we actually need to be 
100% safe. (This seems to be a similarly hard problem to waiting for all 
other threads to settle down. We can solve it for 99.9999...% of cases, 
but so far we have failed to come up with a 100% solution due to the use 
of GCLABs, related waste, use of multiple threads, the required ability 
to roll back, or else account for non-rollback waste, etc.)

I therefore propose a different method to make this failure path 100% 
safe: when a thread runs out of memory during evacuation, we install an 
'evac blocker' word in the object's brooks pointer. I am using the 
self-reference ORed with just 1, in other words, I am flipping the 
lowest (otherwise unused) bit of the brooks pointer. Since we never 
ask to CAS a fwd pointer with an expected old-value that has this bit 
set, all future requests to evacuate that object must fail until we 
clear that bit again. Or another thread may succeed in evacuating the 
object in question, in which case we'd get that evacuated object and 
we're fine. This is safe, it's atomic, it doesn't block, it finally 
makes the no-leaf write-barrier slow-path call valid, and we can sleep 
at night because we don't have this minuscule chance of failing with 
OOM-during-evac in production, etc ;-)
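
To make the CAS protocol described above concrete, here is a minimal 
standalone sketch. It is not the code from the webrev: the names Obj, 
try_evacuate, block_evacuation and clear_blocker are made up for 
illustration, and a plain std::atomic word stands in for the brooks 
pointer. It assumes objects are at least 2-byte aligned, so the lowest 
bit of a real address is always zero.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a heap object with a Brooks forwarding word.
struct Obj {
    std::atomic<std::uintptr_t> fwd;  // self-reference while not evacuated
    int payload;
    explicit Obj(int p) : fwd(reinterpret_cast<std::uintptr_t>(this)), payload(p) {}
    std::uintptr_t self() const { return reinterpret_cast<std::uintptr_t>(this); }
};

// Assumed encoding: self-reference ORed with 1 means "evacuation blocked".
constexpr std::uintptr_t EVAC_BLOCKER_BIT = 1;

// Strip the blocker bit so callers never see it.
inline Obj* strip(std::uintptr_t word) {
    return reinterpret_cast<Obj*>(word & ~EVAC_BLOCKER_BIT);
}

// Normal path: publish `copy` as the forwardee. The expected value is always
// the clean self-reference, so the CAS fails if the object is blocked or
// has already been forwarded by another thread.
Obj* try_evacuate(Obj* from, Obj* copy) {
    std::uintptr_t expected = from->self();
    if (from->fwd.compare_exchange_strong(expected, copy->self())) {
        return copy;  // we won the race
    }
    // On failure `expected` holds the current word: another thread's
    // forwardee, or the blocked self-reference (which strips to `from`).
    return strip(expected);
}

// OOM path: no memory for a copy, so try to install the blocker word.
Obj* block_evacuation(Obj* from) {
    const std::uintptr_t self = from->self();
    std::uintptr_t expected = self;
    if (from->fwd.compare_exchange_strong(expected, self | EVAC_BLOCKER_BIT)) {
        return from;  // blocked: everybody keeps using the from-space copy
    }
    // Either already blocked (yields `from`) or another thread evacuated
    // the object first (yields the to-space copy).
    return strip(expected);
}

// Later, once it is safe again, flip the bit back.
void clear_blocker(Obj* from) {
    std::uintptr_t expected = from->self() | EVAC_BLOCKER_BIT;
    from->fwd.compare_exchange_strong(expected, from->self());
}
```

Whichever thread loses a CAS simply adopts the winner's answer, so all 
threads agree on one copy of the object without blocking.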

This change requires that we mask away any lowest bit that could 
possibly be set in read- and write-barriers, such that we *never* hand 
out that lowest bit to code that doesn't expect it. This potentially 
impacts performance of read-barriers. I have run and posted comparison 
benchmarks [0], which don't show any measurable performance impact.
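
As a rough illustration of what the masking amounts to, a read-barrier 
that resolves through the forwarding word could look like the sketch 
below. HeapObj, brooks and read_barrier are hypothetical names for this 
example only; the real barriers are emitted by the interpreter and the 
JIT compilers, not written like this.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical object layout: a forwarding word plus some payload.
struct HeapObj {
    std::atomic<std::uintptr_t> brooks;  // self-reference until evacuated
    int value;
    explicit HeapObj(int v)
        : brooks(reinterpret_cast<std::uintptr_t>(this)), value(v) {}
};

// Mask that clears only the lowest bit of a forwarding word.
constexpr std::uintptr_t LOW_BIT_MASK = ~static_cast<std::uintptr_t>(1);

// Read-barrier sketch: load the forwarding word and mask away the lowest
// bit, so a blocked self-reference resolves to the plain object address.
inline HeapObj* read_barrier(HeapObj* obj) {
    std::uintptr_t word = obj->brooks.load(std::memory_order_acquire);
    return reinterpret_cast<HeapObj*>(word & LOW_BIT_MASK);
}
```

The extra cost per barrier in this sketch is a single AND on the loaded 
word.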

All this stuff is enabled by -XX:+ShenandoahSafeOOMDuringEvac which 
defaults to false for now.

I also added a flag -XX:+ShenandoahOOMDuringEvacALot which simulates 
frequent alloc failures in the write-barrier. This, in conjunction with 
aggressive heuristics, helped me tremendously to iron out bugs in my 
implementation. I have added such test runs to GCBasher, GCOld and 
GCLocker tests. Notice that this wasn't enough to provoke an actual race 
in the old code, at least not in my experiments, but it should make it 
much more likely.
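
For readers unfamiliar with this kind of stress flag, the general idea 
of injecting allocation failures can be sketched as follows. This is 
only an assumption about the technique in general: FailingAllocator and 
its every-Nth-request policy are invented for the example and say 
nothing about how the flag in the patch actually decides when to fail.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical allocator wrapper that reports a simulated OOM on every
// Nth request, so the failure path gets exercised far more often than
// it would under real memory pressure.
struct FailingAllocator {
    std::size_t fail_every;                 // 0 disables injection
    std::atomic<std::size_t> requests{0};   // total requests seen so far

    explicit FailingAllocator(std::size_t n) : fail_every(n) {}

    void* allocate(std::size_t bytes) {
        std::size_t n = requests.fetch_add(1) + 1;
        if (fail_every != 0 && n % fail_every == 0) {
            return nullptr;  // simulated allocation failure
        }
        return std::malloc(bytes);
    }
};
```

A caller treats a nullptr result exactly like a real out-of-memory 
condition and takes its failure path.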

I also added a solution to the original problem that triggered all this 
discussion: when our pinning code gets an evac failure, it goes on and 
attempts to pin a from-space region. Using this new safe way out, we may 
actually do that, but only on the cancelled path. Later when we're 
unpinning the object, we flip the region back to regular. As far as I 
can tell, this is ok. Please let me know if you find a counter-example.

I would like to get your comments on the patch, and approvals from 
everybody (Christine, Aleksey, Zhengyu and Roland) before pushing. In 
particular, I'd be happy if Aleksey could run some read-barrier heavy 
gc-benchmarks to measure the (worst-case) impact of masking in the 
read-barriers.

http://cr.openjdk.java.net/~rkennke/safe-oom-during-evac/webrev.02/ 

Cheers, Roman

[0] 
http://mail.openjdk.java.net/pipermail/shenandoah-dev/2017-October/004118.html


