RFR/RFC: Make OOM-during-evacuation race-free

Mon Oct 23 15:39:39 UTC 2017

After some discussions on IRC, we decided to take the pinning stuff out
of this change. Pinning needs some more careful consideration and a new
region state, i.e. we need to allow cset -> pinned_cset, pinned_cset
back to cset when unpinning, and pinned_cset to pinned when taking the
region out of the cset.
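
To make the three transitions above concrete, here is a rough sketch. The enum and function names are made up for illustration and are not the names used in the webrev; the sketch only answers for the transitions listed above and says nothing about the rest of the region state machine.

```cpp
// Hypothetical region states; only the ones mentioned above are modeled.
enum class RegionState { CSet, PinnedCSet, Pinned };

// Returns true only for the three pinning-related transitions named above:
//   cset -> pinned_cset         (pinning a region that is in the cset)
//   pinned_cset -> cset         (unpinning)
//   pinned_cset -> pinned       (taking the region out of the cset)
// Any other pair is simply outside the scope of this sketch.
constexpr bool is_pinning_transition(RegionState from, RegionState to) {
    switch (from) {
        case RegionState::CSet:
            return to == RegionState::PinnedCSet;
        case RegionState::PinnedCSet:
            return to == RegionState::CSet || to == RegionState::Pinned;
        default:
            return false;
    }
}
```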

So here comes the reduced patch:
http://cr.openjdk.java.net/~rkennke/safe-oom-during-evac/webrev.03/ 

WDYT?

Roman

> We have that long-standing single-ugliest spot in Shenandoah that was 
> the failure path for when write-barriers would run out-of-memory. We 
> added some band-aid for it in the form of spin-waiting until all 
> worker threads have settled, and then returning with a read-barrier, 
> and this 'solves' it for almost all cases, but it's still racy: two
> Java threads could compete to evacuate the same object; one fails with
> OOM and thus potentially returns the from-space copy, while the other
> may succeed (because it still has room left in its GCLAB) and
> makes/returns a to-space copy, thus potentially causing inconsistencies.
>
> We cannot trigger a safepoint (and do a full-gc) while we're in the 
> write-barrier because we're inside a no-leaf call and don't have any 
> debug info. (In fact, per contract, we should not even block until 
> workers settled). We cannot make the write-barrier regular leaf calls 
> because this would very seriously affect our ability to optimize 
> barriers in C2. We cannot even throw an OOME because we have no debug 
> info.
>
> We have this idea still floating around to allocate a reserve. But we 
> still haven't reached a conclusion on how much we actually need to be
> 100% safe. (This seems to be a similarly hard problem to waiting for
> all other threads to settle down. We can solve it for 99.9999...% of
> cases, but so far we have failed to come up with a 100% solution due
> to use of GCLABs, related waste, use of multiple threads, the required
> ability to roll back, or else account for non-rollback waste, etc. etc.).
>
> I therefore propose a different method to make this failure path 100% 
> safe: when a thread runs out of memory during evacuation, we install 
> an 'evac blocker' word in the object's brooks pointer. I am using the 
> self-reference ORed with just 1, in other words, I am flipping the 
> lowest (otherwise unused) bit of the brooks pointer. Since we're never 
> asking to CAS a fwd pointer with an expected old-value that has this
> bit set, this means that all future requests to evacuate that object
> must fail until we clear that bit again. OR, another thread may
> succeed in evacuating the object in question, in which case we'd get
> that evacuated object and we're fine. This is safe, it's atomic, it doesn't
> block, it finally makes the no-leaf write-barrier slow-path call 
> valid, and we can sleep at night because we don't have this minuscule
> chance of failing with OOM-during-evac in production, etc ;-)
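
The protocol in the paragraph above can be sketched roughly as follows. This is a standalone illustration using std::atomic, not the HotSpot code from the webrev: `Obj`, `try_forward` and `block_evacuation` are hypothetical names, and the real forwarding pointer lives in front of the object rather than in a struct field.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for an object with a Brooks forwarding pointer.
// The forwarding word initially points to the object itself.
struct Obj {
    std::atomic<std::uintptr_t> fwd;
    Obj() : fwd(reinterpret_cast<std::uintptr_t>(this)) {}
};

// Lowest bit of the forwarding word; unused because objects are aligned.
constexpr std::uintptr_t EVAC_BLOCKED = 1;

inline std::uintptr_t word_of(Obj* o) {
    return reinterpret_cast<std::uintptr_t>(o);
}

inline Obj* strip_blocker(std::uintptr_t w) {
    return reinterpret_cast<Obj*>(w & ~EVAC_BLOCKED);
}

// Normal path: a copy `to` has been allocated, try to install it.
// The expected value is the plain self-reference, which never has the
// blocker bit set, so this CAS fails once the object is blocked.
Obj* try_forward(Obj* from, Obj* to) {
    std::uintptr_t expected = word_of(from);
    if (from->fwd.compare_exchange_strong(expected, word_of(to))) {
        return to;                       // we installed our copy
    }
    // Lost the race: `expected` now holds either another thread's copy
    // or the blocker word (self-reference | 1). Mask the bit either way.
    return strip_blocker(expected);
}

// OOM path: no copy could be allocated, so try to install the blocker.
Obj* block_evacuation(Obj* from) {
    std::uintptr_t expected = word_of(from);
    if (from->fwd.compare_exchange_strong(expected, expected | EVAC_BLOCKED)) {
        return from;                     // blocked: everybody uses from-space
    }
    // Someone else already blocked it (result is `from`) or managed to
    // evacuate it (result is their to-space copy).
    return strip_blocker(expected);
}
```

Whichever CAS wins, every thread ends up returning the same object: either the single to-space copy, or the from-space original with the blocker bit set. Clearing the bit again later is not shown here.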
>
> This change requires that we mask away any lowest bit that could 
> possibly be set in read- and write-barriers, such that we *never* hand 
> out that lowest bit to code that doesn't expect it. This potentially 
> impacts performance of read-barriers. I have run and posted comparison
> benchmarks [0], which don't show any measurable performance impact.
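
As a rough illustration of the masking described above (again a standalone sketch with hypothetical names such as `HeapObj` and `read_barrier`, not the actual barrier code, which is emitted by the interpreter and compilers):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical object layout: a forwarding word plus some payload.
struct HeapObj {
    std::atomic<std::uintptr_t> fwd;
    int payload;
    explicit HeapObj(int p)
        : fwd(reinterpret_cast<std::uintptr_t>(this)), payload(p) {}
};

// The bit used as evac blocker; assumed to be the lowest bit.
constexpr std::uintptr_t BLOCKER_BIT = 1;

// Read barrier: load the forwarding word and mask away the lowest bit,
// so callers always get a well-formed pointer, blocked or not.
inline HeapObj* read_barrier(HeapObj* obj) {
    std::uintptr_t w = obj->fwd.load(std::memory_order_relaxed);
    return reinterpret_cast<HeapObj*>(w & ~BLOCKER_BIT);
}
```

The extra cost over a plain load is the single AND, which is what the benchmarks mentioned above are meant to measure.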
>
> All this stuff is enabled by -XX:+ShenandoahSafeOOMDuringEvac which 
> defaults to false for now.
>
> I also added a flag -XX:+ShenandoahOOMDuringEvacALot which simulates 
> frequent alloc failures in the write-barrier. This, in conjunction 
> with aggressive heuristics, helped me tremendously to iron out bugs in 
> my implementation. I have added such test runs to GCBasher, GCOld and 
> GCLocker tests. Notice that this wasn't enough to provoke an actual 
> race in the old code, at least not in my experiments, but it should 
> make it much more likely.
>
> I also added a solution to the original problem that triggered all this
> discussion: when our pinning code gets an evac failure, it goes on and 
> attempts to pin a from-space region. Using this new safe way out, we 
> may actually do that, but only on the cancelled path. Later when we're 
> unpinning the object, we flip the region back to regular. As far as I 
> can tell, this is ok. Please let me know if you find a counter-example.
>
> I would like to get your comments on the patch, and approvals from 
> everybody (Christine, Aleksey, Zhengyu and Roland) before pushing. In 
> particular, I'd be happy if Aleksey could run some read-barrier heavy 
> gc-benchmarks to measure the (worst-case) impact of masking in the 
> read-barriers.
>
> http://cr.openjdk.java.net/~rkennke/safe-oom-during-evac/webrev.02/ 
>
> Cheers, Roman
>
> [0] 
> http://mail.openjdk.java.net/pipermail/shenandoah-dev/2017-October/004118.html
>