RFR/RFC: Make OOM-during-evacuation race-free
Roman Kennke
rkennke at redhat.com
Mon Oct 23 15:39:39 UTC 2017
After some discussion on IRC, we decided to take the pinning changes
out of this patch. Pinning requires more careful consideration and a
new region state: we need to allow cset -> pinned_cset, pinned_cset ->
cset when unpinning, and pinned_cset -> pinned when taking the region
out of the cset.
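To make the three transitions concrete, here is a rough sketch of them as
a tiny state machine. The names RegionState, RegionEvent and
try_transition are made up for illustration and are not taken from the
patch; all other transitions are simply rejected here, which is a
simplification.

```cpp
#include <cassert>

// Hypothetical region states; the names follow the mail, not the patch.
enum class RegionState { Regular, CSet, PinnedCSet, Pinned };

// Hypothetical events that drive the transitions discussed above.
enum class RegionEvent { Pin, Unpin, RemoveFromCSet };

// Returns true and updates `state` if the transition is one of the three
// named above; returns false and leaves `state` untouched otherwise.
inline bool try_transition(RegionState& state, RegionEvent event) {
  switch (state) {
    case RegionState::CSet:
      if (event == RegionEvent::Pin) {             // cset -> pinned_cset
        state = RegionState::PinnedCSet;
        return true;
      }
      return false;
    case RegionState::PinnedCSet:
      if (event == RegionEvent::Unpin) {           // pinned_cset -> cset
        state = RegionState::CSet;
        return true;
      }
      if (event == RegionEvent::RemoveFromCSet) {  // pinned_cset -> pinned
        state = RegionState::Pinned;
        return true;
      }
      return false;
    default:
      return false;  // transitions not discussed here are not modelled
  }
}
```

A region that is pinned while in the cset and then dropped from the cset
ends up in Pinned; unpinning it while it is still in the cset returns it
to CSet.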
So here comes the reduced patch:
http://cr.openjdk.java.net/~rkennke/safe-oom-during-evac/webrev.03/
WDYT?
Roman
> We have that long-standing single-ugliest spot in Shenandoah: the
> failure path for when write-barriers run out of memory. We added a
> band-aid for it in the form of spin-waiting until all worker threads
> have settled, and then returning with a read-barrier. This 'solves'
> it for almost all cases, but it's still racy: two Java threads could
> compete to evacuate the same object; one fails with OOM and thus
> potentially returns the from-space copy, while the other may succeed
> (because it still has room left in its GCLAB) and makes/returns a
> to-space copy, thus potentially causing inconsistencies.
>
> We cannot trigger a safepoint (and do a full-gc) while we're in the
> write-barrier because we're inside a no-leaf call and don't have any
> debug info. (In fact, per contract, we should not even block until
> the workers have settled.) We cannot make the write-barriers regular
> leaf calls because this would very seriously affect our ability to
> optimize barriers in C2. We cannot even throw an OOME because we have
> no debug info.
>
> We still have this idea floating around to allocate a reserve. But we
> still haven't reached a conclusion on how much we actually need to be
> 100% safe. (This seems to be a similarly hard problem to waiting for
> all other threads to settle down. We can solve it for 99.9999...% of
> cases, but so far we have failed to come up with a 100% solution due
> to the use of GCLABs, related waste, use of multiple threads, the
> required ability to roll back, or else account for non-rollback
> waste, etc.)
>
> I therefore propose a different method to make this failure path 100%
> safe: when a thread runs out of memory during evacuation, we install
> an 'evac blocker' word in the object's brooks pointer. I am using the
> self-reference ORed with 1; in other words, I am setting the lowest
> (otherwise unused) bit of the brooks pointer. Since we never ask to
> CAS a fwd pointer with an expected old-value that has this bit set,
> all future requests to evacuate that object must fail until we clear
> that bit again. Or another thread may succeed in evacuating the
> object in question first, in which case we get that evacuated object
> and we're fine. This is safe, it's atomic, it doesn't block, it
> finally makes the no-leaf write-barrier slow-path call valid, and we
> can sleep at night because we don't have this minuscule chance of
> failing with OOM-during-evac in production, etc ;-)
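
The evac-blocker idea can be sketched outside of HotSpot with a plain
std::atomic word standing in for the brooks pointer. This is only a
minimal model under stated assumptions, not the code from the webrev:
Obj, EVAC_BLOCKED, self_word, resolve and evacuate are hypothetical
names, allocation failure is modelled by passing a null copy, and
clearing the bit again is not shown.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an object; only the forwarding word is modelled.
struct Obj {
  std::atomic<std::uintptr_t> fwd;  // brooks pointer, initially a self-reference
  Obj() : fwd(reinterpret_cast<std::uintptr_t>(this)) {}
};

// Lowest bit of the forwarding word, unused because objects are aligned.
constexpr std::uintptr_t EVAC_BLOCKED = 1;

inline std::uintptr_t self_word(const Obj* o) {
  return reinterpret_cast<std::uintptr_t>(o);
}

// Barrier-style resolve: always mask the blocker bit before handing out
// the pointer, so callers never see it.
inline Obj* resolve(const Obj* o) {
  std::uintptr_t w = o->fwd.load(std::memory_order_acquire);
  return reinterpret_cast<Obj*>(w & ~EVAC_BLOCKED);
}

// Tries to evacuate `from`. `copy` is the to-space copy, or nullptr to
// model an allocation failure. Returns the object all threads must use.
inline Obj* evacuate(Obj* from, Obj* copy) {
  std::uintptr_t expected = self_word(from);  // never has the blocker bit set
  std::uintptr_t desired = (copy != nullptr)
      ? self_word(copy)                       // normal case: forward to copy
      : (self_word(from) | EVAC_BLOCKED);     // OOM case: install the blocker
  if (from->fwd.compare_exchange_strong(expected, desired,
                                        std::memory_order_acq_rel,
                                        std::memory_order_acquire)) {
    return (copy != nullptr) ? copy : from;
  }
  // CAS failed: another thread already installed either a to-space copy
  // or the blocker. `expected` now holds that word; mask and return it.
  return reinterpret_cast<Obj*>(expected & ~EVAC_BLOCKED);
}
```

In this model, once one thread has installed the blocker, a second thread
that still has room for a copy fails its CAS and gets the from-space
object back, so both threads agree. If the copy wins the race instead, the
thread that ran out of memory gets the to-space copy.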
>
> This change requires that we mask away any lowest bit that could
> possibly be set in read- and write-barriers, such that we *never* hand
> out that lowest bit to code that doesn't expect it. This potentially
> impacts performance of read-barriers. I have run and posted comparison
> benchmarks [0], which don't show any measurable performance impact.
>
> All this stuff is enabled by -XX:+ShenandoahSafeOOMDuringEvac which
> defaults to false for now.
>
> I also added a flag -XX:+ShenandoahOOMDuringEvacALot which simulates
> frequent alloc failures in the write-barrier. This, in conjunction
> with aggressive heuristics, helped me tremendously to iron out bugs in
> my implementation. I have added such test runs to GCBasher, GCOld and
> GCLocker tests. Notice that this wasn't enough to provoke an actual
> race in the old code, at least not in my experiments, but it should
> make it much more likely.
>
> I also added a solution to the original problem that triggered all
> this discussion: when our pinning code gets an evac failure, it goes
> on and attempts to pin a from-space region. Using this new safe way
> out, we may actually do that, but only on the cancelled path. Later,
> when we're unpinning the object, we flip the region back to regular.
> As far as I can tell, this is ok. Please let me know if you find a
> counter-example.
>
> I would like to get your comments on the patch, and approvals from
> everybody (Christine, Aleksey, Zhengyu and Roland) before pushing. In
> particular, I'd be happy if Aleksey could run some read-barrier heavy
> gc-benchmarks to measure the (worst-case) impact of masking in the
> read-barriers.
>
> http://cr.openjdk.java.net/~rkennke/safe-oom-during-evac/webrev.02/
>
> Cheers, Roman
>
> [0]
> http://mail.openjdk.java.net/pipermail/shenandoah-dev/2017-October/004118.html
>