Fixing the OOM-during-evac
Zhengyu Gu
zgu at redhat.com
Wed Feb 28 14:08:49 UTC 2018
Do you really need OOM_MASK? Would cancelled_concgc() be enough?
So every worker -> work(uint worker_id) -> inc counter -> do work ->
dec counter
Java Thread -> wb -> inc counter -> evac -> dec counter
Java Thread -> wb -> inc counter -> evac oom -> cancel concgc -> dec
counter -> wait counter == 0 -> RB
Right?
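
A minimal sketch of this flow, assuming a plain atomic counter plus a cancellation flag that stands in for cancelled_concgc(). All names below are hypothetical. The flag check on entry is an assumption added here; the flow above does not spell it out.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-ins, not the actual HotSpot fields.
static std::atomic<int>  zg_threads_in_evac{0};
static std::atomic<bool> zg_gc_cancelled{false};  // plays the role of cancelled_concgc()

// Returns true if the caller may evacuate, false if it should use the read barrier.
bool zg_enter_evac() {
  zg_threads_in_evac.fetch_add(1);
  if (zg_gc_cancelled.load()) {      // assumed check: do not start new evacs after OOM
    zg_threads_in_evac.fetch_sub(1);
    return false;
  }
  return true;
}

// Normal exit from the evacuation path.
void zg_leave_evac() {
  int prev = zg_threads_in_evac.fetch_sub(1);
  assert(prev > 0);
  (void)prev;
}

// OOM inside the path: cancel, leave, then wait until everyone else has left.
void zg_on_evac_oom() {
  zg_gc_cancelled.store(true);
  zg_leave_evac();
  while (zg_threads_in_evac.load() != 0) {
    std::this_thread::yield();
  }
  // The caller would now resolve the object through the read barrier.
}
```

With sequentially consistent atomics, a thread that increments after the OOM thread saw zero is guaranteed to observe the flag, so it backs out instead of evacuating.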
Thanks,
-Zhengyu
On 02/28/2018 08:53 AM, Roman Kennke wrote:
> Here's my current prototype which seems to pass initial tests with
> -XX:+ShenandoahOOMDuringEvacALot
>
> http://cr.openjdk.java.net/~rkennke/safe-oom-during-evac-counter.patch
>
> It's slightly dirty. It's likely to be slow because it currently
> enters/leaves the protected section for each object, even for GC
> threads, which should not happen.
>
> Roman
>
> On Wed, Feb 28, 2018 at 12:42 PM, Roman Kennke <rkennke at redhat.com> wrote:
>> While implementing the prototype, I came upon an issue with the
>> protocol: if we get the OOM marker into the counter, we lose the
>> actual counter.
>>
>> The solution is to not CAS a full special value, but mask the current
>> counter with an extra bit and handle/mask that accordingly.
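
A rough illustration of the extra-bit idea, assuming a 32-bit counter whose top bit is reserved as the OOM marker. The names are hypothetical and not taken from the patch, and an atomic fetch_or is used here in place of a CAS loop for brevity.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical layout: bit 31 marks OOM-during-evac, bits 0..30 hold the thread count.
constexpr uint32_t kOomBit = 1u << 31;

inline bool     mb_is_oom(uint32_t v)       { return (v & kOomBit) != 0; }
inline uint32_t mb_thread_count(uint32_t v) { return v & ~kOomBit; }

// Sets the OOM bit without disturbing the count; returns the resulting value.
uint32_t mb_mark_oom(std::atomic<uint32_t>& counter) {
  uint32_t now = counter.fetch_or(kOomBit) | kOomBit;
  assert(mb_is_oom(now));
  return now;
}
```

Because the count lives in the low bits, threads that are already inside can still decrement it after the marker has been set, and waiters can test the masked count against zero.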
>>
>> Roman
>>
>> On Wed, Feb 28, 2018 at 11:42 AM, Roman Kennke <rkennke at redhat.com> wrote:
>>> This issue keeps haunting me. :-)
>>> Over coffee, I had an idea how to solve it. Let me outline it and open
>>> for discussion.
>>>
>>> The issue is that when a Java thread hits OOM while in the
>>> write-barrier, another thread (Java or GC) may still succeed in
>>> evacuating the object. This is racy, because thread#1 may get a
>>> from-space copy and write to this, while other threads may get a
>>> to-space copy and write to that.
>>>
>>> We need to prevent any other thread from evacuating our failed-to-evac
>>> object, or else safely get the other copy.
>>>
>>> My idea is to have a counter for number of threads in the evacuation
>>> path, and as soon as we hit OOM there, wait until the counter drops to
>>> zero, at which point we can be sure to not get the object evacuated
>>> under our feet.
>>>
>>> We need to protect the evacuation path with the following protocol.
>>> 'The evacuation path' is the code around actual evacuation, i.e.
>>> inside the evac-in-progress- and cset-checks, but around the actual
>>> evac. This needs to be done both in fast- and slow-path.
>>>
>>> There is a global counter that shows the number of threads inside the
>>> evac-path, OR a special value (e.g. something negative) to indicate
>>> OOM-during-evac (i.e. no threads are allowed to enter the path).
>>>
>>> Upon entry of the evac-path, any thread will attempt to increase the
>>> counter, using a CAS. Depending on the result of the CAS:
>>> - success: carry on with evac
>>> - failure:
>>> - if offending value is a valid counter, then try again
>>> - if offending value is OOM-during-evac special value: loop until
>>> counter drops to 0, then exit with read-barrier
>>>
>>> Upon exit, any thread will decrease the counter using atomic dec.
>>>
>>> Upon OOM-during-evac, any thread will attempt to CAS OOM-during-evac
>>> special value into the counter. Depending on result:
>>> - success: busy-loop until counter drops to zero, then exit with RB
>>> - failure:
>>> - offender is valid counter update: try again
>>> - offender is OOM-during-evac: busy loop until counter drops to
>>> zero, then exit with RB
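
The entry, exit and OOM steps above could be sketched roughly as follows. This is only an illustration under assumptions: it uses a flag bit rather than a full special value (as the follow-up message in this thread suggests, so that the count is not lost), it folds the OOM thread's own exit into the same CAS, and every name is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

constexpr uint32_t EVAC_OOM_FLAG = 1u << 31;   // hypothetical marker bit
static std::atomic<uint32_t> evac_state{0};    // low bits: threads inside the evac path

inline uint32_t evac_count(uint32_t v) { return v & ~EVAC_OOM_FLAG; }

// Busy-loop until no thread is left inside the evacuation path.
void evac_wait_for_drain() {
  while (evac_count(evac_state.load()) != 0) {
    std::this_thread::yield();
  }
}

// Entry: true means "carry on with evac", false means "exit with read barrier".
bool evac_enter() {
  uint32_t cur = evac_state.load();
  for (;;) {
    if (cur & EVAC_OOM_FLAG) {        // offender is OOM-during-evac
      evac_wait_for_drain();
      return false;
    }
    if (evac_state.compare_exchange_weak(cur, cur + 1)) {
      return true;                    // CAS succeeded
    }
    // CAS failed: cur now holds the offending value, so try again.
  }
}

// Exit: atomic decrement.
void evac_leave() {
  uint32_t prev = evac_state.fetch_sub(1);
  assert(evac_count(prev) > 0);
  (void)prev;
}

// OOM while inside the path: publish the flag, leave, and wait for the others.
void evac_handle_oom() {
  uint32_t cur = evac_state.load();
  for (;;) {
    if (cur & EVAC_OOM_FLAG) {        // another thread already flagged OOM
      evac_leave();
      break;
    }
    // Set the flag and drop our own count in one step.
    if (evac_state.compare_exchange_weak(cur, (cur - 1) | EVAC_OOM_FLAG)) {
      break;
    }
  }
  evac_wait_for_drain();              // afterwards: exit with read barrier
}
```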
>>>
>>> For Java threads, this protocol needs to be done in the fast
>>> (assembly) path too, because they can cause evacs. Or else, we could
>>> decide to disable the fast-path altogether (I was never really sure if
>>> the extra assembly did us much good).
>>>
>>> GC threads don't have to protect every single evacuation, but can
>>> instead do the protocol wholesale: i.e. enter on worker start, and
>>> exit on worker done.
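
One way to picture the wholesale variant for GC workers is a scope object that enters once per worker task and leaves when the task returns. This is a simplified sketch with hypothetical names; it only shows the scoping and leaves out the OOM check on entry.

```cpp
#include <atomic>
#include <cassert>

static std::atomic<int> gcw_threads_in_evac{0};  // hypothetical global counter

// Enters the evac path on construction and leaves on destruction.
class GcwEvacScope {
public:
  GcwEvacScope()  { gcw_threads_in_evac.fetch_add(1); }
  ~GcwEvacScope() {
    int prev = gcw_threads_in_evac.fetch_sub(1);
    assert(prev > 0);
    (void)prev;
  }
  GcwEvacScope(const GcwEvacScope&) = delete;
  GcwEvacScope& operator=(const GcwEvacScope&) = delete;
};

// Stand-in for a worker's work(uint worker_id): one enter/leave for the whole task.
int gcw_worker_task(int n_objects) {
  GcwEvacScope scope;                 // enter on worker start
  int evacuated = 0;
  for (int i = 0; i < n_objects; ++i) {
    ++evacuated;                      // placeholder for evacuating one object
  }
  return evacuated;                   // scope leaves on worker done
}
```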
>>>
>>> Please think hard about this possible solution and try to poke holes
>>> in it. Meanwhile, I'll come up with a prototype.
>>>
>>> Cheers, Roman