Fixing the OOM-during-evac

Zhengyu Gu zgu at redhat.com
Wed Feb 28 14:18:57 UTC 2018


On 02/28/2018 09:08 AM, Zhengyu Gu wrote:
> Do you really need OOM_MASK? Will cancelled_concgc() be enough?

Never mind, that may be a race.

Thanks,

-Zhengyu


> 
> So every worker -> work(uint worker_id) -> inc counter -> do work -> 
> dec counter
> 
> Java Thread -> wb -> inc counter -> evac -> dec counter
> Java Thread -> wb -> inc counter -> evac oom -> cancel concgc -> dec 
> counter -> wait counter == 0 -> RB
> 
> Right?
> 
> Thanks,
> 
> -Zhengyu
> 
> 
> 
> 
> On 02/28/2018 08:53 AM, Roman Kennke wrote:
>> Here's my current prototype which seems to pass initial tests with
>> -XX:+ShenandoahOOMDuringEvacALot
>>
>> http://cr.openjdk.java.net/~rkennke/safe-oom-during-evac-counter.patch
>>
>> It's slightly dirty. It's likely to be slow because it currently
>> enters/leaves the protected section for each object, even for GC
>> threads, which should not happen.
>>
>> Roman
>>
>> On Wed, Feb 28, 2018 at 12:42 PM, Roman Kennke <rkennke at redhat.com> 
>> wrote:
>>> While implementing the prototype, I came upon an issue with the
>>> protocol: if we get the OOM marker into the counter, we lose the
>>> actual counter.
>>>
>>> The solution is to not CAS a full special value, but to set an extra
>>> bit in the current counter and handle/mask that bit accordingly.
>>>
>>> Roman
>>>
>>> On Wed, Feb 28, 2018 at 11:42 AM, Roman Kennke <rkennke at redhat.com> 
>>> wrote:
>>>> This issue keeps haunting me. :-)
>>>> Over coffee, I had an idea how to solve it. Let me outline it and open
>>>> for discussion.
>>>>
>>>> The issue is that when a Java thread hits OOM while in the
>>>> write-barrier, another thread (Java or GC) may still succeed in
>>>> evacuating the object. This is racy, because thread#1 may get a
>>>> from-space copy and write to it, while other threads may get a
>>>> to-space copy and write to that.
>>>>
>>>> We need to prevent any other thread from evacuating our failed-to-evac
>>>> object, or else safely get the other copy.
>>>>
>>>> My idea is to have a counter for the number of threads in the
>>>> evacuation path, and as soon as we hit OOM there, wait until the
>>>> counter drops to zero, at which point we can be sure that the object
>>>> will not get evacuated under our feet.
>>>>
>>>> We need to protect the evacuation path with the following protocol.
>>>> 'The evacuation path' is the code around actual evacuation, i.e.
>>>> inside the evac-in-progress- and cset-checks, but around the actual
>>>> evac. This needs to be done both in fast- and slow-path.
>>>>
>>>> There is a global counter that shows the number of threads inside the
>>>> evac-path, OR a special value (e.g. something negative) to indicate
>>>> OOM-during-evac (i.e. no threads are allowed to enter the path).
>>>>
>>>> Upon entry of the evac-path, every thread attempts to increment the
>>>> counter, using a CAS. Depending on the result of the CAS:
>>>> - success: carry on with evac
>>>> - failure:
>>>>    - if offending value is a valid counter, then try again
>>>>    - if offending value is OOM-during-evac special value: loop until
>>>> counter drops to 0, then exit with read-barrier
>>>>
>>>> Upon exit, every thread decrements the counter using an atomic dec.
>>>>
>>>> Upon OOM-during-evac, the thread attempts to CAS the OOM-during-evac
>>>> special value into the counter. Depending on the result:
>>>> - success: busy-loop until counter drops to zero, then exit with RB
>>>> - failure:
>>>>    - offender is valid counter update: try again
>>>>    - offender is OOM-during-evac: busy loop until counter drops to
>>>> zero, then exit with RB
>>>>
>>>> For Java threads, this protocol needs to be done in the fast
>>>> (assembly) path too, because they can cause evacs. Or else, we could
>>>> decide to disable the fast-path altogether (I was never really sure if
>>>> the extra assembly did us much good).
>>>>
>>>> GC threads don't have to protect every single evacuation, but can
>>>> instead do the protocol wholesale: i.e. enter on worker start, and
>>>> exit on worker done.
>>>>
>>>> Please think hard about this possible solution and try to poke holes
>>>> in it. Meanwhile, I'll come up with a prototype.
>>>>
>>>> Cheers, Roman

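For readers following the protocol quoted above, here is a minimal sketch
of the counter with the extra OOM bit. It is written against std::atomic
rather than HotSpot's own atomic primitives and is not the code from the
linked patch. EvacOOMCounter, OOM_BIT, try_enter(), leave() and
handle_oom() are hypothetical names chosen for illustration, and the
read-barrier resolution itself is assumed to be done by the caller.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical sketch: low 31 bits count threads inside the evac path,
// the top bit flags OOM-during-evac. Names are illustrative only.
class EvacOOMCounter {
  static constexpr uint32_t OOM_BIT = 0x80000000u;
  std::atomic<uint32_t> _state{0};

  // Spin until no thread is left inside the evacuation path.
  void wait_for_no_evac_threads() const {
    while ((_state.load(std::memory_order_acquire) & ~OOM_BIT) != 0) {
      std::this_thread::yield();
    }
  }

public:
  // Returns true if the caller may evacuate. Returns false if OOM has
  // been flagged; by then all evacuating threads have left, so the
  // caller is expected to resolve the object with a read barrier.
  bool try_enter() {
    uint32_t cur = _state.load(std::memory_order_acquire);
    for (;;) {
      if ((cur & OOM_BIT) != 0) {
        wait_for_no_evac_threads();
        return false;
      }
      // On failure, compare_exchange_weak reloads cur and we retry.
      if (_state.compare_exchange_weak(cur, cur + 1,
                                       std::memory_order_acq_rel,
                                       std::memory_order_acquire)) {
        return true;
      }
    }
  }

  // Leave the evacuation path after a successful try_enter().
  void leave() {
    _state.fetch_sub(1, std::memory_order_acq_rel);
  }

  // Called by a thread that is inside the path (it holds one count)
  // and failed to allocate. Sets the OOM bit without losing the
  // counter, drops its own count, then waits for everyone else.
  void handle_oom() {
    _state.fetch_or(OOM_BIT, std::memory_order_acq_rel);
    _state.fetch_sub(1, std::memory_order_acq_rel);
    wait_for_no_evac_threads();
  }

  uint32_t threads_in_path() const {
    return _state.load(std::memory_order_acquire) & ~OOM_BIT;
  }

  bool oom_flagged() const {
    return (_state.load(std::memory_order_acquire) & OOM_BIT) != 0;
  }
};
```

Setting the bit with an atomic OR keeps the thread count intact, which is
the point of masking instead of CASing a full special value. A real
implementation would also need to clear the bit again once the GC cycle
has been cancelled and restarted; that part is left out here.
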

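The difference between entering the protected section per object (as a
Java thread in the write barrier would) and wholesale per worker (as
suggested for GC threads) can be sketched with a small scope guard. This
is only an illustration under simplifying assumptions: EvacPathCounter
and EvacPathScope are hypothetical names, OOM handling is omitted, and
the `transitions` field exists purely to make the cost visible.

```cpp
#include <atomic>
#include <cstddef>

// Simplified stand-in for the shared evac-path counter (no OOM bit).
struct EvacPathCounter {
  std::atomic<int> threads{0};      // threads currently inside the path
  std::atomic<int> transitions{0};  // number of enter() calls, for illustration

  void enter() {
    threads.fetch_add(1);
    transitions.fetch_add(1);
  }
  void leave() {
    threads.fetch_sub(1);
  }
};

// RAII guard: enters the path on construction, leaves on destruction.
class EvacPathScope {
  EvacPathCounter& _counter;
public:
  explicit EvacPathScope(EvacPathCounter& counter) : _counter(counter) {
    _counter.enter();
  }
  ~EvacPathScope() {
    _counter.leave();
  }
  EvacPathScope(const EvacPathScope&) = delete;
  EvacPathScope& operator=(const EvacPathScope&) = delete;
};

// Per-object protection: one enter/leave pair for every object.
void evacuate_per_object(EvacPathCounter& counter, std::size_t num_objects) {
  for (std::size_t i = 0; i < num_objects; ++i) {
    EvacPathScope scope(counter);
    // ... evacuate object i here ...
  }
}

// Wholesale protection: one enter/leave pair for the whole worker task.
void evacuate_wholesale(EvacPathCounter& counter, std::size_t num_objects) {
  EvacPathScope scope(counter);
  for (std::size_t i = 0; i < num_objects; ++i) {
    // ... evacuate object i here ...
  }
}
```

With 100 objects, the per-object variant performs 100 atomic enter/leave
pairs on the shared counter, while the wholesale variant performs one.
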