RFR: Implement protocol for safe OOM during evacuation handling

Mon Mar 5 15:58:18 UTC 2018

Am 05.03.2018 um 16:41 schrieb Zhengyu Gu:
> 
> 
> On 03/05/2018 10:29 AM, Roman Kennke wrote:
>> Am 05.03.2018 um 16:08 schrieb Zhengyu Gu:
>>> Hi Roman,
>>>
>>>> ShenandoahOOMDuringEvacScope
>>>> - The protocol has been designed to allow repeated calls into
>>>> evac_object even if OOM-during-evac is active:
>>>>     - workers may work in strides. OOMing at one object doesn't mean
>>>> it's
>>>> not attempted again for the next
>>>>     - write-barriers may return, and go into another write-barrier
>>>> before
>>>> reaching a safepoint
>>>>
>>>> ... it is ok to do that, the protocol will lead to simple+safe RB when
>>>> calling into evac_object() again.
>>>>
>>>> - There are situations when we need to *leave* the scope. Most
>>>> importantly, workers (in partial and traversal) need to signal the
>>>> terminator that they are ready, which will cause them to wait other
>>>> workers to finish, and in which they will *not* be able to give up the
>>>> OOM-counter. We must leave the scope before signalling the terminator,
>>>> and we have ShenandoahOOMDuringEvacScopeLeaver for that. There are a
>>>> few
>>>> other situations where we need to leave the scope to avoid nested
>>>> scoping. Leaving the scope like this is ok because of the above
>>>> mentioned design to allow repeated calls into the protocol.
>>>
>>> This sounds suspicious ... you have counter that drops to 0, then comes
>>> back up, I think there can have race here.
>>>
>>> shenandoahTraversalGC.cpp
>>>
>>>   483     for (uint i = 0; i < stride; i++) {
>>>   484       if ((q->pop_buffer(task) ||
>>>   485            q->pop_local(task) ||
>>>   486            q->pop_overflow(task) ||
>>>   487            (DO_SATB &&
>>> satb_mq_set.apply_closure_to_completed_buffer(&satb_cl) &&
>>> q->pop_buffer(task)) ||
>>>   488            queues->steal(worker_id, &seed, task))) {
>>>   489         conc_mark->do_task<T, true>(q, cl, live_data, &task);
>>>   490       } else {
>>>   491         ShenandoahOOMDuringEvacScopeLeaver oom_scope_leaver;
>>>   492         if (terminator->offer_termination()) return;
>>>   493       }
>>>
>>>
>>> E.g. L#491 counter drops to 0 -> WB returns -> fails to terminate -> it
>>> can evacuate again?
>>>
>>
>> I don't think so. After the protocol is done, it will keep up the
>> OOM_MARKER_MASK until it's cleared during safepoint. This is checked
>> upon entry of the protocol, and does a fast-path-return after setting up
>> the flags to return with RB.
>>
>> If it happens to enter repeatedly while the protocol is in progress, it
>> will participate in it as normal, e.g. will be blocked to enter until
>> the protocol is done, and then observe the OOM_MARKER_MASK and return
>> with RB as above.
>>
>> It must be ok to enter into the protocol repeatedly without race. If you
>> find a race please describe it to me.
>>
> What I described above, is an example, no?
> 
> L#491 drops counter to 0, so WB exit.
> L#492 fails to terminate, re-enters protocol. However, there is no check
> if OOM protocol is in progress, then goes to L#484, and find some thing ...

There is a check upon (re-)entry:

https://paste.fedoraproject.org/paste/dVol6OSw8S37dH5RBNztMQ

Which means it raises the TL-flag again in wait_for_no_evac_flag() and
thus cause any encapsulated call to evac_obj() to return with RB. The
actual wait loop will not be entered again if the protocol was already done.

*If* this happens while the protocol is started or in-progress, it will
do the usual CAS counting, which should prevent races.