RFR: Implement protocol for safe OOM during evacuation handling
Roman Kennke
roman at kennke.org
Mon Mar 5 15:58:18 UTC 2018
Am 05.03.2018 um 16:41 schrieb Zhengyu Gu:
>
>
> On 03/05/2018 10:29 AM, Roman Kennke wrote:
>> Am 05.03.2018 um 16:08 schrieb Zhengyu Gu:
>>> Hi Roman,
>>>
>>>> ShenandoahOOMDuringEvacScope
>>>> - The protocol has been designed to allow repeated calls into
>>>> evac_object even if OOM-during-evac is active:
>>>> - workers may work in strides. OOMing at one object doesn't mean
>>>> it's
>>>> not attempted again for the next
>>>> - write-barriers may return, and go into another write-barrier
>>>> before
>>>> reaching a safepoint
>>>>
>>>> ... it is ok to do that, the protocol will lead to simple+safe RB when
>>>> calling into evac_object() again.
>>>>
>>>> - There are situations when we need to *leave* the scope. Most
>>>> importantly, workers (in partial and traversal) need to signal the
>>>> terminator that they are ready, which will cause them to wait other
>>>> workers to finish, and in which they will *not* be able to give up the
>>>> OOM-counter. We must leave the scope before signalling the terminator,
>>>> and we have ShenandoahOOMDuringEvacScopeLeaver for that. There are a
>>>> few
>>>> other situations where we need to leave the scope to avoid nested
>>>> scoping. Leaving the scope like this is ok because of the above
>>>> mentioned design to allow repeated calls into the protocol.
>>>
>>> This sounds suspicious ... you have counter that drops to 0, then comes
>>> back up, I think there can have race here.
>>>
>>> shenandoahTraversalGC.cpp
>>>
>>> 483 for (uint i = 0; i < stride; i++) {
>>> 484 if ((q->pop_buffer(task) ||
>>> 485 q->pop_local(task) ||
>>> 486 q->pop_overflow(task) ||
>>> 487 (DO_SATB &&
>>> satb_mq_set.apply_closure_to_completed_buffer(&satb_cl) &&
>>> q->pop_buffer(task)) ||
>>> 488 queues->steal(worker_id, &seed, task))) {
>>> 489 conc_mark->do_task<T, true>(q, cl, live_data, &task);
>>> 490 } else {
>>> 491 ShenandoahOOMDuringEvacScopeLeaver oom_scope_leaver;
>>> 492 if (terminator->offer_termination()) return;
>>> 493 }
>>>
>>>
>>> E.g. L#491 counter drops to 0 -> WB returns -> fails to terminate -> it
>>> can evacuate again?
>>>
>>
>> I don't think so. After the protocol is done, it will keep up the
>> OOM_MARKER_MASK until it's cleared during safepoint. This is checked
>> upon entry of the protocol, and does a fast-path-return after setting up
>> the flags to return with RB.
>>
>> If it happens to enter repeatedly while the protocol is in progress, it
>> will participate in it as normal, e.g. will be blocked to enter until
>> the protocol is done, and then observe the OOM_MARKER_MASK and return
>> with RB as above.
>>
>> It must be ok to enter into the protocol repeatedly without race. If you
>> find a race please describe it to me.
>>
> What I described above, is an example, no?
>
> L#491 drops counter to 0, so WB exit.
> L#492 fails to terminate, re-enters protocol. However, there is no check
> if OOM protocol is in progress, then goes to L#484, and find some thing ...
There is a check upon (re-)entry:
https://paste.fedoraproject.org/paste/dVol6OSw8S37dH5RBNztMQ
Which means it raises the TL-flag again in wait_for_no_evac_flag() and
thus cause any encapsulated call to evac_obj() to return with RB. The
actual wait loop will not be entered again if the protocol was already done.
*If* this happens while the protocol is started or in-progress, it will
do the usual CAS counting, which should prevent races.
More information about the shenandoah-dev
mailing list