RFR: JDK-8220671: Initialization race for non-JavaThread PtrQueues

Tue Mar 19 12:01:10 UTC 2019

>>> The problem we're trying to solve is that (some implementations of)
>>> on_thread_attach needs to be atomic with the associated list addition
>>> wrto safepoints, or more precisely, with an operation performed within
>>> some safepoints.  on_thread_attach is caching global state in the
>>> thread local data, while the operation is updating the global state
>>> and needs to also update the thread local caches.  For G1 and
>>> Shenandoah, it's the SATB activation state; for ZGC it's the bad bit
>>> mask.

Yes, it is not strictly the safepoint that is the issue. But as you say, 
there's state that we're reading concurrently that is otherwise written 
at safepoints (the SATB active flag), and not protected otherwise (e.g. 
by some lock that we could take when reading). There's also global flags 
and state updated at safepoints, and I am not sure we're good with that 
either if we don't let the non-Java-thread synchronize with safepoints 
on attach.

Therefore I think that the way to go is to let such threads synchronize 
with safepoints on on_thread_attach(), and via STS. A Shenandoah-only 
solution that seems to do it:

http://cr.openjdk.java.net/~rkennke/JDK-8220671/webrev.06/

Compared to my other proposals it has the advantage that it's only 
changing GC specific code, and not with any global/shared code. It will 
not introduce STS into paths that don't synchronize this anyway (e.g. CMS).

Unless we come up with something even better, this would be the way to 
go for me. If you agree, I can do the G1 change in the same fashion.

Roman

>>> Possibilities I've been looking at are:
>>>
>>> (1) Conditional STS join around on_thread_attach / add_to_the_list.
>>> Still
>>> thinking about what the right conditional might be though.
>>
>> This seems the only one that works for me so far.
>>
>>> (2) Move on_thread_attach inside the NJTList_lock in add_to_the_list,
>>> and make the relevant GC operation also lock the NJTList_lock.  This
>>> would mean on_thread_attach is conditionally called with JNTList_lock
>>> held, depending on whether the thread is a Java thread or not; such
>>> conditional lock contexts can be difficult.
>>
>> I don't think this is sufficient. NJTList_lock does not coordinate with safe points like Threads_lock does, but we need that, otherwise the whole sequence races over safe points.
> 
> It’s not safepoints that matter, it’s the operation being done in a safepoint, like
> changing the SATB active state.  That’s the “relevant GC operation” that also
> has to lock the NJTList_lock.
> 
>>
>>> (3) I think the reason on_thread_attach preceeds adding to the list is
>>> to prevent SATB state verification from failing.  (Except that doesn't
>>> quite work because of this new-found race.)  So another approach is to
>>> reverse the order and address the verification failure directly.  Add
>>> another "unitialized" initial state.  SATB activation verification
>>> treats "uninitialized" as okay, and the activation state change
>>> unconditionally updates the thread state.  For on_thread_attach,
>>> conditionally set the thread state when the old state is
>>> "uninitialized", using cmpxchg to address the potential race between a
>>> global change and the initialization.  I think something similar could
>>> be done for ZGC.  But this is pretty tightly coupled to properties of
>>> the state being cached.
>>
>> This approach hasn't worked for me. It still races across safe points.
> 
> Again, don’t care about safepoints per se, it’s the interaction between
> on_thread_attach and changing the global SATB active state that matters.
>