RFR: JDK-8220671: Initialization race for non-JavaThread PtrQueues

Tue Mar 19 07:16:49 UTC 2019

> On Mar 19, 2019, at 2:48 AM, Roman Kennke <rkennke at redhat.com> wrote:
> 
> 
> 
> 
>>> Ok, I think I have a fix candidate:
>>> http://cr.openjdk.java.net/~rkennke/JDK-8220671/webrev.05/
>>> 
>>> I added a method BS::needs_sts_sync_on_attach_detach() so that GCs
>> can
>>> avoid using this additional synchronization when they are not
>>> implementing attach/detach or not using STS.
>>> 
>>> This passes tier1 and the offending Shenandoah testcase locally. I
>> will
>>> push it through jdk/submit.
>>> 
>>> What do you think? Can I get reviews on this? Or do you have
>> something
>>> better in mind?
>>> 
>>> Thanks,
>>> Roman
>> 
>> I think ZGC has the same race with the thread-local cache of the bad
>> address mask.
> 
> :-S

But Per thinks the non-Java threads still don’t use that value (they weren’t
before, and that hasn’t changed, though might in the future?)

> 
>> The needs_std_sync_on_attach_detach function seems to be just an
>> optimization, that we should consider (or not) once we're confident we
>> have a working solution.
>> 
>> I don't think any GC presently needs anything done on the detach side. 
> 
> Not sure. We might miss the final flush() or rather end up flushing on the wrong side of a safepoint.

Note that very few NJTs ever exit at all.  And those which do are during process
teardown, so not really relevant for global SATB active state change (for example).

>> Using the type of the thread was one idea I was exploring.  I was
>> thinking about using is_ConcurrentGC_thread() though.  Note that some
>> worker threads are ConcurrentGC threads.
> 
> OK. Which concgc threads are worker threads too?

ZGC uses workgang threads configured as “concurrent”.  There’s an argument for
workgang construction indicating whether the associated threads are concurrent
or not.  I *think* there are others, but not in a position to search for them right now.

> 
>> The problem we're trying to solve is that (some implementations of)
>> on_thread_attach needs to be atomic with the associated list addition
>> wrto safepoints, or more precisely, with an operation performed within
>> some safepoints.  on_thread_attach is caching global state in the
>> thread local data, while the operation is updating the global state
>> and needs to also update the thread local caches.  For G1 and
>> Shenandoah, it's the SATB activation state; for ZGC it's the bad bit
>> mask.
>> 
>> Possibilities I've been looking at are:
>> 
>> (1) Conditional STS join around on_thread_attach / add_to_the_list. 
>> Still
>> thinking about what the right conditional might be though.
> 
> This seems the only one that works for me so far.
> 
>> (2) Move on_thread_attach inside the NJTList_lock in add_to_the_list,
>> and make the relevant GC operation also lock the NJTList_lock.  This
>> would mean on_thread_attach is conditionally called with JNTList_lock
>> held, depending on whether the thread is a Java thread or not; such
>> conditional lock contexts can be difficult.
> 
> I don't think this is sufficient. NJTList_lock does not coordinate with safe points like Threads_lock does, but we need that, otherwise the whole sequence races over safe points.

It’s not safepoints that matter, it’s the operation being done in a safepoint, like
changing the SATB active state.  That’s the “relevant GC operation” that also
has to lock the NJTList_lock.

> 
>> (3) I think the reason on_thread_attach preceeds adding to the list is
>> to prevent SATB state verification from failing.  (Except that doesn't
>> quite work because of this new-found race.)  So another approach is to
>> reverse the order and address the verification failure directly.  Add
>> another "unitialized" initial state.  SATB activation verification
>> treats "uninitialized" as okay, and the activation state change
>> unconditionally updates the thread state.  For on_thread_attach,
>> conditionally set the thread state when the old state is
>> "uninitialized", using cmpxchg to address the potential race between a
>> global change and the initialization.  I think something similar could
>> be done for ZGC.  But this is pretty tightly coupled to properties of
>> the state being cached.
> 
> This approach hasn't worked for me. It still races across safe points.

Again, don’t care about safepoints per se, it’s the interaction between
on_thread_attach and changing the global SATB active state that matters.