RFR: JDK-8220671: Initialization race for non-JavaThread PtrQueues
Kim Barrett
kim.barrett at oracle.com
Wed Mar 20 19:32:21 UTC 2019
> On Mar 19, 2019, at 6:56 PM, Roman Kennke <rkennke at redhat.com> wrote:
>
>
>
> Am 19.03.19 um 23:50 schrieb Kim Barrett:
>>> On Mar 19, 2019, at 5:51 PM, Roman Kennke <rkennke at redhat.com> wrote:
>>>
>>> See:
>>> https://bugs.openjdk.java.net/browse/JDK-8221102
>>>
>>> I believe G1 is also affected by this problem, as it seems to flush SATB queues in pretty much the same way as Shenandoah:
>>>
>>> http://hg.openjdk.java.net/jdk/jdk/file/ddfb658c8ce3/src/hotspot/share/gc/g1/g1ConcurrentMark.cpp#l1776
>>>
>>> Please check it, and update as appropriate. I'm working on a fix.
>>>
>>> Roman
>>
>> I did some looking at the claiming mechanism, and I agree there's an
>> inconsistency of usage that can mess things up.
>>
>> We have possibly_parallel_threads_do, which only looks at Java threads
>> and the VM thread. Other threads don't get examined, and so don't
>> have their claim state updated.
>>
>> We also have callers of Threads::threads_do where the closure's
>> do_thread conditionalizes its operation on a call to claim_oops_do.
>> So all threads are examined and their claim state is updated.
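
To make the contrast concrete, the two patterns look roughly like this
(simplified sketches from memory, not the exact HotSpot code):

// Pattern A: possibly_parallel_threads_do visits only JavaThreads and
// the VM thread, claiming each one before applying the closure.
void Threads::possibly_parallel_threads_do(bool is_par, ThreadClosure* tc) {
  int cp = Threads::thread_claim_parity();
  ALL_JAVA_THREADS(p) {
    if (p->claim_oops_do(is_par, cp)) {
      tc->do_thread(p);
    }
  }
  VMThread* vmt = VMThread::vm_thread();
  if (vmt->claim_oops_do(is_par, cp)) {
    tc->do_thread(vmt);
  }
}

// Pattern B: Threads::threads_do visits *every* thread, including
// non-JavaThreads; the caller's closure does the claiming itself.
class ClaimingClosure : public ThreadClosure {
  void do_thread(Thread* t) {
    if (t->claim_oops_do(true, Threads::thread_claim_parity())) {
      // ... process t ...
    }
  }
};
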
>>
>> It seems like the right (or should I say wrong) sequence of operations
>> could cause an NJT (non-JavaThread) being visited by Threads::threads_do
>> to appear to have already been claimed from the get-go, and so be
>> skipped by all of the parallel claiming threads.
>>
>> Sprinkling more assert_all_threads_claimed won't currently help detect
>> this, since it too only examines Java threads and the VM thread.
>>
>> It looks like an even number of uses of possibly_parallel_threads_do
>> between a pair of Threads::threads_do with explicit claims will
>> confuse the second Threads::threads_do.
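
To spell out that failure mode, here's a toy, self-contained model (not
HotSpot code) of the two-valued protocol; claim() mirrors the
claim_oops_do rule that a thread is claimable iff its token differs
from the global parity, and exactly which caller performs each parity
flip is deliberately left out:

#include <cstdio>

static int g_parity = 1;                      // global claim parity: 1 or 2
static void flip_parity() { g_parity = (g_parity == 1) ? 2 : 1; }

struct ToyThread { int token = 0; };          // 0 == never claimed

static bool claim(ToyThread& t) {
  if (t.token == g_parity) return false;      // appears already claimed
  t.token = g_parity;                         // (the real code uses a CAS)
  return true;
}

int main() {
  ToyThread njt;                              // e.g. a StringDedupThread

  flip_parity();
  claim(njt);                                 // claimed by a superset iteration

  flip_parity();                              // iterations that never visit the
  flip_parity();                              // NJT still flip the parity ...

  // ... and after an even number of flips the parity matches the NJT's
  // stale token again, so the next superset iteration wrongly skips it.
  printf("claimable: %s\n", claim(njt) ? "yes" : "no");   // prints "no"
  return 0;
}
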
>>
>> And of course, there's the problem that Shenandoah is using
>> possibly_parallel_threads_do in a place where skipping the
>> StringDedupThread is bad.
>>
>> Yikes! How is this ever working at all?
>>
>> I'm not sure why possibly_parallel_threads_do applies to a subset of
>> the threads. One solution would be to change that. But there are
>> users of it that maybe don't expect other kinds of threads and might
>> not want to pay extra to examine and ignore them. Another would be
>> to have two different claim fields for the two different iterators.
>> They could be packed into the same thread member at some small cost,
>> though looking at where the current member is located, adding a second
>> member there should not cost any space on 64-bit platforms.
>
> My current idea goes roughly like this (it includes some Shenandoah mess
> that will not be there in the final webrev):
>
> http://cr.openjdk.java.net/~rkennke/JDK-8221102.patch
>
> However, this *still* doesn't solve my crashing testcase. Digging even
> deeper...
>
> Roman

I have a different approach here:
http://cr.openjdk.java.net/~kbarrett/8221102/expand_counter.00/

The idea is that there's no fundamental reason for the claim token to
be limited to a single-bit counter plus an "uninitialized" value, as is
done in the current code. Instead, we can expand the counter to a full
uint (or even uintx, given the alignment of the surrounding members),
with zero reserved for "uninitialized".
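
Roughly, the per-thread claim check then becomes something like the
following (an illustrative sketch using standard C++ atomics rather
than HotSpot's Atomic API; the names here are made up, not the ones in
the webrev):

#include <atomic>
#include <cstdint>

static uintptr_t g_claim_token = 1;            // global token, bumped per iteration

struct ToyThread {
  std::atomic<uintptr_t> threads_do_token{0};  // zero == never claimed

  // Returns true if the calling worker claims this thread for the
  // iteration identified by 'token'.
  bool claim_threads_do(uintptr_t token) {
    uintptr_t cur = threads_do_token.load();
    while (cur != token) {
      if (threads_do_token.compare_exchange_weak(cur, token)) {
        return true;                           // we won the claim
      }
      // compare_exchange_weak refreshed 'cur'; re-check it against 'token'.
    }
    return false;                              // already claimed for this iteration
  }
};
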
On counter overflow, as with the current code, there is a risk of
collision because of different thread sets being processed. But with
a full-width counter that's really only theoretical; it takes a long
time to overflow, especially with a 64-bit counter, and the number of
subset iterations since the last superset iteration would need to reach
the maximum counter value. With the current 1-bit counter we just need
an even number of subset iterations between superset iterations to get
into trouble. That threshold grows with the counter size, and real
code (probably) doesn't behave that way.

And it's pretty easy to deal with even that remote possibility, by
just resetting all the thread tokens to zero on global counter
overflow.
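
Continuing the sketch above, the overflow handling could be as simple
as this (again illustrative; for_each_thread is a hypothetical helper
standing in for an iteration over every Thread):

// Bump the global token before each thread-iteration phase.
static void change_claim_token() {
  if (++g_claim_token == 0) {                  // wrapped (only a theoretical concern)
    // Zero is reserved for "uninitialized", so on overflow reset every
    // thread's token and restart the counter at 1.
    for_each_thread([](ToyThread* t) { t->threads_do_token.store(0); });
    g_claim_token = 1;
  }
}
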
If you agree, reassign 8221102 to me and I'll finish this up.

Note that I haven't tried this against your crashing test yet. I'm
just getting set up to try to reproduce that failure locally. Based
on your reports, it sounds like there might be yet another issue
lurking.