RFR: JDK-8220671: Initialization race for non-JavaThread PtrQueues
Roman Kennke
rkennke at redhat.com
Thu Mar 21 08:05:27 UTC 2019
Hi Kim,
>>> I have a different approach here:
>>> http://cr.openjdk.java.net/~kbarrett/8221102/expand_counter.00/
>>> If you agree, reassign 8221102 to me and I'll finish this up.
>>
>> This seems a workable solution too.
>> I am not a fan of mechanisms that depend on stuff not overflowing, though. So, at the very least, we should protect against that.
>
> It’s not just overflow / wraparound of the counter that is needed to get into trouble. It also
> requires behavior that seems unlikely, e.g. iterate over superset of threads, then iterate *many*
> times over subset, then iterate over superset again.
>
> But as I said, it’s easy to deal with, and the code is there to do so.
Ok.
>> When I see multithreaded iterators with claiming mechanisms like that (we have a few of those all over Hotspot, e.g. CLD claiming works similarly), it makes me cringe a little. It means that all worker threads are really looking at all items, battling over claim tokens. It would be so much easier and more efficient to just have an array of items and update the iteration pointer using atomic-add. Right? In the case of a linked list, it would have to be updated using CAS. I guess with list sizes of a few dozen to a few thousand (all threads) it doesn't matter much, though.
>
> But then one gets into array management, which is its own hassle. And sequence management
> may become more expensive (e.g. remove some random entry from the array). I ended up going
> down that route for OopStorage (for example) because otherwise the scaling of parallel iteration
> was poor, but it’s not without its own downsides.
Right. Also, it would contend over a single token, whereas the current
approach spreads contention over many tokens.
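Just to illustrate what I mean, here is a minimal hypothetical sketch of the atomic-add style iteration (names made up, std::atomic used instead of HotSpot's Atomic:: primitives for brevity):

#include <atomic>
#include <cstddef>

// Each worker claims the next item by bumping a shared index with a single
// atomic add, instead of every worker visiting every item and racing on
// per-item claim tokens.
template <typename Item, typename DoFn>
void parallel_iterate(Item* items, size_t count,
                      std::atomic<size_t>* next_index, DoFn do_item) {
  for (;;) {
    size_t i = next_index->fetch_add(1, std::memory_order_relaxed);
    if (i >= count) {
      return;            // all items have been handed out
    }
    do_item(items[i]);   // this worker exclusively owns item i
  }
}

The flip side is exactly what you describe: somebody has to maintain that array, and all workers hammer on the single index.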
>> Also, another simple way to fix it is to always scan all threads. That would at least be consistent with what other iterators do (e.g. Threads::threads_do() ). I don't think that performance is a very big issue there.
>
> That was one of my earlier suggestions. I think expanding the size of the claim counter is better though.
Ok.
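Just to make sure we're talking about the same wraparound scenario, this is how I understand the claim-token pattern (hypothetical names, obviously not your actual 8221102 patch):

#include <atomic>
#include <cstdint>

// A global counter is bumped once per parallel iteration; a worker claims an
// item by CAS-ing the item's stored token to the current counter value. If
// the counter is narrow and wraps around, an item that sat out many
// iterations could coincidentally hold the "current" token and get skipped;
// a 64-bit counter makes that practically impossible.
struct ClaimableItem {
  std::atomic<uint64_t> _claim_token{0};
};

class ClaimTokenIteration {
  std::atomic<uint64_t> _global_token{0};
public:
  // Called once before each parallel iteration starts.
  uint64_t begin_iteration() {
    return _global_token.fetch_add(1) + 1;
  }
  // Called by every worker for every item; returns true for exactly one
  // worker per item per iteration.
  bool try_claim(ClaimableItem* item, uint64_t token) {
    uint64_t old = item->_claim_token.load();
    return old != token &&
           item->_claim_token.compare_exchange_strong(old, token);
  }
};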
>>> Note that I haven't tried this against your crashing test yet. I'm
>>> just getting set up to try to reproduce that failure locally. Based
>>> on your reports, it sounds like there might be yet another issue
>>> lurking.
>>
>> Yeah, something's still up, and this makes this particular change hard to verify ;-)
>
> Well, I’ve reproduced the failure. And it failed 2 for 2 (with the baddaddrtest patch applied). And things
> were looking good with this change for a bit; it didn’t fail until the 8th iteration (around the time I was
> reading the above paragraph from your email!), and then hasn’t passed several attempts since.
>
> I think my patch to expand the claim counter is a good solution to 8221102 though, so unless you object
> I’m going to claim that bug and send out that patch.
Sure.
> Still trying to debug the crash though. I haven’t been delving into the string dedup code much before, and
> Shenandoah’s looks pretty different from G1’s. I doubt the race we’ve been discussing for 8220671 is the
> source of the problem though. I will continue investigating.
Hint: The problem mostly appears when many such processes run concurrently;
that probably helps to enlarge the race window. So what I do is have one
terminal window open where I'm running the test in a loop (which keeps
spawning processes), and in another terminal I manually run one process at a
time (giving it the same arguments as jtreg would use), possibly even in a
debugger. That works well and fails almost every run.
I'm now thinking that looking only at the actual SATB queues might be
misleading. If we somehow really failed to mark the object because of
dropped SATB entries (for example), the entry should have been cleared from
the StringDedupQueue in the final-mark pause. However, we see a stale
reference there that points to a reclaimed region. Something else must be
going wrong.
What the previous code did was basically to synchronize the
StringDedupThread with the VMThread (and any other non-Java thread) on
SATB enqueue, via the shared lock. That synchronization is gone with your
change. I wonder if that opens up a race somewhere in the strdedup code,
i.e. whether it covered up a bug that only shows up now?
This scoping here looks fishy (although not outright wrong):
https://paste.fedoraproject.org/paste/o26tktNzT30VEO-rRj-8nQ
Strictly speaking, the oop's scope extends across the safepoint, even
though it's not used across it. But I can't tell for sure: can
the compiler do funny reorderings there?
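For illustration, this is roughly the shape I mean (hypothetical, simplified code, not the actual lines behind the paste; pop_candidate(), deduplicate() and maybe_yield_to_safepoint() are made-up stand-ins):

void process_one_candidate() {
  oop java_string = pop_candidate();  // raw oop, not a Handle
  if (java_string != NULL) {
    deduplicate(java_string);         // last real use of the oop
  }
  maybe_yield_to_safepoint();         // GC may run and reclaim/move objects here
  // java_string is still in scope here, although dead. An extra { } block
  // around the oop would at least make the intended lifetime explicit.
}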
Also, there is a call like this:
deduplicate_shared_strings(&total_stat);
at the very beginning of StringDedupThread::do_deduplication() that is
outside of the STS and can do heap modifications and cause SATB enqueues
too, and therefore should be within the STS, right?
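I.e. something like this (a sketch of what I have in mind; the exact placement inside StringDedupThread::do_deduplication() is my assumption):

{
  SuspendibleThreadSetJoiner sts_join;      // join the STS before touching the heap
  deduplicate_shared_strings(&total_stat);  // may modify the heap and enqueue SATB entries
}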
Roman