RFR: 8225702(13): Safepoint counter can't be used for safepoint detection

Mon Jun 24 21:18:19 UTC 2019

Hi Robbin,

On 18/06/2019 1:43 am, Robbin Ehn wrote:
> Hi David,
> 
> On 2019-06-18 06:54, David Holmes wrote:
>> Hi Robbin,
>>
>> On 14/06/2019 6:26 pm, Robbin Ehn wrote:
>>> Hi all, please review.
>>>
>>> This bug fix introduced a new useage of the safepoint counter.
>>> Which does not work with the later pushed faster safepoint code.
>>> https://bugs.openjdk.java.net/browse/JDK-8215889
>>> http://hg.openjdk.java.net/jdk/jdk12/rev/4b469f5f4bf2
>>>
>>> In faster safepoint we must publish the new safepoint counter before 
>>> arming
>>> and publish the inactive counter after. Reading the safepoint counter 
>>> without
>>> having first looked at your local poll can causes faulty assumption 
>>> about
>>> safepoint state.
>>
>> Can you elaborate on exactly how that happens. Naively if you check 
>> the counter and its changed then you know there has been a change to 
>> the safepoint state.
> 
> The bug fix have data which may change during a safepoint, which is okay 
> but it's must know that the data is stale.
> PrepareExtraDataClosure::finish()
> 
> - VMT: Safepoint start, safepoint counter goes to even+1 => ODD.
> - JT: Reads ODD (reading the safepoint counter without checking it's 
> local poll, otherwise it would stop for safepoint)
> - JT: Calls arbitrary code
> - JT: Becomes safepoint safe, e.g. _blocked, but do not stop at the 
> waitbarrier.
> - VM: Start disarm, it first sets _state = _not_synchronized;
> - JT: Wakes up, poll is armed but there is no safepoint or handshake.
> - JT: Continues and re-read ODD
> - JT: My data is not stale => boom
> 
> This will not happen in your normal t1-7 run.

Okay - thanks for explaining above and offline. I have a much better 
handle on the problem, the cause and the proposed solutions.

>>
>>> We add a new safepoint id instead which work reliable and we hide the 
>>> safepoint
>>> counter getter. (JNI fast get can still use safepoint counter, so 
>>> it's not
>>> completely hidden) And provide a simple class to track safepoint state.
>>
>> Does JNI need to use the counter rather than the new id? 
> 
> No the safepoint counter is fine, if counter is odd the thread will goto 
> slow
> path and hit poll. (it may be a false positive but they would be very rare)
> 
> We seem to have some
>> redundancy/overlap between the counter and the id - so do we really 
>> need both? Throw in the actual _safepoint_state and we have three bits 
>> of state information ...
> 
> That is possible, it also possible to mask away the LSB on first read.
> So if the first read of ODD will instead be last even, if false 
> positives are okay, when ODD is read on the back-edge.

Right, but as we discussed the false positive may be an issue and we 
want to avoid disrupting the existing safepoint-counter usage.

> As it is now the _safepoint_counter more belongs to the WaitBarrier.
> So an alternative is to push the _safepoint_counter into the WaitBarrier 
> instead and keep _safepoint_id in SS.
> 
> But since this is JDK 13 bug I did the simplest thing.

Yes this is best. To summarise what you and I discussed:
- older usage of safepoint_counter only does "trivial" actions between 
reading the counts
- the new usage allows more elaborate code between the reading of the 
counts and so hits the current problem
- the new global safepoint-id will fix that code and not disrupt the 
other safepoint-counter users.

Okay I'm fine with changes and your explanations.

Thanks,
David
-----

>>
>>> Also it makes no sense for a normal user that JFR/JMC safepoint id 
>>> are only odd,
>>> we instead use this new id which is just incremented for each safepoint.
>>>
>>> Code:
>>> http://cr.openjdk.java.net/~rehn/8225702/v1/webrev/index.html
>>
>> src/hotspot/share/runtime/safepoint.hpp
>>
>> Why did you make thread_not_running an instance method?
> 
> Since it reads the _safepoint_counter now that I made it private it 
> needed access.
> 
>>
>> src/hotspot/share/runtime/safepoint.cpp
>>
>>   418   // Record state
>>   419   _state = _synchronized;
>>   420
>>   421   OrderAccess::fence();
>>   422
>>   423   // Set the new id
>>   424   ++_safepoint_id;
>>
>> Under the assumption that the threads/code that uses the new 
>> SafepointStateTracker do not participate in the safepoint protocol, is 
>> there a risk of seeing a stale _safepoint_id, or racing with the 
>> increment? It's not clear to me under what conditions the 
>> SafepointStateTracker::safepoint_state_changed() calls will be made.
> 
> We write as:
> state = synch
> id++
> state = unsynch  // read state here
> state = synch
> id++             // read id here
> state = unsynch  // re-read state here
>                   // re-read id here
> 
> This would give a false negative, so it would not work for a JT in
> native/blocked or a NonJavaThread. With doing a 'stable load' thus 
> reading id
> twice. So we should assert this.
> 
> Encoding state and counter in the same primitive which can be atomically
> updated, so we never see the same value in a long enough period as you 
> suggested
> fixes this. I believe doing the same for JavaThread state can also simplify
> somethings.
> 
> But I don't think we should do that as a bug fix for 13.
> And I was going to use the safepoint id in logging as gc does.
> So even if we don't use _safepoint_id to track safepoint state in later JDK
> we have pretty numbers in logs and JFR.
> 
> Thanks, Robbin
> 
>>
>> Thanks,
>> David
>> -----
>>
>>> Issue:
>>> https://bugs.openjdk.java.net/browse/JDK-8225702
>>>
>>> Passes JFR test and t1-3.
>>>
>>> Thanks, Robbin