RFR: 8225702(13): Safepoint counter can't be used for safepoint detection
David Holmes
david.holmes at oracle.com
Mon Jun 24 21:18:19 UTC 2019
Hi Robbin,
On 18/06/2019 1:43 am, Robbin Ehn wrote:
> Hi David,
>
> On 2019-06-18 06:54, David Holmes wrote:
>> Hi Robbin,
>>
>> On 14/06/2019 6:26 pm, Robbin Ehn wrote:
>>> Hi all, please review.
>>>
>>> This bug fix introduced a new useage of the safepoint counter.
>>> Which does not work with the later pushed faster safepoint code.
>>> https://bugs.openjdk.java.net/browse/JDK-8215889
>>> http://hg.openjdk.java.net/jdk/jdk12/rev/4b469f5f4bf2
>>>
>>> In faster safepoint we must publish the new safepoint counter before
>>> arming
>>> and publish the inactive counter after. Reading the safepoint counter
>>> without
>>> having first looked at your local poll can causes faulty assumption
>>> about
>>> safepoint state.
>>
>> Can you elaborate on exactly how that happens. Naively if you check
>> the counter and its changed then you know there has been a change to
>> the safepoint state.
>
> The bug fix have data which may change during a safepoint, which is okay
> but it's must know that the data is stale.
> PrepareExtraDataClosure::finish()
>
> - VMT: Safepoint start, safepoint counter goes to even+1 => ODD.
> - JT: Reads ODD (reading the safepoint counter without checking it's
> local poll, otherwise it would stop for safepoint)
> - JT: Calls arbitrary code
> - JT: Becomes safepoint safe, e.g. _blocked, but do not stop at the
> waitbarrier.
> - VM: Start disarm, it first sets _state = _not_synchronized;
> - JT: Wakes up, poll is armed but there is no safepoint or handshake.
> - JT: Continues and re-read ODD
> - JT: My data is not stale => boom
>
> This will not happen in your normal t1-7 run.
Okay - thanks for explaining above and offline. I have a much better
handle on the problem, the cause and the proposed solutions.
>>
>>> We add a new safepoint id instead which work reliable and we hide the
>>> safepoint
>>> counter getter. (JNI fast get can still use safepoint counter, so
>>> it's not
>>> completely hidden) And provide a simple class to track safepoint state.
>>
>> Does JNI need to use the counter rather than the new id?
>
> No the safepoint counter is fine, if counter is odd the thread will goto
> slow
> path and hit poll. (it may be a false positive but they would be very rare)
>
> We seem to have some
>> redundancy/overlap between the counter and the id - so do we really
>> need both? Throw in the actual _safepoint_state and we have three bits
>> of state information ...
>
> That is possible, it also possible to mask away the LSB on first read.
> So if the first read of ODD will instead be last even, if false
> positives are okay, when ODD is read on the back-edge.
Right, but as we discussed the false positive may be an issue and we
want to avoid disrupting the existing safepoint-counter usage.
> As it is now the _safepoint_counter more belongs to the WaitBarrier.
> So an alternative is to push the _safepoint_counter into the WaitBarrier
> instead and keep _safepoint_id in SS.
>
> But since this is JDK 13 bug I did the simplest thing.
Yes this is best. To summarise what you and I discussed:
- older usage of safepoint_counter only does "trivial" actions between
reading the counts
- the new usage allows more elaborate code between the reading of the
counts and so hits the current problem
- the new global safepoint-id will fix that code and not disrupt the
other safepoint-counter users.
Okay I'm fine with changes and your explanations.
Thanks,
David
-----
>>
>>> Also it makes no sense for a normal user that JFR/JMC safepoint id
>>> are only odd,
>>> we instead use this new id which is just incremented for each safepoint.
>>>
>>> Code:
>>> http://cr.openjdk.java.net/~rehn/8225702/v1/webrev/index.html
>>
>> src/hotspot/share/runtime/safepoint.hpp
>>
>> Why did you make thread_not_running an instance method?
>
> Since it reads the _safepoint_counter now that I made it private it
> needed access.
>
>>
>> src/hotspot/share/runtime/safepoint.cpp
>>
>> 418 // Record state
>> 419 _state = _synchronized;
>> 420
>> 421 OrderAccess::fence();
>> 422
>> 423 // Set the new id
>> 424 ++_safepoint_id;
>>
>> Under the assumption that the threads/code that uses the new
>> SafepointStateTracker do not participate in the safepoint protocol, is
>> there a risk of seeing a stale _safepoint_id, or racing with the
>> increment? It's not clear to me under what conditions the
>> SafepointStateTracker::safepoint_state_changed() calls will be made.
>
> We write as:
> state = synch
> id++
> state = unsynch // read state here
> state = synch
> id++ // read id here
> state = unsynch // re-read state here
> // re-read id here
>
> This would give a false negative, so it would not work for a JT in
> native/blocked or a NonJavaThread. With doing a 'stable load' thus
> reading id
> twice. So we should assert this.
>
> Encoding state and counter in the same primitive which can be atomically
> updated, so we never see the same value in a long enough period as you
> suggested
> fixes this. I believe doing the same for JavaThread state can also simplify
> somethings.
>
> But I don't think we should do that as a bug fix for 13.
> And I was going to use the safepoint id in logging as gc does.
> So even if we don't use _safepoint_id to track safepoint state in later JDK
> we have pretty numbers in logs and JFR.
>
> Thanks, Robbin
>
>>
>> Thanks,
>> David
>> -----
>>
>>> Issue:
>>> https://bugs.openjdk.java.net/browse/JDK-8225702
>>>
>>> Passes JFR test and t1-3.
>>>
>>> Thanks, Robbin
More information about the hotspot-jfr-dev
mailing list