RFR: 8225702(13): Safepoint counter can't be used for safepoint detection

Tue Jun 18 08:43:10 UTC 2019

Hi David,

On 2019-06-18 06:54, David Holmes wrote:
> Hi Robbin,
> 
> On 14/06/2019 6:26 pm, Robbin Ehn wrote:
>> Hi all, please review.
>>
>> This bug fix introduced a new useage of the safepoint counter.
>> Which does not work with the later pushed faster safepoint code.
>> https://bugs.openjdk.java.net/browse/JDK-8215889
>> http://hg.openjdk.java.net/jdk/jdk12/rev/4b469f5f4bf2
>>
>> In faster safepoint we must publish the new safepoint counter before arming
>> and publish the inactive counter after. Reading the safepoint counter without
>> having first looked at your local poll can causes faulty assumption about
>> safepoint state.
> 
> Can you elaborate on exactly how that happens. Naively if you check the counter 
> and its changed then you know there has been a change to the safepoint state.

The bug fix have data which may change during a safepoint, which is okay but 
it's must know that the data is stale.
PrepareExtraDataClosure::finish()

- VMT: Safepoint start, safepoint counter goes to even+1 => ODD.
- JT: Reads ODD (reading the safepoint counter without checking it's local poll, 
otherwise it would stop for safepoint)
- JT: Calls arbitrary code
- JT: Becomes safepoint safe, e.g. _blocked, but do not stop at the waitbarrier.
- VM: Start disarm, it first sets _state = _not_synchronized;
- JT: Wakes up, poll is armed but there is no safepoint or handshake.
- JT: Continues and re-read ODD
- JT: My data is not stale => boom

This will not happen in your normal t1-7 run.

> 
>> We add a new safepoint id instead which work reliable and we hide the safepoint
>> counter getter. (JNI fast get can still use safepoint counter, so it's not
>> completely hidden) And provide a simple class to track safepoint state.
> 
> Does JNI need to use the counter rather than the new id? 

No the safepoint counter is fine, if counter is odd the thread will goto slow
path and hit poll. (it may be a false positive but they would be very rare)

We seem to have some
> redundancy/overlap between the counter and the id - so do we really need both? 
> Throw in the actual _safepoint_state and we have three bits of state information 
> ...

That is possible, it also possible to mask away the LSB on first read.
So if the first read of ODD will instead be last even, if false positives are
okay, when ODD is read on the back-edge.

As it is now the _safepoint_counter more belongs to the WaitBarrier.
So an alternative is to push the _safepoint_counter into the WaitBarrier instead
and keep _safepoint_id in SS.

But since this is JDK 13 bug I did the simplest thing.

> 
>> Also it makes no sense for a normal user that JFR/JMC safepoint id are only odd,
>> we instead use this new id which is just incremented for each safepoint.
>>
>> Code:
>> http://cr.openjdk.java.net/~rehn/8225702/v1/webrev/index.html
> 
> src/hotspot/share/runtime/safepoint.hpp
> 
> Why did you make thread_not_running an instance method?

Since it reads the _safepoint_counter now that I made it private it needed access.

> 
> src/hotspot/share/runtime/safepoint.cpp
> 
>   418   // Record state
>   419   _state = _synchronized;
>   420
>   421   OrderAccess::fence();
>   422
>   423   // Set the new id
>   424   ++_safepoint_id;
> 
> Under the assumption that the threads/code that uses the new 
> SafepointStateTracker do not participate in the safepoint protocol, is there a 
> risk of seeing a stale _safepoint_id, or racing with the increment? It's not 
> clear to me under what conditions the 
> SafepointStateTracker::safepoint_state_changed() calls will be made.

We write as:
state = synch
id++
state = unsynch  // read state here
state = synch
id++             // read id here
state = unsynch  // re-read state here
                  // re-read id here

This would give a false negative, so it would not work for a JT in
native/blocked or a NonJavaThread. With doing a 'stable load' thus reading id
twice. So we should assert this.

Encoding state and counter in the same primitive which can be atomically
updated, so we never see the same value in a long enough period as you suggested
fixes this. I believe doing the same for JavaThread state can also simplify
somethings.

But I don't think we should do that as a bug fix for 13.
And I was going to use the safepoint id in logging as gc does.
So even if we don't use _safepoint_id to track safepoint state in later JDK
we have pretty numbers in logs and JFR.

Thanks, Robbin

> 
> Thanks,
> David
> -----
> 
>> Issue:
>> https://bugs.openjdk.java.net/browse/JDK-8225702
>>
>> Passes JFR test and t1-3.
>>
>> Thanks, Robbin