RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Daniel D. Daugherty daniel.daugherty at oracle.com
Wed Sep 2 01:56:26 UTC 2015


On 9/1/15 7:43 PM, David Holmes wrote:
> Hi Dan,
>
> On 2/09/2015 2:56 AM, Daniel D. Daugherty wrote:
>> On 9/1/15 8:49 AM, Daniel D. Daugherty wrote:
>>> On 9/1/15 12:13 AM, David Holmes wrote:
>>>> Hi Dan,
>>>>
>>>> Hope you had that coffee :)
>>>
>>> Just got that second cup... :-)
>
> You might need to upgrade to something stronger :)

If I do that this late (in my TZ), I won't sleep... :-)


>
>>>>
>>>> On 1/09/2015 8:51 AM, Daniel D. Daugherty wrote:
>>>>> Greetings,
>>>>>
>>>>> I've updated the "fix" for this bug based on code review comments
>>>>> received in round 0.
>>>>>
>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>
>>>>> Webrev URL:
>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/
>>>>>
>>>>> The easiest way to re-review is to download the two patch files
>>>>> and view them in your favorite file merge tool:
>>>>>
>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch 
>>>>>
>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/hotspot.patch 
>>>>>
>>>>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>>>>           Aurora Adhoc vm.tmtools batch (in process)
>>>>>           Kim's repro sequence for JDK-8049304
>>>>>           Kim's repro sequence for JDK-8129978
>>>>>           JPRT -testset hotspot
>>>>>
>>>>> Changes between round 0 and round 1:
>>>>>
>>>>> - add an 'is_safe' parameter to the OM_PERFDATA_OP macro;
>>>>
>>>> The sense of safepoint-safe is a bit confusing to me. If the thread
>>>> is in a safepoint-safe-state then it seems safepoint-safe==false ??
>>>
>>> I went back and forth on the naming of this parameter. When 'is_safe'
>>> is true, the calling thread can safely and directly check has_PerfData
>>> because we are not at a safepoint so we can't be racing with a normal
>>> exit. In a normal exit, has_PerfData is only changed (and memory freed)
>>> at a safepoint.
>>>
>>> When 'is_safe' is false, then the code path we're on could be executing
>>> in parallel with a safepoint so we need to be more careful with 
>>> accessing
>>> the has_PerfData flag...
>>>
>>> If you have a better name for the parameter...
>
> I think not referring to it as "safepoint-safe" at the call sites 
> would help, but as per below I think is-safe completely unnecessary.

OK, so go back to what I had before: just has_PerfData() doing a
direct flag check, and drop the whole has_PerfData_with_acquire().


>
>>>>
>>>>>    safepoint-safe callers can access _has_PerfData flag directly;
>>>>>    non-safepoint-safe callers use a load-acquire to fetch the
>>>>>    current _has_PerfData flag value
>>>>
>>>> load-acquire makes no difference to the "currentness" of
> _has_PerfData and has no effect on the size of the "race window". So
>>>> I don't see any point to this is-safe distinction.
>>>
>>> Hmmm... I think I got this idea when I re-read orderAccess.hpp where
>>> it talks about release-store and load-acquire pairs. I originally
>>> had the transition of _has_PerfData from '1' -> '0' done by a
>>> release_store() so I was using a load_acquire() as part of the
>>> pair. I've since switched to a direct setting of "_has_PerfData = 0"
>>> followed by the big hammer fence(), but I think the load_acquire()
>>> still pairs with that fence().
>>
>> I rewrote the comment above the macro:
>>
>> $ hg diff src/share/vm/runtime/objectMonitor.hpp
>> diff -r efc17f03e5d4 src/share/vm/runtime/objectMonitor.hpp
>> --- a/src/share/vm/runtime/objectMonitor.hpp    Thu Aug 20 10:58:57 2015
>> -0700
>> +++ b/src/share/vm/runtime/objectMonitor.hpp    Tue Sep 01 09:54:53 2015
>> -0700
>> @@ -177,6 +177,33 @@ class ObjectMonitor {
>>
>>    public:
>>     static void Initialize();
>> +
>> +  // Only perform a PerfData operation if the PerfData object has been
>> +  // allocated and if the PerfDataManager has not freed the PerfData
>> +  // objects which can happen at normal VM shutdown. If is_safe == true
>> +  // then the calling context does not execute in parallel with a
>> +  // safepoint and we can simply query the flag. Otherwise, the calling
>> +  // context may be executing in parallel with a safepoint so we use
>> +  // OrderAccess::load_acquire() to pair with the code that may be
>> +  // setting the has_PerfData flag to false.
>
> The whole point of the release/acquire pairings is to ensure that when 
> you do the load_acquire and see the magic value (be it 1 or 0) then 
> everything prior to the release_store that set that magic value, is 
> also visible. But here when we set _has_PerfData=0 there is nothing 
> prior to that which the code checking _has_PerfData==0 cares about. The
> normal usage pattern has the form:
>
> writer:
>   initialize_data();
>   release_store(dataReady, true)
>
> reader:
>   if (load_acquire(dataReady) == true)
>     safely_use_data();
>
> but that isn't the pattern here, so the load_acquire (and the fence()) 
> serves no actual purpose. So the whole is-safe distinction really 
> serves no purpose in my opinion.

So drop the fence() entirely and rely on just the sleep call.


>
> The reason why I object more strongly to the fence() versus the 
> release_store(), is that use of the fence() seems (to me) a stronger 
> statement of requirement "hey we really do need this full 
> bidirectional fence". Further originally I hadn't realized what Tom(?) 
> pointed out regarding the fact that the release_store puts the barrier 
> before the store and we want the barrier after the store. Then 
> followed the discussion that what we need is storeload|storestore to 
> maintain the ordering (ignore the os::sleep()). But that somehow has 
> morphed into a full fence() - which I don't see the need for.

OK, so I'll look at making the above changes and put out another
webrev for code review round 2. Don't know if Tom or Kim are on
board with where you want this to go, but if not... we'll do this
again...

Dan


>
> Cheers,
> David
> -----
>
>> +  //
>> +  #define OM_PERFDATA_OP(f, op_str, is_safe)                        \
>> +    do {                                                            \
>> +      if (ObjectMonitor::_sync_ ## f != NULL) {                     \
>> +        bool do_the_op = false;                                     \
>> +        if (is_safe) {                                              \
>> +          if (PerfDataManager::has_PerfData()) {                    \
>> +            do_the_op = true;                                       \
>> +          }                                                         \
>> +        } else if (PerfDataManager::has_PerfData_with_acquire()) {  \
>> +          do_the_op = true;                                         \
>> +        }                                                           \
>> +        if (do_the_op) {                                            \
>> +          ObjectMonitor::_sync_ ## f->op_str;                       \
>> +        }                                                           \
>> +      }                                                             \
>> +    } while (0)
>> +
>>     static PerfCounter * _sync_ContendedLockAttempts;
>>     static PerfCounter * _sync_FutileWakeups;
>>     static PerfCounter * _sync_Parks;
>>
>>
>>
>> Hopefully, you like this version better?
>>
>> Dan
>>
>>
>>>
>>>
>>>>
>>>>> - change PerfDataManager::destroy() to simply set _has_PerfData
>>>>>    to zero (field is volatile) and then use a fence() to prevent
>>>>>    any reordering of operations in any direction; it's only done
>>>>>    once during VM shutdown so...
>>>>
>>>> See previous email :)
>>>
>>> Yup. I did and I'd still like to keep it for the reasons stated
>>> in my reply to the previous e-mail. You were good with this code
>>> as a release_store() of '0' so I kind of expected you to be OK
>>> with the stronger fence().
>>>
>>>
>>>>
>>>>> - change perfMemory_exit() to only call PerfDataManager::destroy()
>>>>>    when called at a safepoint and when the StatSample is not
>>>>>    running; this means when the VM is aborting, we no longer have
>>>>>    a race between the original crash report and this code path.
>>>>
>>>> Sounds reasonable.
>>>
>>> Thanks. I hope it makes Kim happy also.
>>>
>>>
>>>>
>>>> I have few specific comments/nits:
>>>>
>>>> src/share/vm/runtime/perfData.cpp
>>>>
>>>>  43 volatile jint   PerfDataManager::_has_PerfData = false;
>>>>
>>>> Nit: jint should be initialized with 0 not false.
>>>
>>> Will fix; oops...that made it past round 0...
>>>
>>>
>>>>
>>>>  287   // manipulation before we free the memory. The two alternatives
>>>>  288   // are 1) leak the PerfData memory or 2) do some form of 
>>>> ordered
>>>>  289   // access before every PerfData operation.
>>>>
>>>> No amount of ordered access will fix the race - only synchronization
>>>> of some form.
>>>
>>> I thought I had changed that comment. Will fix.
>>>
>>>
>>>>
>>>>  315 void PerfDataManager::add_item(PerfData* p, bool sampled) {
>>>>  316
>>>>  317   MutexLocker ml(PerfDataManager_lock);
>>>>  318
>>>>  319   if (_all == NULL) {
>>>>  320     _all = new PerfDataList(100);
>>>>  321     OrderAccess::release_store(&_has_PerfData, 1);
>>>>  322   }
>>>>
>>>> I can't see any purpose for a release_store here. You have one-time
>>>> initialization code guarded with a mutex. A release_store adds
>>>> nothing here - other than future head-scratching from maintainers of
>>>> this code.
>>>
>>> Agreed. Not sure how I missed the fact that we were in the
>>> locked block.  oops...that made it past round 0...
>>>
>>> Thanks again for the review. Will generate another webrev...
>>>
>>> Dan
>>>
>>>
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>> I believe that I've addressed all comments from round 0.
>>>>>
>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>> On 8/25/15 3:08 PM, Daniel D. Daugherty wrote:
>>>>>> Greetings,
>>>>>>
>>>>>> I have a "fix" for a long standing race between JVM shutdown and the
>>>>>> JVM statistics subsystem:
>>>>>>
>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>
>>>>>> Webrev URL:
>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
>>>>>>
>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch
>>>>>>          Aurora Adhoc vm.tmtools batch
>>>>>>          Kim's repro sequence for JDK-8049304
>>>>>>          Kim's repro sequence for JDK-8129978
>>>>>>          JPRT -testset hotspot
>>>>>>
>>>>>> This "fix":
>>>>>>
>>>>>> - adds a volatile flag to record whether PerfDataManager is holding
>>>>>>   data (PerfData objects)
>>>>>> - adds PerfDataManager::has_PerfData() to return the flag
>>>>>> - changes the Java monitor subsystem's use of PerfData to
>>>>>>   check both allocation of the monitor subsystem specific
>>>>>>   PerfData object and the new PerfDataManager::has_PerfData()
>>>>>>   return value
>>>>>>
>>>>>> If the global 'UsePerfData' option is false, the system works as
>>>>>> it did before. If 'UsePerfData' is true (the default on non-embedded
>>>>>> systems), the Java monitor subsystem will allocate a number of
>>>>>> PerfData objects to record information. The objects will record
>>>>>> information about Java monitor subsystem until the JVM shuts down.
>>>>>>
>>>>>> When the JVM starts to shutdown, the new PerfDataManager flag will
>>>>>> change to false and the Java monitor subsystem will stop using the
>>>>>> PerfData objects. This is the new behavior. As noted in the comments
>>>>>> I added to the code, the race is still present; I'm just changing
>>>>>> the order and the timing to reduce the likelihood of the crash.
>>>>>>
>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>>
>>


