RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Wed Sep 2 19:34:02 UTC 2015

Dan,

I did not do a complete review - I have been following the conversation and you all seem to have this completely
in hand so I didn't jump in. Yell if need more.

thanks,
Karen

On Sep 2, 2015, at 3:27 PM, Daniel D. Daugherty wrote:

> On 9/2/15 12:50 PM, Karen Kinnear wrote:
>> Dan,
>> 
>> Can you possibly change the two "in a parallel" to "in parallel" ?
> 
> Nice catch. Will fix.
> 
> Did you do a complete review or did that glaring typo just
> jump out at you?
> 
> Dan
> 
> 
>> 
>> thanks,
>> Karen
>> 
>> On Sep 2, 2015, at 1:06 PM, Tom Benson wrote:
>> 
>>> Looks good to me!
>>> Tnx,
>>> Tom
>>> 
>>> On 9/2/2015 12:40 PM, Daniel D. Daugherty wrote:
>>>> Just for the record, here are the comment context diffs:
>>>> 
>>>> $ diff -c src/share/vm/runtime/perfMemory.cpp{.cr2,}*** src/share/vm/runtime/perfMemory.cpp.cr2     Tue Sep  1 19:39:45 2015
>>>> --- src/share/vm/runtime/perfMemory.cpp Wed Sep  2 09:35:48 2015
>>>> ***************
>>>> *** 70,76 ****
>>>>    // objects that are currently being used by running JavaThreads
>>>>    // or the StatSampler. This method is invoked while we are not at
>>>>    // a safepoint during a VM abort so leaving the PerfData objects
>>>> !   // around may also help diagnose the failure.
>>>>    //
>>>>    if (SafepointSynchronize::is_at_safepoint() && !StatSampler::is_active()) {
>>>>      PerfDataManager::destroy();
>>>> --- 70,78 ----
>>>>    // objects that are currently being used by running JavaThreads
>>>>    // or the StatSampler. This method is invoked while we are not at
>>>>    // a safepoint during a VM abort so leaving the PerfData objects
>>>> !   // around may also help diagnose the failure. In rare cases,
>>>> !   // PerfData objects are used in parallel with a safepoint. See
>>>> !   // the work around in PerfDataManager::destroy().
>>>>    //
>>>>    if (SafepointSynchronize::is_at_safepoint() && !StatSampler::is_active()) {
>>>>      PerfDataManager::destroy();
>>>> 
>>>> 
>>>> $ diff -c src/share/vm/runtime/objectMonitor.cpp{.cr2,}
>>>> *** src/share/vm/runtime/objectMonitor.cpp.cr2  Tue Sep  1 19:23:35 2015
>>>> --- src/share/vm/runtime/objectMonitor.cpp      Wed Sep  2 09:37:08 2015
>>>> ***************
>>>> *** 572,577 ****
>>>> --- 572,579 ----
>>>>      // That is by design - we trade "lossy" counters which are exposed to
>>>>      // races during updates for a lower probe effect.
>>>>      TEVENT(Inflated enter - Futile wakeup);
>>>> +     // This PerfData object can be used in a parallel with a safepoint.
>>>> +     // See the work around in PerfDataManager::destroy().
>>>>      OM_PERFDATA_OP(FutileWakeups, inc());
>>>>      ++nWakeups;
>>>> 
>>>> ***************
>>>> *** 744,749 ****
>>>> --- 746,753 ----
>>>>      // *must* retry  _owner before parking.
>>>>      OrderAccess::fence();
>>>> 
>>>> +     // This PerfData object can be used in a parallel with a safepoint.
>>>> +     // See the work around in PerfDataManager::destroy().
>>>>      OM_PERFDATA_OP(FutileWakeups, inc());
>>>>    }
>>>> 
>>>> 
>>>> Dan
>>>> 
>>>> 
>>>> On 9/2/15 10:03 AM, Daniel D. Daugherty wrote:
>>>>> On 9/2/15 9:40 AM, Tom Benson wrote:
>>>>>> Hi Dan,
>>>>>> OK.  I didn't review what was added in round 1 once you said it was all removed for round 2.  8^)
>>>>> Not "all", but I did remove "most" of the round 1 changes :-)
>>>>> The changes I kept are called in the list below.
>>>>> 
>>>>> 
>>>>>> It would be great if what you have in your first paragraph below was added to the comments.   I think the existing comment in perfMemory_exit implies we're safe to remove the PerfData objects without fear of them being in use because we're at a safepoint.
>>>>> I think I'll add this sentence to the comment in perfMemory_exit():
>>>>> 
>>>>>    // In rare cases, PerfData objects are used in parallel with a
>>>>>    // safepoint. See the work around in PerfDataManager::destroy().
>>>>> 
>>>>> 
>>>>>> Perhaps better to have it (the new comment) in PerfDataManager::destroy(),  because it seems so weird to have a sleep in the VM thread during a safepoint, even in a shutdown path.
>>>>> I think the PerfDataManager::destroy() comment is clear about
>>>>> the race we're trying avoid. Again, if you have specific wording
>>>>> changes to suggest to make it more clear... I'll take them. :-)
>>>>> 
>>>>> I think I'll also add this comment:
>>>>> 
>>>>>    // This PerfData object can be used in a parallel with a safepoint.
>>>>>    // See the work around in PerfDataManager::destroy().
>>>>> 
>>>>> above these lines in src/share/vm/runtime/objectMonitor.cpp:
>>>>> 
>>>>> 575     OM_PERFDATA_OP(FutileWakeups, inc());
>>>>> 747     OM_PERFDATA_OP(FutileWakeups, inc());
>>>>> 
>>>>> 
>>>>>> Any interest in asserting that you're at a safepoint in PerfDataManager::destroy?   Just a thought.
>>>>> I'd rather not add an assert() at this time.
>>>>> 
>>>>> Are you good with the above comment additions? Do you need to
>>>>> see another webrev when I make those changes?
>>>>> 
>>>>> Dan
>>>>> 
>>>>> 
>>>>>> Tom
>>>>>> 
>>>>>> On 9/2/2015 11:15 AM, Daniel D. Daugherty wrote:
>>>>>>> On 9/2/15 8:49 AM, Tom Benson wrote:
>>>>>>>> Hi,
>>>>>>>> I'm a bit confused on one point... Since you now only call PerfDataManager::destroy at a safepoint, why do you still have the comment about 'the race'  and the sleep?
>>>>>>> Because the two "futile wakeup" counter updates in the monitor
>>>>>>> subsystem can execute in parallel with a safepoint. The JavaThread
>>>>>>> state is "blocked" so the safepoint subsystem will see the JavaThread
>>>>>>> as "at a safepoint" when it is actually executing the code to
>>>>>>> increment the counter.
>>>>>>> 
>>>>>>> That's what the "is_safe" parameter to the OM_PERFDATA_OP macro was
>>>>>>> all about in the round 1 code review. However, David convinced me
>>>>>>> that all that logic didn't guarantee we wouldn't hit the race so
>>>>>>> I ripped it all out in the round 2 code review (this one).
>>>>>>> 
>>>>>>> Does this help your confusion?
>>>>>>> 
>>>>>>> Dan
>>>>>>> 
>>>>>>> 
>>>>>>>> Tom
>>>>>>>> 
>>>>>>>> On 9/2/2015 7:52 AM, Daniel D. Daugherty wrote:
>>>>>>>>> David,
>>>>>>>>> 
>>>>>>>>> Thanks for the very fast re-review!
>>>>>>>>> 
>>>>>>>>> Enjoy your vacation!
>>>>>>>>> 
>>>>>>>>> Dan
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 9/2/15 2:54 AM, David Holmes wrote:
>>>>>>>>>> Hi Dan,
>>>>>>>>>> 
>>>>>>>>>> On 2/09/2015 2:45 PM, Daniel D. Daugherty wrote:
>>>>>>>>>>> Greetings,
>>>>>>>>>>> 
>>>>>>>>>>> I've updated the "fix" for this bug based on code review comments
>>>>>>>>>>> received in round 1. I've dropped most of the changes from round 1
>>>>>>>>>>> with a couple of exceptions.
>>>>>>>>>> I have no further comments - it all looks good to me. If others want the pendulum to swing back a little from this position then ... nothing that has been suggested is functionally wrong. :)
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> David
>>>>>>>>>> -----
>>>>>>>>>> 
>>>>>>>>>> PS. When you get back from vacation I'll be gone for a month. That gives you a large window to push other things through with less stress ;-)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>>>>>> 
>>>>>>>>>>> Webrev URL: http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/
>>>>>>>>>>> 
>>>>>>>>>>> The easiest way to re-review is to download the two patch files
>>>>>>>>>>> (round 0 and round 2) and view them in your favorite file merge tool:
>>>>>>>>>>> 
>>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch
>>>>>>>>>>> 
>>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/hotspot.patch
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>>>>>>>>>>          Aurora Adhoc vm.tmtools batch (in process)
>>>>>>>>>>>          Kim's repro sequence for JDK-8049304
>>>>>>>>>>>          Kim's repro sequence for JDK-8129978
>>>>>>>>>>>          JPRT -testset hotspot
>>>>>>>>>>> 
>>>>>>>>>>> Changes between round 0 and round 2:
>>>>>>>>>>> 
>>>>>>>>>>> - clarify a few comments
>>>>>>>>>>> - init _has_PerfData flag with '0' (instead of false)
>>>>>>>>>>> - drop unnecessary use OrderAccess::release_store() to set
>>>>>>>>>>>   _has_PerfData to '1' (we're in a Mutex)
>>>>>>>>>>> - change perfMemory_exit() to only call PerfDataManager::destroy()
>>>>>>>>>>>   when called at a safepoint and when the StatSampler is not
>>>>>>>>>>>   running; this means when the VM is aborting, we no longer have
>>>>>>>>>>>   a race between the original crash report and this code path.
>>>>>>>>>>> 
>>>>>>>>>>> Changes between round 1 and round 2:
>>>>>>>>>>> 
>>>>>>>>>>> - clarify a few comments
>>>>>>>>>>> - drop is_safe parameter to OM_PERFDATA_OP macro
>>>>>>>>>>> - init _has_PerfData flag with '0' (instead of false)
>>>>>>>>>>> - drop OrderAccess::fence() call before os::naked_short_sleep() call
>>>>>>>>>>> - drop PerfDataManager::has_PerfData_with_acquire()
>>>>>>>>>>> - drop unnecessary use OrderAccess::release_store() to set
>>>>>>>>>>>   _has_PerfData to '1' (we're in a Mutex)
>>>>>>>>>>> 
>>>>>>>>>>> I believe that I've addressed all comments from round 0 and
>>>>>>>>>>> from round 1.
>>>>>>>>>>> 
>>>>>>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>>>>>> 
>>>>>>>>>>> Dan
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 8/31/15 4:51 PM, Daniel D. Daugherty wrote:
>>>>>>>>>>>> Greetings,
>>>>>>>>>>>> 
>>>>>>>>>>>> I've updated the "fix" for this bug based on code review comments
>>>>>>>>>>>> received in round 0.
>>>>>>>>>>>> 
>>>>>>>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>>>>>>> 
>>>>>>>>>>>> Webrev URL:
>>>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/
>>>>>>>>>>>> 
>>>>>>>>>>>> The easiest way to re-review is to download the two patch files
>>>>>>>>>>>> and view them in your favorite file merge tool:
>>>>>>>>>>>> 
>>>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch
>>>>>>>>>>>> 
>>>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/hotspot.patch
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>>>>>>>>>>>         Aurora Adhoc vm.tmtools batch (in process)
>>>>>>>>>>>>         Kim's repro sequence for JDK-8049304
>>>>>>>>>>>>         Kim's repro sequence for JDK-8129978
>>>>>>>>>>>>         JPRT -testset hotspot
>>>>>>>>>>>> 
>>>>>>>>>>>> Changes between round 0 and round 1:
>>>>>>>>>>>> 
>>>>>>>>>>>> - add an 'is_safe' parameter to the OM_PERFDATA_OP macro;
>>>>>>>>>>>>  safepoint-safe callers can access _has_PerfData flag directly;
>>>>>>>>>>>>  non-safepoint-safe callers use a load-acquire to fetch the
>>>>>>>>>>>>  current _has_PerfData flag value
>>>>>>>>>>>> - change PerfDataManager::destroy() to simply set _has_PerfData
>>>>>>>>>>>>  to zero (field is volatile) and then use a fence() to prevent
>>>>>>>>>>>>  any reordering of operations in any direction; it's only done
>>>>>>>>>>>>  once during VM shutdown so...
>>>>>>>>>>>> - change perfMemory_exit() to only call PerfDataManager::destroy()
>>>>>>>>>>>>  when called at a safepoint and when the StatSample is not
>>>>>>>>>>>>  running; this means when the VM is aborting, we no longer have
>>>>>>>>>>>>  a race between the original crash report and this code path.
>>>>>>>>>>>> 
>>>>>>>>>>>> I believe that I've addressed all comments from round 0.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>>>>>>> 
>>>>>>>>>>>> Dan
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 8/25/15 3:08 PM, Daniel D. Daugherty wrote:
>>>>>>>>>>>>> Greetings,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I have a "fix" for a long standing race between JVM shutdown and the
>>>>>>>>>>>>> JVM statistics subsystem:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Webrev URL:
>>>>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch
>>>>>>>>>>>>>         Aurora Adhoc vm.tmtools batch
>>>>>>>>>>>>>         Kim's repro sequence for JDK-8049304
>>>>>>>>>>>>>         Kim's repro sequence for JDK-8129978
>>>>>>>>>>>>>         JPRT -testset hotspot
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This "fix":
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - adds a volatile flag to record whether PerfDataManager is holding
>>>>>>>>>>>>>  data (PerfData objects)
>>>>>>>>>>>>> - adds PerfDataManager::has_PerfData() to return the flag
>>>>>>>>>>>>> - changes the Java monitor subsystem's use of PerfData to
>>>>>>>>>>>>>  check both allocation of the monitor subsystem specific
>>>>>>>>>>>>>  PerfData object and the new PerfDataManager::has_PerfData()
>>>>>>>>>>>>>  return value
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If the global 'UsePerfData' option is false, the system works as
>>>>>>>>>>>>> it did before. If 'UsePerfData' is true (the default on non-embedded
>>>>>>>>>>>>> systems), the Java monitor subsystem will allocate a number of
>>>>>>>>>>>>> PerfData objects to record information. The objects will record
>>>>>>>>>>>>> information about Java monitor subsystem until the JVM shuts down.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> When the JVM starts to shutdown, the new PerfDataManager flag will
>>>>>>>>>>>>> change to false and the Java monitor subsystem will stop using the
>>>>>>>>>>>>> PerfData objects. This is the new behavior. As noted in the comments
>>>>>>>>>>>>> I added to the code, the race is still present; I'm just changing
>>>>>>>>>>>>> the order and the timing to reduce the likelihood of the crash.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Dan
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>> 
>