RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Wed Sep 2 17:19:02 UTC 2015

Thanks!

Dan

On 9/2/15 11:06 AM, Tom Benson wrote:
> Looks good to me!
> Tnx,
> Tom
>
> On 9/2/2015 12:40 PM, Daniel D. Daugherty wrote:
>> Just for the record, here are the comment context diffs:
>>
>> $ diff -c src/share/vm/runtime/perfMemory.cpp{.cr2,}*** 
>> src/share/vm/runtime/perfMemory.cpp.cr2     Tue Sep  1 19:39:45 2015
>> --- src/share/vm/runtime/perfMemory.cpp Wed Sep  2 09:35:48 2015
>> ***************
>> *** 70,76 ****
>>     // objects that are currently being used by running JavaThreads
>>     // or the StatSampler. This method is invoked while we are not at
>>     // a safepoint during a VM abort so leaving the PerfData objects
>> !   // around may also help diagnose the failure.
>>     //
>>     if (SafepointSynchronize::is_at_safepoint() && 
>> !StatSampler::is_active()) {
>>       PerfDataManager::destroy();
>> --- 70,78 ----
>>     // objects that are currently being used by running JavaThreads
>>     // or the StatSampler. This method is invoked while we are not at
>>     // a safepoint during a VM abort so leaving the PerfData objects
>> !   // around may also help diagnose the failure. In rare cases,
>> !   // PerfData objects are used in parallel with a safepoint. See
>> !   // the work around in PerfDataManager::destroy().
>>     //
>>     if (SafepointSynchronize::is_at_safepoint() && 
>> !StatSampler::is_active()) {
>>       PerfDataManager::destroy();
>>
>>
>> $ diff -c src/share/vm/runtime/objectMonitor.cpp{.cr2,}
>> *** src/share/vm/runtime/objectMonitor.cpp.cr2  Tue Sep  1 19:23:35 2015
>> --- src/share/vm/runtime/objectMonitor.cpp      Wed Sep  2 09:37:08 2015
>> ***************
>> *** 572,577 ****
>> --- 572,579 ----
>>       // That is by design - we trade "lossy" counters which are 
>> exposed to
>>       // races during updates for a lower probe effect.
>>       TEVENT(Inflated enter - Futile wakeup);
>> +     // This PerfData object can be used in a parallel with a 
>> safepoint.
>> +     // See the work around in PerfDataManager::destroy().
>>       OM_PERFDATA_OP(FutileWakeups, inc());
>>       ++nWakeups;
>>
>> ***************
>> *** 744,749 ****
>> --- 746,753 ----
>>       // *must* retry  _owner before parking.
>>       OrderAccess::fence();
>>
>> +     // This PerfData object can be used in a parallel with a 
>> safepoint.
>> +     // See the work around in PerfDataManager::destroy().
>>       OM_PERFDATA_OP(FutileWakeups, inc());
>>     }
>>
>>
>> Dan
>>
>>
>> On 9/2/15 10:03 AM, Daniel D. Daugherty wrote:
>>> On 9/2/15 9:40 AM, Tom Benson wrote:
>>>> Hi Dan,
>>>> OK.  I didn't review what was added in round 1 once you said it was 
>>>> all removed for round 2.  8^)
>>>
>>> Not "all", but I did remove "most" of the round 1 changes :-)
>>> The changes I kept are called in the list below.
>>>
>>>
>>>> It would be great if what you have in your first paragraph below 
>>>> was added to the comments.   I think the existing comment in 
>>>> perfMemory_exit implies we're safe to remove the PerfData objects 
>>>> without fear of them being in use because we're at a safepoint.
>>>
>>> I think I'll add this sentence to the comment in perfMemory_exit():
>>>
>>>     // In rare cases, PerfData objects are used in parallel with a
>>>     // safepoint. See the work around in PerfDataManager::destroy().
>>>
>>>
>>>> Perhaps better to have it (the new comment) in 
>>>> PerfDataManager::destroy(),  because it seems so weird to have a 
>>>> sleep in the VM thread during a safepoint, even in a shutdown path.
>>>
>>> I think the PerfDataManager::destroy() comment is clear about
>>> the race we're trying avoid. Again, if you have specific wording
>>> changes to suggest to make it more clear... I'll take them. :-)
>>>
>>> I think I'll also add this comment:
>>>
>>>     // This PerfData object can be used in a parallel with a safepoint.
>>>     // See the work around in PerfDataManager::destroy().
>>>
>>> above these lines in src/share/vm/runtime/objectMonitor.cpp:
>>>
>>> 575     OM_PERFDATA_OP(FutileWakeups, inc());
>>> 747     OM_PERFDATA_OP(FutileWakeups, inc());
>>>
>>>
>>>> Any interest in asserting that you're at a safepoint in 
>>>> PerfDataManager::destroy?   Just a thought.
>>>
>>> I'd rather not add an assert() at this time.
>>>
>>> Are you good with the above comment additions? Do you need to
>>> see another webrev when I make those changes?
>>>
>>> Dan
>>>
>>>
>>>> Tom
>>>>
>>>> On 9/2/2015 11:15 AM, Daniel D. Daugherty wrote:
>>>>> On 9/2/15 8:49 AM, Tom Benson wrote:
>>>>>> Hi,
>>>>>> I'm a bit confused on one point... Since you now only call 
>>>>>> PerfDataManager::destroy at a safepoint, why do you still have 
>>>>>> the comment about 'the race'  and the sleep?
>>>>>
>>>>> Because the two "futile wakeup" counter updates in the monitor
>>>>> subsystem can execute in parallel with a safepoint. The JavaThread
>>>>> state is "blocked" so the safepoint subsystem will see the JavaThread
>>>>> as "at a safepoint" when it is actually executing the code to
>>>>> increment the counter.
>>>>>
>>>>> That's what the "is_safe" parameter to the OM_PERFDATA_OP macro was
>>>>> all about in the round 1 code review. However, David convinced me
>>>>> that all that logic didn't guarantee we wouldn't hit the race so
>>>>> I ripped it all out in the round 2 code review (this one).
>>>>>
>>>>> Does this help your confusion?
>>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>>> Tom
>>>>>>
>>>>>> On 9/2/2015 7:52 AM, Daniel D. Daugherty wrote:
>>>>>>> David,
>>>>>>>
>>>>>>> Thanks for the very fast re-review!
>>>>>>>
>>>>>>> Enjoy your vacation!
>>>>>>>
>>>>>>> Dan
>>>>>>>
>>>>>>>
>>>>>>> On 9/2/15 2:54 AM, David Holmes wrote:
>>>>>>>> Hi Dan,
>>>>>>>>
>>>>>>>> On 2/09/2015 2:45 PM, Daniel D. Daugherty wrote:
>>>>>>>>> Greetings,
>>>>>>>>>
>>>>>>>>> I've updated the "fix" for this bug based on code review comments
>>>>>>>>> received in round 1. I've dropped most of the changes from 
>>>>>>>>> round 1
>>>>>>>>> with a couple of exceptions.
>>>>>>>>
>>>>>>>> I have no further comments - it all looks good to me. If others 
>>>>>>>> want the pendulum to swing back a little from this position 
>>>>>>>> then ... nothing that has been suggested is functionally wrong. :)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David
>>>>>>>> -----
>>>>>>>>
>>>>>>>> PS. When you get back from vacation I'll be gone for a month. 
>>>>>>>> That gives you a large window to push other things through with 
>>>>>>>> less stress ;-)
>>>>>>>>
>>>>>>>>
>>>>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>>>>
>>>>>>>>> Webrev URL: 
>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/
>>>>>>>>>
>>>>>>>>> The easiest way to re-review is to download the two patch files
>>>>>>>>> (round 0 and round 2) and view them in your favorite file 
>>>>>>>>> merge tool:
>>>>>>>>>
>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch 
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/hotspot.patch 
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>>>>>>>>           Aurora Adhoc vm.tmtools batch (in process)
>>>>>>>>>           Kim's repro sequence for JDK-8049304
>>>>>>>>>           Kim's repro sequence for JDK-8129978
>>>>>>>>>           JPRT -testset hotspot
>>>>>>>>>
>>>>>>>>> Changes between round 0 and round 2:
>>>>>>>>>
>>>>>>>>> - clarify a few comments
>>>>>>>>> - init _has_PerfData flag with '0' (instead of false)
>>>>>>>>> - drop unnecessary use OrderAccess::release_store() to set
>>>>>>>>>    _has_PerfData to '1' (we're in a Mutex)
>>>>>>>>> - change perfMemory_exit() to only call 
>>>>>>>>> PerfDataManager::destroy()
>>>>>>>>>    when called at a safepoint and when the StatSampler is not
>>>>>>>>>    running; this means when the VM is aborting, we no longer have
>>>>>>>>>    a race between the original crash report and this code path.
>>>>>>>>>
>>>>>>>>> Changes between round 1 and round 2:
>>>>>>>>>
>>>>>>>>> - clarify a few comments
>>>>>>>>> - drop is_safe parameter to OM_PERFDATA_OP macro
>>>>>>>>> - init _has_PerfData flag with '0' (instead of false)
>>>>>>>>> - drop OrderAccess::fence() call before 
>>>>>>>>> os::naked_short_sleep() call
>>>>>>>>> - drop PerfDataManager::has_PerfData_with_acquire()
>>>>>>>>> - drop unnecessary use OrderAccess::release_store() to set
>>>>>>>>>    _has_PerfData to '1' (we're in a Mutex)
>>>>>>>>>
>>>>>>>>> I believe that I've addressed all comments from round 0 and
>>>>>>>>> from round 1.
>>>>>>>>>
>>>>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>>>>
>>>>>>>>> Dan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 8/31/15 4:51 PM, Daniel D. Daugherty wrote:
>>>>>>>>>> Greetings,
>>>>>>>>>>
>>>>>>>>>> I've updated the "fix" for this bug based on code review 
>>>>>>>>>> comments
>>>>>>>>>> received in round 0.
>>>>>>>>>>
>>>>>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>>>>>
>>>>>>>>>> Webrev URL:
>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/
>>>>>>>>>>
>>>>>>>>>> The easiest way to re-review is to download the two patch files
>>>>>>>>>> and view them in your favorite file merge tool:
>>>>>>>>>>
>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch 
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/hotspot.patch 
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>>>>>>>>>          Aurora Adhoc vm.tmtools batch (in process)
>>>>>>>>>>          Kim's repro sequence for JDK-8049304
>>>>>>>>>>          Kim's repro sequence for JDK-8129978
>>>>>>>>>>          JPRT -testset hotspot
>>>>>>>>>>
>>>>>>>>>> Changes between round 0 and round 1:
>>>>>>>>>>
>>>>>>>>>> - add an 'is_safe' parameter to the OM_PERFDATA_OP macro;
>>>>>>>>>>   safepoint-safe callers can access _has_PerfData flag directly;
>>>>>>>>>>   non-safepoint-safe callers use a load-acquire to fetch the
>>>>>>>>>>   current _has_PerfData flag value
>>>>>>>>>> - change PerfDataManager::destroy() to simply set _has_PerfData
>>>>>>>>>>   to zero (field is volatile) and then use a fence() to prevent
>>>>>>>>>>   any reordering of operations in any direction; it's only done
>>>>>>>>>>   once during VM shutdown so...
>>>>>>>>>> - change perfMemory_exit() to only call 
>>>>>>>>>> PerfDataManager::destroy()
>>>>>>>>>>   when called at a safepoint and when the StatSample is not
>>>>>>>>>>   running; this means when the VM is aborting, we no longer have
>>>>>>>>>>   a race between the original crash report and this code path.
>>>>>>>>>>
>>>>>>>>>> I believe that I've addressed all comments from round 0.
>>>>>>>>>>
>>>>>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>>>>>
>>>>>>>>>> Dan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 8/25/15 3:08 PM, Daniel D. Daugherty wrote:
>>>>>>>>>>> Greetings,
>>>>>>>>>>>
>>>>>>>>>>> I have a "fix" for a long standing race between JVM shutdown 
>>>>>>>>>>> and the
>>>>>>>>>>> JVM statistics subsystem:
>>>>>>>>>>>
>>>>>>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>>>>>>
>>>>>>>>>>> Webrev URL:
>>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
>>>>>>>>>>>
>>>>>>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch
>>>>>>>>>>>          Aurora Adhoc vm.tmtools batch
>>>>>>>>>>>          Kim's repro sequence for JDK-8049304
>>>>>>>>>>>          Kim's repro sequence for JDK-8129978
>>>>>>>>>>>          JPRT -testset hotspot
>>>>>>>>>>>
>>>>>>>>>>> This "fix":
>>>>>>>>>>>
>>>>>>>>>>> - adds a volatile flag to record whether PerfDataManager is 
>>>>>>>>>>> holding
>>>>>>>>>>>   data (PerfData objects)
>>>>>>>>>>> - adds PerfDataManager::has_PerfData() to return the flag
>>>>>>>>>>> - changes the Java monitor subsystem's use of PerfData to
>>>>>>>>>>>   check both allocation of the monitor subsystem specific
>>>>>>>>>>>   PerfData object and the new PerfDataManager::has_PerfData()
>>>>>>>>>>>   return value
>>>>>>>>>>>
>>>>>>>>>>> If the global 'UsePerfData' option is false, the system 
>>>>>>>>>>> works as
>>>>>>>>>>> it did before. If 'UsePerfData' is true (the default on 
>>>>>>>>>>> non-embedded
>>>>>>>>>>> systems), the Java monitor subsystem will allocate a number of
>>>>>>>>>>> PerfData objects to record information. The objects will record
>>>>>>>>>>> information about Java monitor subsystem until the JVM shuts 
>>>>>>>>>>> down.
>>>>>>>>>>>
>>>>>>>>>>> When the JVM starts to shutdown, the new PerfDataManager 
>>>>>>>>>>> flag will
>>>>>>>>>>> change to false and the Java monitor subsystem will stop 
>>>>>>>>>>> using the
>>>>>>>>>>> PerfData objects. This is the new behavior. As noted in the 
>>>>>>>>>>> comments
>>>>>>>>>>> I added to the code, the race is still present; I'm just 
>>>>>>>>>>> changing
>>>>>>>>>>> the order and the timing to reduce the likelihood of the crash.
>>>>>>>>>>>
>>>>>>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>>>>>>
>>>>>>>>>>> Dan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>