RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()
Daniel D. Daugherty
daniel.daugherty at oracle.com
Wed Sep 2 19:41:04 UTC 2015
On 9/2/15 1:34 PM, Karen Kinnear wrote:
> Dan,
>
> I did not do a complete review - I have been following the conversation and you all seem to have this completely
> in hand so I didn't jump in. Yell if need more.
I have four reviewers so far. Just waiting for Kim to decide
on the latest version...
Dan
>
> thanks,
> Karen
>
> On Sep 2, 2015, at 3:27 PM, Daniel D. Daugherty wrote:
>
>> On 9/2/15 12:50 PM, Karen Kinnear wrote:
>>> Dan,
>>>
>>> Can you possibly change the two "in a parallel" to "in parallel" ?
>> Nice catch. Will fix.
>>
>> Did you do a complete review or did that glaring typo just
>> jump out at you?
>>
>> Dan
>>
>>
>>> thanks,
>>> Karen
>>>
>>> On Sep 2, 2015, at 1:06 PM, Tom Benson wrote:
>>>
>>>> Looks good to me!
>>>> Tnx,
>>>> Tom
>>>>
>>>> On 9/2/2015 12:40 PM, Daniel D. Daugherty wrote:
>>>>> Just for the record, here are the comment context diffs:
>>>>>
>>>>> $ diff -c src/share/vm/runtime/perfMemory.cpp{.cr2,}*** src/share/vm/runtime/perfMemory.cpp.cr2 Tue Sep 1 19:39:45 2015
>>>>> --- src/share/vm/runtime/perfMemory.cpp Wed Sep 2 09:35:48 2015
>>>>> ***************
>>>>> *** 70,76 ****
>>>>> // objects that are currently being used by running JavaThreads
>>>>> // or the StatSampler. This method is invoked while we are not at
>>>>> // a safepoint during a VM abort so leaving the PerfData objects
>>>>> ! // around may also help diagnose the failure.
>>>>> //
>>>>> if (SafepointSynchronize::is_at_safepoint() && !StatSampler::is_active()) {
>>>>> PerfDataManager::destroy();
>>>>> --- 70,78 ----
>>>>> // objects that are currently being used by running JavaThreads
>>>>> // or the StatSampler. This method is invoked while we are not at
>>>>> // a safepoint during a VM abort so leaving the PerfData objects
>>>>> ! // around may also help diagnose the failure. In rare cases,
>>>>> ! // PerfData objects are used in parallel with a safepoint. See
>>>>> ! // the work around in PerfDataManager::destroy().
>>>>> //
>>>>> if (SafepointSynchronize::is_at_safepoint() && !StatSampler::is_active()) {
>>>>> PerfDataManager::destroy();
>>>>>
>>>>>
>>>>> $ diff -c src/share/vm/runtime/objectMonitor.cpp{.cr2,}
>>>>> *** src/share/vm/runtime/objectMonitor.cpp.cr2 Tue Sep 1 19:23:35 2015
>>>>> --- src/share/vm/runtime/objectMonitor.cpp Wed Sep 2 09:37:08 2015
>>>>> ***************
>>>>> *** 572,577 ****
>>>>> --- 572,579 ----
>>>>> // That is by design - we trade "lossy" counters which are exposed to
>>>>> // races during updates for a lower probe effect.
>>>>> TEVENT(Inflated enter - Futile wakeup);
>>>>> + // This PerfData object can be used in a parallel with a safepoint.
>>>>> + // See the work around in PerfDataManager::destroy().
>>>>> OM_PERFDATA_OP(FutileWakeups, inc());
>>>>> ++nWakeups;
>>>>>
>>>>> ***************
>>>>> *** 744,749 ****
>>>>> --- 746,753 ----
>>>>> // *must* retry _owner before parking.
>>>>> OrderAccess::fence();
>>>>>
>>>>> + // This PerfData object can be used in a parallel with a safepoint.
>>>>> + // See the work around in PerfDataManager::destroy().
>>>>> OM_PERFDATA_OP(FutileWakeups, inc());
>>>>> }
>>>>>
>>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>> On 9/2/15 10:03 AM, Daniel D. Daugherty wrote:
>>>>>> On 9/2/15 9:40 AM, Tom Benson wrote:
>>>>>>> Hi Dan,
>>>>>>> OK. I didn't review what was added in round 1 once you said it was all removed for round 2. 8^)
>>>>>> Not "all", but I did remove "most" of the round 1 changes :-)
>>>>>> The changes I kept are called in the list below.
>>>>>>
>>>>>>
>>>>>>> It would be great if what you have in your first paragraph below was added to the comments. I think the existing comment in perfMemory_exit implies we're safe to remove the PerfData objects without fear of them being in use because we're at a safepoint.
>>>>>> I think I'll add this sentence to the comment in perfMemory_exit():
>>>>>>
>>>>>> // In rare cases, PerfData objects are used in parallel with a
>>>>>> // safepoint. See the work around in PerfDataManager::destroy().
>>>>>>
>>>>>>
>>>>>>> Perhaps better to have it (the new comment) in PerfDataManager::destroy(), because it seems so weird to have a sleep in the VM thread during a safepoint, even in a shutdown path.
>>>>>> I think the PerfDataManager::destroy() comment is clear about
>>>>>> the race we're trying avoid. Again, if you have specific wording
>>>>>> changes to suggest to make it more clear... I'll take them. :-)
>>>>>>
>>>>>> I think I'll also add this comment:
>>>>>>
>>>>>> // This PerfData object can be used in a parallel with a safepoint.
>>>>>> // See the work around in PerfDataManager::destroy().
>>>>>>
>>>>>> above these lines in src/share/vm/runtime/objectMonitor.cpp:
>>>>>>
>>>>>> 575 OM_PERFDATA_OP(FutileWakeups, inc());
>>>>>> 747 OM_PERFDATA_OP(FutileWakeups, inc());
>>>>>>
>>>>>>
>>>>>>> Any interest in asserting that you're at a safepoint in PerfDataManager::destroy? Just a thought.
>>>>>> I'd rather not add an assert() at this time.
>>>>>>
>>>>>> Are you good with the above comment additions? Do you need to
>>>>>> see another webrev when I make those changes?
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>>
>>>>>>> Tom
>>>>>>>
>>>>>>> On 9/2/2015 11:15 AM, Daniel D. Daugherty wrote:
>>>>>>>> On 9/2/15 8:49 AM, Tom Benson wrote:
>>>>>>>>> Hi,
>>>>>>>>> I'm a bit confused on one point... Since you now only call PerfDataManager::destroy at a safepoint, why do you still have the comment about 'the race' and the sleep?
>>>>>>>> Because the two "futile wakeup" counter updates in the monitor
>>>>>>>> subsystem can execute in parallel with a safepoint. The JavaThread
>>>>>>>> state is "blocked" so the safepoint subsystem will see the JavaThread
>>>>>>>> as "at a safepoint" when it is actually executing the code to
>>>>>>>> increment the counter.
>>>>>>>>
>>>>>>>> That's what the "is_safe" parameter to the OM_PERFDATA_OP macro was
>>>>>>>> all about in the round 1 code review. However, David convinced me
>>>>>>>> that all that logic didn't guarantee we wouldn't hit the race so
>>>>>>>> I ripped it all out in the round 2 code review (this one).
>>>>>>>>
>>>>>>>> Does this help your confusion?
>>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>>
>>>>>>>>> Tom
>>>>>>>>>
>>>>>>>>> On 9/2/2015 7:52 AM, Daniel D. Daugherty wrote:
>>>>>>>>>> David,
>>>>>>>>>>
>>>>>>>>>> Thanks for the very fast re-review!
>>>>>>>>>>
>>>>>>>>>> Enjoy your vacation!
>>>>>>>>>>
>>>>>>>>>> Dan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 9/2/15 2:54 AM, David Holmes wrote:
>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>
>>>>>>>>>>> On 2/09/2015 2:45 PM, Daniel D. Daugherty wrote:
>>>>>>>>>>>> Greetings,
>>>>>>>>>>>>
>>>>>>>>>>>> I've updated the "fix" for this bug based on code review comments
>>>>>>>>>>>> received in round 1. I've dropped most of the changes from round 1
>>>>>>>>>>>> with a couple of exceptions.
>>>>>>>>>>> I have no further comments - it all looks good to me. If others want the pendulum to swing back a little from this position then ... nothing that has been suggested is functionally wrong. :)
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> David
>>>>>>>>>>> -----
>>>>>>>>>>>
>>>>>>>>>>> PS. When you get back from vacation I'll be gone for a month. That gives you a large window to push other things through with less stress ;-)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>>>>>>>
>>>>>>>>>>>> Webrev URL: http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/
>>>>>>>>>>>>
>>>>>>>>>>>> The easiest way to re-review is to download the two patch files
>>>>>>>>>>>> (round 0 and round 2) and view them in your favorite file merge tool:
>>>>>>>>>>>>
>>>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch
>>>>>>>>>>>>
>>>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/hotspot.patch
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>>>>>>>>>>> Aurora Adhoc vm.tmtools batch (in process)
>>>>>>>>>>>> Kim's repro sequence for JDK-8049304
>>>>>>>>>>>> Kim's repro sequence for JDK-8129978
>>>>>>>>>>>> JPRT -testset hotspot
>>>>>>>>>>>>
>>>>>>>>>>>> Changes between round 0 and round 2:
>>>>>>>>>>>>
>>>>>>>>>>>> - clarify a few comments
>>>>>>>>>>>> - init _has_PerfData flag with '0' (instead of false)
>>>>>>>>>>>> - drop unnecessary use OrderAccess::release_store() to set
>>>>>>>>>>>> _has_PerfData to '1' (we're in a Mutex)
>>>>>>>>>>>> - change perfMemory_exit() to only call PerfDataManager::destroy()
>>>>>>>>>>>> when called at a safepoint and when the StatSampler is not
>>>>>>>>>>>> running; this means when the VM is aborting, we no longer have
>>>>>>>>>>>> a race between the original crash report and this code path.
>>>>>>>>>>>>
>>>>>>>>>>>> Changes between round 1 and round 2:
>>>>>>>>>>>>
>>>>>>>>>>>> - clarify a few comments
>>>>>>>>>>>> - drop is_safe parameter to OM_PERFDATA_OP macro
>>>>>>>>>>>> - init _has_PerfData flag with '0' (instead of false)
>>>>>>>>>>>> - drop OrderAccess::fence() call before os::naked_short_sleep() call
>>>>>>>>>>>> - drop PerfDataManager::has_PerfData_with_acquire()
>>>>>>>>>>>> - drop unnecessary use OrderAccess::release_store() to set
>>>>>>>>>>>> _has_PerfData to '1' (we're in a Mutex)
>>>>>>>>>>>>
>>>>>>>>>>>> I believe that I've addressed all comments from round 0 and
>>>>>>>>>>>> from round 1.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>>>>>>>
>>>>>>>>>>>> Dan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 8/31/15 4:51 PM, Daniel D. Daugherty wrote:
>>>>>>>>>>>>> Greetings,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've updated the "fix" for this bug based on code review comments
>>>>>>>>>>>>> received in round 0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>>>>>>>>
>>>>>>>>>>>>> Webrev URL:
>>>>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/
>>>>>>>>>>>>>
>>>>>>>>>>>>> The easiest way to re-review is to download the two patch files
>>>>>>>>>>>>> and view them in your favorite file merge tool:
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch
>>>>>>>>>>>>>
>>>>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/hotspot.patch
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>>>>>>>>>>>> Aurora Adhoc vm.tmtools batch (in process)
>>>>>>>>>>>>> Kim's repro sequence for JDK-8049304
>>>>>>>>>>>>> Kim's repro sequence for JDK-8129978
>>>>>>>>>>>>> JPRT -testset hotspot
>>>>>>>>>>>>>
>>>>>>>>>>>>> Changes between round 0 and round 1:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - add an 'is_safe' parameter to the OM_PERFDATA_OP macro;
>>>>>>>>>>>>> safepoint-safe callers can access _has_PerfData flag directly;
>>>>>>>>>>>>> non-safepoint-safe callers use a load-acquire to fetch the
>>>>>>>>>>>>> current _has_PerfData flag value
>>>>>>>>>>>>> - change PerfDataManager::destroy() to simply set _has_PerfData
>>>>>>>>>>>>> to zero (field is volatile) and then use a fence() to prevent
>>>>>>>>>>>>> any reordering of operations in any direction; it's only done
>>>>>>>>>>>>> once during VM shutdown so...
>>>>>>>>>>>>> - change perfMemory_exit() to only call PerfDataManager::destroy()
>>>>>>>>>>>>> when called at a safepoint and when the StatSample is not
>>>>>>>>>>>>> running; this means when the VM is aborting, we no longer have
>>>>>>>>>>>>> a race between the original crash report and this code path.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I believe that I've addressed all comments from round 0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 8/25/15 3:08 PM, Daniel D. Daugherty wrote:
>>>>>>>>>>>>>> Greetings,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a "fix" for a long standing race between JVM shutdown and the
>>>>>>>>>>>>>> JVM statistics subsystem:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Webrev URL:
>>>>>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch
>>>>>>>>>>>>>> Aurora Adhoc vm.tmtools batch
>>>>>>>>>>>>>> Kim's repro sequence for JDK-8049304
>>>>>>>>>>>>>> Kim's repro sequence for JDK-8129978
>>>>>>>>>>>>>> JPRT -testset hotspot
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This "fix":
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - adds a volatile flag to record whether PerfDataManager is holding
>>>>>>>>>>>>>> data (PerfData objects)
>>>>>>>>>>>>>> - adds PerfDataManager::has_PerfData() to return the flag
>>>>>>>>>>>>>> - changes the Java monitor subsystem's use of PerfData to
>>>>>>>>>>>>>> check both allocation of the monitor subsystem specific
>>>>>>>>>>>>>> PerfData object and the new PerfDataManager::has_PerfData()
>>>>>>>>>>>>>> return value
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If the global 'UsePerfData' option is false, the system works as
>>>>>>>>>>>>>> it did before. If 'UsePerfData' is true (the default on non-embedded
>>>>>>>>>>>>>> systems), the Java monitor subsystem will allocate a number of
>>>>>>>>>>>>>> PerfData objects to record information. The objects will record
>>>>>>>>>>>>>> information about Java monitor subsystem until the JVM shuts down.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When the JVM starts to shutdown, the new PerfDataManager flag will
>>>>>>>>>>>>>> change to false and the Java monitor subsystem will stop using the
>>>>>>>>>>>>>> PerfData objects. This is the new behavior. As noted in the comments
>>>>>>>>>>>>>> I added to the code, the race is still present; I'm just changing
>>>>>>>>>>>>>> the order and the timing to reduce the likelihood of the crash.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
More information about the hotspot-runtime-dev
mailing list