RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Wed Sep 2 16:51:59 UTC 2015

On 9/2/15 10:49 AM, Bertrand Delsart wrote:
> On 02/09/2015 18:21, Daniel D. Daugherty wrote:
>> Bertrand,
>>
>> Welcome to the party!  Replies embedded below...
>>
>>
>> On 9/2/15 10:05 AM, Bertrand Delsart wrote:
>>> On 02/09/2015 17:15, Daniel D. Daugherty wrote:
>>>> On 9/2/15 8:49 AM, Tom Benson wrote:
>>>>> Hi,
>>>>> I'm a bit confused on one point... Since you now only call
>>>>> PerfDataManager::destroy at a safepoint, why do you still have the
>>>>> comment about 'the race'  and the sleep?
>>>>
>>>> Because the two "futile wakeup" counter updates in the monitor
>>>> subsystem can execute in parallel with a safepoint. The JavaThread
>>>> state is "blocked" so the safepoint subsystem will see the JavaThread
>>>> as "at a safepoint" when it is actually executing the code to
>>>> increment the counter.
>>>>
>>>> That's what the "is_safe" parameter to the OM_PERFDATA_OP macro was
>>>> all about in the round 1 code review. However, David convinced me
>>>> that all that logic didn't guarantee we wouldn't hit the race so
>>>> I ripped it all out in the round 2 code review (this one).
>>>
>>> I did not look at round 1 but I agree that it would be hard to find a
>>> logic that guarantees we wouldn't hit the race.
>>
>> Agreed. We would have to add synchronization and we don't want to
>> do that due to the impact on all PerfData usage.
>>
>>
>>> The proposed changeset is a good step toward reducing its likelihood.
>>> Even more importantly, what I do like in the proposed webrev is the
>>> fact that all the logic is now mostly in a macro + the destroy code,
>>> which we can easily refine later.
>>
>> Thanks.
>>
>>
>>> This is IMHO a good step forward (and a nice work) and the reason why
>>> I would approve the changeset as is.
>>
>> Thanks and I'll list you as a reviewer.
>>
>>
>>> One non mandatory improvement is to further reduce the likelihood of
>>> the bug by doing the PerfDataManager::destroy() in two steps. Once
>>> executed ASAP in the shutdown process to set _has_PerfData to 0 and
>>> one executed as late as possible to perform the cleanup we delayed.
>>> That may be safer than relying only on the 'os::naked_short_sleep(1);'
>>> (which we can still put at the beginning of the delayed method).
>>
>> I think we bounced that idea around a bit in off-thread discussions
>> before I really spun up on this bug. I believe the conclusion was
>> that would add more confusion to the VM exit/shutdown procedure and
>> it's complicated enough as it is...
>>
>> Are you good with this changeset modulo the comment additions I
>> just posted in my most recent reply to Tom?
>
> Approved.

Thanks!

Dan

>
> Bertrand (not a Reviewer)
>
>>
>> Dan
>>
>>
>>>
>>> Regards,
>>>
>>> Bertrand.
>>>
>>>>
>>>> Does this help your confusion?
>>>>
>>>> Dan
>>>>
>>>>
>>>>> Tom
>>>>>
>>>>> On 9/2/2015 7:52 AM, Daniel D. Daugherty wrote:
>>>>>> David,
>>>>>>
>>>>>> Thanks for the very fast re-review!
>>>>>>
>>>>>> Enjoy your vacation!
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>>
>>>>>> On 9/2/15 2:54 AM, David Holmes wrote:
>>>>>>> Hi Dan,
>>>>>>>
>>>>>>> On 2/09/2015 2:45 PM, Daniel D. Daugherty wrote:
>>>>>>>> Greetings,
>>>>>>>>
>>>>>>>> I've updated the "fix" for this bug based on code review comments
>>>>>>>> received in round 1. I've dropped most of the changes from round 1
>>>>>>>> with a couple of exceptions.
>>>>>>>
>>>>>>> I have no further comments - it all looks good to me. If others 
>>>>>>> want
>>>>>>> the pendulum to swing back a little from this position then ...
>>>>>>> nothing that has been suggested is functionally wrong. :)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>> -----
>>>>>>>
>>>>>>> PS. When you get back from vacation I'll be gone for a month. That
>>>>>>> gives you a large window to push other things through with less
>>>>>>> stress ;-)
>>>>>>>
>>>>>>>
>>>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>>>
>>>>>>>> Webrev URL:
>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/
>>>>>>>>
>>>>>>>> The easiest way to re-review is to download the two patch files
>>>>>>>> (round 0 and round 2) and view them in your favorite file merge
>>>>>>>> tool:
>>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch 
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/hotspot.patch 
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>>>>>>>           Aurora Adhoc vm.tmtools batch (in process)
>>>>>>>>           Kim's repro sequence for JDK-8049304
>>>>>>>>           Kim's repro sequence for JDK-8129978
>>>>>>>>           JPRT -testset hotspot
>>>>>>>>
>>>>>>>> Changes between round 0 and round 2:
>>>>>>>>
>>>>>>>> - clarify a few comments
>>>>>>>> - init _has_PerfData flag with '0' (instead of false)
>>>>>>>> - drop unnecessary use OrderAccess::release_store() to set
>>>>>>>>    _has_PerfData to '1' (we're in a Mutex)
>>>>>>>> - change perfMemory_exit() to only call PerfDataManager::destroy()
>>>>>>>>    when called at a safepoint and when the StatSampler is not
>>>>>>>>    running; this means when the VM is aborting, we no longer have
>>>>>>>>    a race between the original crash report and this code path.
>>>>>>>>
>>>>>>>> Changes between round 1 and round 2:
>>>>>>>>
>>>>>>>> - clarify a few comments
>>>>>>>> - drop is_safe parameter to OM_PERFDATA_OP macro
>>>>>>>> - init _has_PerfData flag with '0' (instead of false)
>>>>>>>> - drop OrderAccess::fence() call before os::naked_short_sleep() 
>>>>>>>> call
>>>>>>>> - drop PerfDataManager::has_PerfData_with_acquire()
>>>>>>>> - drop unnecessary use OrderAccess::release_store() to set
>>>>>>>>    _has_PerfData to '1' (we're in a Mutex)
>>>>>>>>
>>>>>>>> I believe that I've addressed all comments from round 0 and
>>>>>>>> from round 1.
>>>>>>>>
>>>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/31/15 4:51 PM, Daniel D. Daugherty wrote:
>>>>>>>>> Greetings,
>>>>>>>>>
>>>>>>>>> I've updated the "fix" for this bug based on code review comments
>>>>>>>>> received in round 0.
>>>>>>>>>
>>>>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>>>>
>>>>>>>>> Webrev URL:
>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/
>>>>>>>>>
>>>>>>>>> The easiest way to re-review is to download the two patch files
>>>>>>>>> and view them in your favorite file merge tool:
>>>>>>>>>
>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch 
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/hotspot.patch 
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>>>>>>>>          Aurora Adhoc vm.tmtools batch (in process)
>>>>>>>>>          Kim's repro sequence for JDK-8049304
>>>>>>>>>          Kim's repro sequence for JDK-8129978
>>>>>>>>>          JPRT -testset hotspot
>>>>>>>>>
>>>>>>>>> Changes between round 0 and round 1:
>>>>>>>>>
>>>>>>>>> - add an 'is_safe' parameter to the OM_PERFDATA_OP macro;
>>>>>>>>>   safepoint-safe callers can access _has_PerfData flag directly;
>>>>>>>>>   non-safepoint-safe callers use a load-acquire to fetch the
>>>>>>>>>   current _has_PerfData flag value
>>>>>>>>> - change PerfDataManager::destroy() to simply set _has_PerfData
>>>>>>>>>   to zero (field is volatile) and then use a fence() to prevent
>>>>>>>>>   any reordering of operations in any direction; it's only done
>>>>>>>>>   once during VM shutdown so...
>>>>>>>>> - change perfMemory_exit() to only call 
>>>>>>>>> PerfDataManager::destroy()
>>>>>>>>>   when called at a safepoint and when the StatSample is not
>>>>>>>>>   running; this means when the VM is aborting, we no longer have
>>>>>>>>>   a race between the original crash report and this code path.
>>>>>>>>>
>>>>>>>>> I believe that I've addressed all comments from round 0.
>>>>>>>>>
>>>>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>>>>
>>>>>>>>> Dan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 8/25/15 3:08 PM, Daniel D. Daugherty wrote:
>>>>>>>>>> Greetings,
>>>>>>>>>>
>>>>>>>>>> I have a "fix" for a long standing race between JVM shutdown
>>>>>>>>>> and the
>>>>>>>>>> JVM statistics subsystem:
>>>>>>>>>>
>>>>>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>>>>>
>>>>>>>>>> Webrev URL:
>>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
>>>>>>>>>>
>>>>>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch
>>>>>>>>>>          Aurora Adhoc vm.tmtools batch
>>>>>>>>>>          Kim's repro sequence for JDK-8049304
>>>>>>>>>>          Kim's repro sequence for JDK-8129978
>>>>>>>>>>          JPRT -testset hotspot
>>>>>>>>>>
>>>>>>>>>> This "fix":
>>>>>>>>>>
>>>>>>>>>> - adds a volatile flag to record whether PerfDataManager is
>>>>>>>>>> holding
>>>>>>>>>>   data (PerfData objects)
>>>>>>>>>> - adds PerfDataManager::has_PerfData() to return the flag
>>>>>>>>>> - changes the Java monitor subsystem's use of PerfData to
>>>>>>>>>>   check both allocation of the monitor subsystem specific
>>>>>>>>>>   PerfData object and the new PerfDataManager::has_PerfData()
>>>>>>>>>>   return value
>>>>>>>>>>
>>>>>>>>>> If the global 'UsePerfData' option is false, the system works as
>>>>>>>>>> it did before. If 'UsePerfData' is true (the default on
>>>>>>>>>> non-embedded
>>>>>>>>>> systems), the Java monitor subsystem will allocate a number of
>>>>>>>>>> PerfData objects to record information. The objects will record
>>>>>>>>>> information about Java monitor subsystem until the JVM shuts 
>>>>>>>>>> down.
>>>>>>>>>>
>>>>>>>>>> When the JVM starts to shutdown, the new PerfDataManager flag 
>>>>>>>>>> will
>>>>>>>>>> change to false and the Java monitor subsystem will stop 
>>>>>>>>>> using the
>>>>>>>>>> PerfData objects. This is the new behavior. As noted in the
>>>>>>>>>> comments
>>>>>>>>>> I added to the code, the race is still present; I'm just 
>>>>>>>>>> changing
>>>>>>>>>> the order and the timing to reduce the likelihood of the crash.
>>>>>>>>>>
>>>>>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>>>>>
>>>>>>>>>> Dan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>