RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Wed Sep 2 16:49:59 UTC 2015

On 02/09/2015 18:21, Daniel D. Daugherty wrote:
> Bertrand,
>
> Welcome to the party!  Replies embedded below...
>
>
> On 9/2/15 10:05 AM, Bertrand Delsart wrote:
>> On 02/09/2015 17:15, Daniel D. Daugherty wrote:
>>> On 9/2/15 8:49 AM, Tom Benson wrote:
>>>> Hi,
>>>> I'm a bit confused on one point... Since you now only call
>>>> PerfDataManager::destroy at a safepoint, why do you still have the
>>>> comment about 'the race'  and the sleep?
>>>
>>> Because the two "futile wakeup" counter updates in the monitor
>>> subsystem can execute in parallel with a safepoint. The JavaThread
>>> state is "blocked" so the safepoint subsystem will see the JavaThread
>>> as "at a safepoint" when it is actually executing the code to
>>> increment the counter.
>>>
>>> That's what the "is_safe" parameter to the OM_PERFDATA_OP macro was
>>> all about in the round 1 code review. However, David convinced me
>>> that all that logic didn't guarantee we wouldn't hit the race so
>>> I ripped it all out in the round 2 code review (this one).
>>
>> I did not look at round 1 but I agree that it would be hard to find a
>> logic that guarantees we wouldn't hit the race.
>
> Agreed. We would have to add synchronization and we don't want to
> do that due to the impact on all PerfData usage.
>
>
>> The proposed changeset is a good step toward reducing its likelihood.
>> Even more importantly, what I do like in the proposed webrev is the
>> fact that all the logic is now mostly in a macro + the destroy code,
>> which we can easily refine later.
>
> Thanks.
>
>
>> This is IMHO a good step forward (and a nice work) and the reason why
>> I would approve the changeset as is.
>
> Thanks and I'll list you as a reviewer.
>
>
>> One non mandatory improvement is to further reduce the likelihood of
>> the bug by doing the PerfDataManager::destroy() in two steps. Once
>> executed ASAP in the shutdown process to set _has_PerfData to 0 and
>> one executed as late as possible to perform the cleanup we delayed.
>> That may be safer than relying only on the 'os::naked_short_sleep(1);'
>> (which we can still put at the beginning of the delayed method).
>
> I think we bounced that idea around a bit in off-thread discussions
> before I really spun up on this bug. I believe the conclusion was
> that would add more confusion to the VM exit/shutdown procedure and
> it's complicated enough as it is...
>
> Are you good with this changeset modulo the comment additions I
> just posted in my most recent reply to Tom?

Approved.

Bertrand (not a Reviewer)

>
> Dan
>
>
>>
>> Regards,
>>
>> Bertrand.
>>
>>>
>>> Does this help your confusion?
>>>
>>> Dan
>>>
>>>
>>>> Tom
>>>>
>>>> On 9/2/2015 7:52 AM, Daniel D. Daugherty wrote:
>>>>> David,
>>>>>
>>>>> Thanks for the very fast re-review!
>>>>>
>>>>> Enjoy your vacation!
>>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>> On 9/2/15 2:54 AM, David Holmes wrote:
>>>>>> Hi Dan,
>>>>>>
>>>>>> On 2/09/2015 2:45 PM, Daniel D. Daugherty wrote:
>>>>>>> Greetings,
>>>>>>>
>>>>>>> I've updated the "fix" for this bug based on code review comments
>>>>>>> received in round 1. I've dropped most of the changes from round 1
>>>>>>> with a couple of exceptions.
>>>>>>
>>>>>> I have no further comments - it all looks good to me. If others want
>>>>>> the pendulum to swing back a little from this position then ...
>>>>>> nothing that has been suggested is functionally wrong. :)
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>> -----
>>>>>>
>>>>>> PS. When you get back from vacation I'll be gone for a month. That
>>>>>> gives you a large window to push other things through with less
>>>>>> stress ;-)
>>>>>>
>>>>>>
>>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>>
>>>>>>> Webrev URL:
>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/
>>>>>>>
>>>>>>> The easiest way to re-review is to download the two patch files
>>>>>>> (round 0 and round 2) and view them in your favorite file merge
>>>>>>> tool:
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/2-jdk9-hs-rt/hotspot.patch
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>>>>>>           Aurora Adhoc vm.tmtools batch (in process)
>>>>>>>           Kim's repro sequence for JDK-8049304
>>>>>>>           Kim's repro sequence for JDK-8129978
>>>>>>>           JPRT -testset hotspot
>>>>>>>
>>>>>>> Changes between round 0 and round 2:
>>>>>>>
>>>>>>> - clarify a few comments
>>>>>>> - init _has_PerfData flag with '0' (instead of false)
>>>>>>> - drop unnecessary use OrderAccess::release_store() to set
>>>>>>>    _has_PerfData to '1' (we're in a Mutex)
>>>>>>> - change perfMemory_exit() to only call PerfDataManager::destroy()
>>>>>>>    when called at a safepoint and when the StatSampler is not
>>>>>>>    running; this means when the VM is aborting, we no longer have
>>>>>>>    a race between the original crash report and this code path.
>>>>>>>
>>>>>>> Changes between round 1 and round 2:
>>>>>>>
>>>>>>> - clarify a few comments
>>>>>>> - drop is_safe parameter to OM_PERFDATA_OP macro
>>>>>>> - init _has_PerfData flag with '0' (instead of false)
>>>>>>> - drop OrderAccess::fence() call before os::naked_short_sleep() call
>>>>>>> - drop PerfDataManager::has_PerfData_with_acquire()
>>>>>>> - drop unnecessary use OrderAccess::release_store() to set
>>>>>>>    _has_PerfData to '1' (we're in a Mutex)
>>>>>>>
>>>>>>> I believe that I've addressed all comments from round 0 and
>>>>>>> from round 1.
>>>>>>>
>>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>>
>>>>>>> Dan
>>>>>>>
>>>>>>>
>>>>>>> On 8/31/15 4:51 PM, Daniel D. Daugherty wrote:
>>>>>>>> Greetings,
>>>>>>>>
>>>>>>>> I've updated the "fix" for this bug based on code review comments
>>>>>>>> received in round 0.
>>>>>>>>
>>>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>>>
>>>>>>>> Webrev URL:
>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/
>>>>>>>>
>>>>>>>> The easiest way to re-review is to download the two patch files
>>>>>>>> and view them in your favorite file merge tool:
>>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/hotspot.patch
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>>>>>>>          Aurora Adhoc vm.tmtools batch (in process)
>>>>>>>>          Kim's repro sequence for JDK-8049304
>>>>>>>>          Kim's repro sequence for JDK-8129978
>>>>>>>>          JPRT -testset hotspot
>>>>>>>>
>>>>>>>> Changes between round 0 and round 1:
>>>>>>>>
>>>>>>>> - add an 'is_safe' parameter to the OM_PERFDATA_OP macro;
>>>>>>>>   safepoint-safe callers can access _has_PerfData flag directly;
>>>>>>>>   non-safepoint-safe callers use a load-acquire to fetch the
>>>>>>>>   current _has_PerfData flag value
>>>>>>>> - change PerfDataManager::destroy() to simply set _has_PerfData
>>>>>>>>   to zero (field is volatile) and then use a fence() to prevent
>>>>>>>>   any reordering of operations in any direction; it's only done
>>>>>>>>   once during VM shutdown so...
>>>>>>>> - change perfMemory_exit() to only call PerfDataManager::destroy()
>>>>>>>>   when called at a safepoint and when the StatSample is not
>>>>>>>>   running; this means when the VM is aborting, we no longer have
>>>>>>>>   a race between the original crash report and this code path.
>>>>>>>>
>>>>>>>> I believe that I've addressed all comments from round 0.
>>>>>>>>
>>>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/25/15 3:08 PM, Daniel D. Daugherty wrote:
>>>>>>>>> Greetings,
>>>>>>>>>
>>>>>>>>> I have a "fix" for a long standing race between JVM shutdown
>>>>>>>>> and the
>>>>>>>>> JVM statistics subsystem:
>>>>>>>>>
>>>>>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>>>>>
>>>>>>>>> Webrev URL:
>>>>>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
>>>>>>>>>
>>>>>>>>> Testing: Aurora Adhoc RT-SVC nightly batch
>>>>>>>>>          Aurora Adhoc vm.tmtools batch
>>>>>>>>>          Kim's repro sequence for JDK-8049304
>>>>>>>>>          Kim's repro sequence for JDK-8129978
>>>>>>>>>          JPRT -testset hotspot
>>>>>>>>>
>>>>>>>>> This "fix":
>>>>>>>>>
>>>>>>>>> - adds a volatile flag to record whether PerfDataManager is
>>>>>>>>> holding
>>>>>>>>>   data (PerfData objects)
>>>>>>>>> - adds PerfDataManager::has_PerfData() to return the flag
>>>>>>>>> - changes the Java monitor subsystem's use of PerfData to
>>>>>>>>>   check both allocation of the monitor subsystem specific
>>>>>>>>>   PerfData object and the new PerfDataManager::has_PerfData()
>>>>>>>>>   return value
>>>>>>>>>
>>>>>>>>> If the global 'UsePerfData' option is false, the system works as
>>>>>>>>> it did before. If 'UsePerfData' is true (the default on
>>>>>>>>> non-embedded
>>>>>>>>> systems), the Java monitor subsystem will allocate a number of
>>>>>>>>> PerfData objects to record information. The objects will record
>>>>>>>>> information about Java monitor subsystem until the JVM shuts down.
>>>>>>>>>
>>>>>>>>> When the JVM starts to shutdown, the new PerfDataManager flag will
>>>>>>>>> change to false and the Java monitor subsystem will stop using the
>>>>>>>>> PerfData objects. This is the new behavior. As noted in the
>>>>>>>>> comments
>>>>>>>>> I added to the code, the race is still present; I'm just changing
>>>>>>>>> the order and the timing to reduce the likelihood of the crash.
>>>>>>>>>
>>>>>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>>>>>
>>>>>>>>> Dan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>>
>

-- 
Bertrand Delsart,                     Grenoble Engineering Center
Oracle,         180 av. de l'Europe,          ZIRST de Montbonnot
38330 Montbonnot Saint Martin,                             FRANCE
bertrand.delsart at oracle.com             Phone : +33 4 76 18 81 23

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NOTICE: This email message is for the sole use of the intended
recipient(s) and may contain confidential and privileged
information. Any unauthorized review, use, disclosure or
distribution is prohibited. If you are not the intended recipient,
please contact the sender by reply email and destroy all copies of
the original message.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~