RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()
David Holmes
david.holmes at oracle.com
Wed Sep 2 01:43:33 UTC 2015
Hi Dan,
On 2/09/2015 2:56 AM, Daniel D. Daugherty wrote:
> On 9/1/15 8:49 AM, Daniel D. Daugherty wrote:
>> On 9/1/15 12:13 AM, David Holmes wrote:
>>> Hi Dan,
>>>
>>> Hope you had that coffee :)
>>
>> Just got that second cup... :-)
You might need to upgrade to something stronger :)
>>>
>>> On 1/09/2015 8:51 AM, Daniel D. Daugherty wrote:
>>>> Greetings,
>>>>
>>>> I've updated the "fix" for this bug based on code review comments
>>>> received in round 0.
>>>>
>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>
>>>> Webrev URL:
>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/
>>>>
>>>> The easiest way to re-review is to download the two patch files
>>>> and view them in your favorite file merge tool:
>>>>
>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch
>>>>
>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/hotspot.patch
>>>>
>>>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>>>          Aurora Adhoc vm.tmtools batch (in process)
>>>>          Kim's repro sequence for JDK-8049304
>>>>          Kim's repro sequence for JDK-8129978
>>>>          JPRT -testset hotspot
>>>>
>>>> Changes between round 0 and round 1:
>>>>
>>>> - add an 'is_safe' parameter to the OM_PERFDATA_OP macro;
>>>
>>> The sense of safepoint-safe is a bit confusing to me. If the thread
>>> is in a safepoint-safe state then it seems safepoint-safe==false ??
>>
>> I went back and forth on the naming of this parameter. When 'is_safe'
>> is true, the calling thread can safely and directly check has_PerfData
>> because we are not at a safepoint, so we can't be racing with a normal
>> exit. In a normal exit, has_PerfData is only changed (and the memory
>> freed) at a safepoint.
>>
>> When 'is_safe' is false, the code path we're on could be executing
>> in parallel with a safepoint, so we need to be more careful when
>> accessing the has_PerfData flag...
>>
>> If you have a better name for the parameter...
I think not referring to it as "safepoint-safe" at the call sites would
help, but as per below I think is-safe is completely unnecessary.
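For concreteness, a round-1 call site would look something like this
(counter chosen arbitrarily for illustration; this isn't quoting the
webrev):

  OM_PERFDATA_OP(FutileWakeups, inc(), false /* is_safe */);

and it's that bare true/false third argument at every call site that
I'd rather not see described in terms of "safepoint-safe".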
>>>
>>>> safepoint-safe callers can access _has_PerfData flag directly;
>>>> non-safepoint-safe callers use a load-acquire to fetch the
>>>> current _has_PerfData flag value
>>>
>>> load-acquire makes no difference to the "currentness" of
>>> _has_PerfData and has no effect on the size of the "race window". So
>>> I don't see any point to this is-safe distinction.
>>
>> Hmmm... I think I got this idea when I re-read orderAccess.hpp where
>> it talks about release-store and load-acquire pairs. I originally
>> had the transition of _has_PerfData from '1' -> '0' done by a
>> release_store() so I was using a load_acquire() as part of the
>> pair. I've since switched to a direct setting of "_has_PerfData = 0"
>> followed by the big hammer fence(), but I think the load_acquire()
>> still pairs with that fence().
>
> I rewrote the comment above the macro:
>
> $ hg diff src/share/vm/runtime/objectMonitor.hpp
> diff -r efc17f03e5d4 src/share/vm/runtime/objectMonitor.hpp
> --- a/src/share/vm/runtime/objectMonitor.hpp Thu Aug 20 10:58:57 2015 -0700
> +++ b/src/share/vm/runtime/objectMonitor.hpp Tue Sep 01 09:54:53 2015 -0700
> @@ -177,6 +177,33 @@ class ObjectMonitor {
>
> public:
> static void Initialize();
> +
> + // Only perform a PerfData operation if the PerfData object has been
> + // allocated and if the PerfDataManager has not freed the PerfData
> + // objects which can happen at normal VM shutdown. If is_safe == true
> + // then the calling context does not execute in parallel with a
> + // safepoint and we can simply query the flag. Otherwise, the calling
> + // context may be executing in parallel with a safepoint so we use
> + // OrderAccess::load_acquire() to pair with the code that may be
> + // setting the has_PerfData flag to false.
The whole point of the release/acquire pairing is to ensure that when
you do the load_acquire and see the magic value (be it 1 or 0), then
everything prior to the release_store that set that magic value is also
visible. But here, when we set _has_PerfData=0, there is nothing prior
to that store that the code checking _has_PerfData==0 cares about. The
normal usage pattern has the form:
writer:
  initialize_data();
  release_store(dataReady, true);

reader:
  if (load_acquire(dataReady) == true)
    safely_use_data();
but that isn't the pattern here, so the load_acquire (and the fence())
serve no actual purpose. Hence the whole is-safe distinction serves no
purpose either, in my opinion.
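(Spelling that usage pattern out with the OrderAccess API, just for the
archives - initialize_data(), use_data() and _data_ready are
illustrative names, not anything from the webrev:)

  // writer: plain stores to the data, then publish with a release store
  initialize_data();
  OrderAccess::release_store(&_data_ready, 1);

  // reader: the acquiring load pairs with the release store above
  if (OrderAccess::load_acquire(&_data_ready) == 1) {
    use_data();  // guaranteed to see everything initialize_data() wrote
  }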
The reason I object more strongly to the fence() than to the
release_store() is that use of fence() seems (to me) a stronger
statement of requirement: "hey, we really do need this full
bidirectional fence". Further, I originally hadn't realized what Tom(?)
pointed out: the release_store puts the barrier before the store, and
we want the barrier after the store. Then followed the discussion that
what we need is storeload|storestore to maintain the ordering (ignoring
the os::sleep()). But that has somehow morphed into a full fence() -
which I don't see the need for.
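To make the barrier-placement point concrete (paraphrasing with
OrderAccess, not quoting the webrev):

  // release_store: the barrier sits *before* the store
  OrderAccess::release();  // loadstore|storestore
  _has_PerfData = 0;

  // what the shutdown path wants is ordering *after* the store,
  // i.e. storeload|storestore - a full fence() is stronger than that
  _has_PerfData = 0;
  OrderAccess::fence();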
Cheers,
David
-----
> + //
> +   #define OM_PERFDATA_OP(f, op_str, is_safe)                    \
> +     do {                                                        \
> +       if (ObjectMonitor::_sync_ ## f != NULL) {                 \
> +         bool do_the_op = false;                                 \
> +         if (is_safe) {                                          \
> +           if (PerfDataManager::has_PerfData()) {                \
> +             do_the_op = true;                                   \
> +           }                                                     \
> +         } else if (PerfDataManager::has_PerfData_with_acquire()) { \
> +           do_the_op = true;                                     \
> +         }                                                       \
> +         if (do_the_op) {                                        \
> +           ObjectMonitor::_sync_ ## f->op_str;                   \
> +         }                                                       \
> +       }                                                         \
> +     } while (0)
> +
> static PerfCounter * _sync_ContendedLockAttempts;
> static PerfCounter * _sync_FutileWakeups;
> static PerfCounter * _sync_Parks;
>
>
>
> Hopefully, you like this version better?
>
> Dan
>
>
>>
>>
>>>
>>>> - change PerfDataManager::destroy() to simply set _has_PerfData
>>>> to zero (field is volatile) and then use a fence() to prevent
>>>> any reordering of operations in any direction; it's only done
>>>> once during VM shutdown so...
>>>
>>> See previous email :)
>>
>> Yup. I did and I'd still like to keep it for the reasons stated
>> in my reply to the previous e-mail. You were good with this code
>> as a release_store() of '0' so I kind of expected you to be OK
>> with the stronger fence().
>>
>>
>>>
>>>> - change perfMemory_exit() to only call PerfDataManager::destroy()
>>>> when called at a safepoint and when the StatSampler is not
>>>> running; this means that when the VM is aborting, we no longer
>>>> have a race between the original crash report and this code path.
>>>
>>> Sounds reasonable.
>>
>> Thanks. I hope it makes Kim happy also.
>>
>>
>>>
>>> I have few specific comments/nits:
>>>
>>> src/share/vm/runtime/perfData.cpp
>>>
>>> 43 volatile jint PerfDataManager::_has_PerfData = false;
>>>
>>> Nit: jint should be initialized with 0 not false.
>>
>> Will fix; oops...that made it past round 0...
>>
>>
>>>
>>> 287 // manipulation before we free the memory. The two alternatives
>>> 288 // are 1) leak the PerfData memory or 2) do some form of ordered
>>> 289 // access before every PerfData operation.
>>>
>>> No amount of ordered access will fix the race - only synchronization
>>> of some form.
>>
>> I thought I had changed that comment. Will fix.
>>
>>
>>>
>>> 315 void PerfDataManager::add_item(PerfData* p, bool sampled) {
>>> 316
>>> 317   MutexLocker ml(PerfDataManager_lock);
>>> 318
>>> 319   if (_all == NULL) {
>>> 320     _all = new PerfDataList(100);
>>> 321     OrderAccess::release_store(&_has_PerfData, 1);
>>> 322   }
>>>
>>> I can't see any purpose for a release_store here. You have one-time
>>> initialization code guarded with a mutex. A release_store adds
>>> nothing here - other than future head-scratching from maintainers of
>>> this code.
>>
>> Agreed. Not sure how I missed the fact that we were in the
>> locked block. oops...that made it past round 0...
>>
>> Thanks again for the review. Will generate another webrev...
>>
>> Dan
>>
>>
>>>
>>> Thanks,
>>> David
>>>
>>>> I believe that I've addressed all comments from round 0.
>>>>
>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>
>>>> Dan
>>>>
>>>>
>>>> On 8/25/15 3:08 PM, Daniel D. Daugherty wrote:
>>>>> Greetings,
>>>>>
>>>>> I have a "fix" for a long standing race between JVM shutdown and the
>>>>> JVM statistics subsystem:
>>>>>
>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>
>>>>> Webrev URL:
>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
>>>>>
>>>>> Testing: Aurora Adhoc RT-SVC nightly batch
>>>>>          Aurora Adhoc vm.tmtools batch
>>>>>          Kim's repro sequence for JDK-8049304
>>>>>          Kim's repro sequence for JDK-8129978
>>>>>          JPRT -testset hotspot
>>>>>
>>>>> This "fix":
>>>>>
>>>>> - adds a volatile flag to record whether PerfDataManager is holding
>>>>> data (PerfData objects)
>>>>> - adds PerfDataManager::has_PerfData() to return the flag
>>>>> - changes the Java monitor subsystem's use of PerfData to
>>>>> check both allocation of the monitor subsystem specific
>>>>> PerfData object and the new PerfDataManager::has_PerfData()
>>>>> return value
>>>>>
>>>>> If the global 'UsePerfData' option is false, the system works as
>>>>> it did before. If 'UsePerfData' is true (the default on non-embedded
>>>>> systems), the Java monitor subsystem will allocate a number of
>>>>> PerfData objects that record information about the Java monitor
>>>>> subsystem until the JVM shuts down.
>>>>>
>>>>> When the JVM starts to shut down, the new PerfDataManager flag will
>>>>> change to false and the Java monitor subsystem will stop using the
>>>>> PerfData objects. This is the new behavior. As noted in the comments
>>>>> I added to the code, the race is still present; I'm just changing
>>>>> the order and the timing to reduce the likelihood of the crash.
>>>>>
>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>
>>>>> Dan
>>>>>
>>
>