RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Wed Sep 2 01:43:33 UTC 2015

Hi Dan,

On 2/09/2015 2:56 AM, Daniel D. Daugherty wrote:
> On 9/1/15 8:49 AM, Daniel D. Daugherty wrote:
>> On 9/1/15 12:13 AM, David Holmes wrote:
>>> Hi Dan,
>>>
>>> Hope you had that coffee :)
>>
>> Just got that second cup... :-)

You might need to upgrade to something stronger :)

>>>
>>> On 1/09/2015 8:51 AM, Daniel D. Daugherty wrote:
>>>> Greetings,
>>>>
>>>> I've updated the "fix" for this bug based on code review comments
>>>> received in round 0.
>>>>
>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>
>>>> Webrev URL:
>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/
>>>>
>>>> The easiest way to re-review is to download the two patch files
>>>> and view them in your favorite file merge tool:
>>>>
>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch
>>>>
>>>>
>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/hotspot.patch
>>>>
>>>>
>>>>
>>>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>>>           Aurora Adhoc vm.tmtools batch (in process)
>>>>           Kim's repro sequence for JDK-8049304
>>>>           Kim's repro sequence for JDK-8129978
>>>>           JPRT -testset hotspot
>>>>
>>>> Changes between round 0 and round 1:
>>>>
>>>> - add an 'is_safe' parameter to the OM_PERFDATA_OP macro;
>>>
>>> The sense of safepoint-safe is a bit confusing to me. If the thread
>>> is in a safepoint-safe-state then it seems safepoint-safe==false ??
>>
>> I went back and forth on the naming of this parameter. When 'is_safe'
>> is true, the calling thread can safely and directly check has_PerfData
>> because we are not at a safepoint so we can't be racing with a normal
>> exit. In a normal exit, has_PerfData is only changed (and memory freed)
>> at a safepoint.
>>
>> When 'is_safe' is false, then the code path we're on could be executing
>> in parallel with a safepoint so we need to be more careful with accessing
>> the has_PerfData flag...
>>
>> If you have a better name for the parameter...

I think not referring to it as "safepoint-safe" at the call sites would 
help, but as per below I think is-safe completely unnecessary.

>>>
>>>>    safepoint-safe callers can access _has_PerfData flag directly;
>>>>    non-safepoint-safe callers use a load-acquire to fetch the
>>>>    current _has_PerfData flag value
>>>
>>> load-acquire makes no difference to the "currentness" of
>>> _has_PerfData and has no affect on the size of the "race window". So
>>> I don't see any point to this is-safe distinction.
>>
>> Hmmm... I think I got this idea when I re-read orderAccess.hpp where
>> it talks about release-store and load-acquire pairs. I originally
>> had the transition of _has_PerfData from '1' -> '0' done by a
>> release_store() so I was using a load_acquire() as part of the
>> pair. I've since switched to a direct setting of "_has_PerfData = 0"
>> followed by the big hammer fence(), but I think the load_acquire()
>> still pairs with that fence().
>
> I rewrote the comment above the macro:
>
> $ hg diff src/share/vm/runtime/objectMonitor.hpp
> diff -r efc17f03e5d4 src/share/vm/runtime/objectMonitor.hpp
> --- a/src/share/vm/runtime/objectMonitor.hpp    Thu Aug 20 10:58:57 2015
> -0700
> +++ b/src/share/vm/runtime/objectMonitor.hpp    Tue Sep 01 09:54:53 2015
> -0700
> @@ -177,6 +177,33 @@ class ObjectMonitor {
>
>    public:
>     static void Initialize();
> +
> +  // Only perform a PerfData operation if the PerfData object has been
> +  // allocated and if the PerfDataManager has not freed the PerfData
> +  // objects which can happen at normal VM shutdown. If is_safe == true
> +  // then the calling context does not execute in parallel with a
> +  // safepoint and we can simply query the flag. Otherwise, the calling
> +  // context may be executing in parallel with a safepoint so we use
> +  // OrderAccess::load_acquire() to pair with the code that may be
> +  // setting the has_PerfData flag to false.

The whole point of the release/acquire pairings is to ensure that when 
you do the load_acquire and see the magic value (be it 1 or 0) then 
everything prior to the release_store that set that magic value, is also 
visible. But here when we set _has_PerfData=0 there is nothing prior to 
that that the code checking _hs_PerfData==0 cares about. The normal 
usage pattern has the form:

writer:
   initialize_data();
   release_store(dataReady, true)

reader:
   if (load_acquire(dataready) == true)
     safely_use_data();

but that isn't the pattern here, so the load_acquire (and the fence()) 
serves no actual purpose. So the whole is-safe distinction really serves 
no purpose in my opinion.

The reason why I object more strongly to the fence() versus the 
release_store(), is that use of the fence() seems (to me) a stronger 
statement of requirement "hey we really do need this full bidirectional 
fence". Further originally I hadn't realized what Tom(?) pointed out 
regarding the fact that the release_store puts the barrier before the 
store and we want the barrier after the store. Then followed the 
discussion that what we need is storeload|storestore to maintain the 
ordering (ignore the os::sleep()). But that somehow has morphed into a 
full fence() - which I don't see the need for.

Cheers,
David
-----

> +  //
> +  #define OM_PERFDATA_OP(f, op_str, is_safe)                        \
> +    do {                                                            \
> +      if (ObjectMonitor::_sync_ ## f != NULL) {                     \
> +        bool do_the_op = false;                                     \
> +        if (is_safe) {                                              \
> +          if (PerfDataManager::has_PerfData()) {                    \
> +            do_the_op = true;                                       \
> + }                                                         \
> +        } else if (PerfDataManager::has_PerfData_with_acquire()) {  \
> +          do_the_op = true;                                         \
> + }                                                           \
> +        if (do_the_op) {                                            \
> +          ObjectMonitor::_sync_ ## f->op_str;                       \
> + }                                                           \
> + }                                                             \
> +    } while (0)
> +
>     static PerfCounter * _sync_ContendedLockAttempts;
>     static PerfCounter * _sync_FutileWakeups;
>     static PerfCounter * _sync_Parks;
>
>
>
> Hopefully, you like this version better?
>
> Dan
>
>
>>
>>
>>>
>>>> - change PerfDataManager::destroy() to simply set _has_PerfData
>>>>    to zero (field is volatile) and then use a fence() to prevent
>>>>    any reordering of operations in any direction; it's only done
>>>>    once during VM shutdown so...
>>>
>>> See previous email :)
>>
>> Yup. I did and I'd still like to keep it for the reasons stated
>> in my reply to the previous e-mail. You were good with this code
>> as a release_store() of '0' so I kind of expected you to be OK
>> with the stronger fence().
>>
>>
>>>
>>>> - change perfMemory_exit() to only call PerfDataManager::destroy()
>>>>    when called at a safepoint and when the StatSample is not
>>>>    running; this means when the VM is aborting, we no longer have
>>>>    a race between the original crash report and this code path.
>>>
>>> Sounds reasonable.
>>
>> Thanks. I hope it makes Kim happy also.
>>
>>
>>>
>>> I have few specific comments/nits:
>>>
>>> src/share/vm/runtime/perfData.cpp
>>>
>>>  43 volatile jint   PerfDataManager::_has_PerfData = false;
>>>
>>> Nit: jint should be initialized with 0 not false.
>>
>> Will fix; oops...that made it past round 0...
>>
>>
>>>
>>>  287   // manipulation before we free the memory. The two alternatives
>>>  288   // are 1) leak the PerfData memory or 2) do some form of ordered
>>>  289   // access before every PerfData operation.
>>>
>>> No amount of ordered access will fix the race - only synchronization
>>> of some form.
>>
>> I thought I had changed that comment. Will fix.
>>
>>
>>>
>>>  315 void PerfDataManager::add_item(PerfData* p, bool sampled) {
>>>  316
>>>  317   MutexLocker ml(PerfDataManager_lock);
>>>  318
>>>  319   if (_all == NULL) {
>>>  320     _all = new PerfDataList(100);
>>>  321     OrderAccess::release_store(&_has_PerfData, 1);
>>>  322   }
>>>
>>> I can't see any purpose for a release_store here. You have one-time
>>> initialization code guarded with a mutex. A release_store adds
>>> nothing here - other than future head-scratching from maintainers of
>>> this code.
>>
>> Agreed. Not sure how I missed the fact that we were in the
>> locked block.  oops...that made it past round 0...
>>
>> Thanks again for the review. Will generate another webrev...
>>
>> Dan
>>
>>
>>>
>>> Thanks,
>>> David
>>>
>>>> I believe that I've addressed all comments from round 0.
>>>>
>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>
>>>> Dan
>>>>
>>>>
>>>> On 8/25/15 3:08 PM, Daniel D. Daugherty wrote:
>>>>> Greetings,
>>>>>
>>>>> I have a "fix" for a long standing race between JVM shutdown and the
>>>>> JVM statistics subsystem:
>>>>>
>>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>>
>>>>> Webrev URL:
>>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
>>>>>
>>>>> Testing: Aurora Adhoc RT-SVC nightly batch
>>>>>          Aurora Adhoc vm.tmtools batch
>>>>>          Kim's repro sequence for JDK-8049304
>>>>>          Kim's repro sequence for JDK-8129978
>>>>>          JPRT -testset hotspot
>>>>>
>>>>> This "fix":
>>>>>
>>>>> - adds a volatile flag to record whether PerfDataManager is holding
>>>>>   data (PerfData objects)
>>>>> - adds PerfDataManager::has_PerfData() to return the flag
>>>>> - changes the Java monitor subsystem's use of PerfData to
>>>>>   check both allocation of the monitor subsystem specific
>>>>>   PerfData object and the new PerfDataManager::has_PerfData()
>>>>>   return value
>>>>>
>>>>> If the global 'UsePerfData' option is false, the system works as
>>>>> it did before. If 'UsePerfData' is true (the default on non-embedded
>>>>> systems), the Java monitor subsystem will allocate a number of
>>>>> PerfData objects to record information. The objects will record
>>>>> information about Java monitor subsystem until the JVM shuts down.
>>>>>
>>>>> When the JVM starts to shutdown, the new PerfDataManager flag will
>>>>> change to false and the Java monitor subsystem will stop using the
>>>>> PerfData objects. This is the new behavior. As noted in the comments
>>>>> I added to the code, the race is still present; I'm just changing
>>>>> the order and the timing to reduce the likelihood of the crash.
>>>>>
>>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>>
>>>>
>>
>>
>>
>