RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()
Daniel D. Daugherty
daniel.daugherty at oracle.com
Tue Sep 1 16:56:47 UTC 2015
On 9/1/15 8:49 AM, Daniel D. Daugherty wrote:
> On 9/1/15 12:13 AM, David Holmes wrote:
>> Hi Dan,
>>
>> Hope you had that coffee :)
>
> Just got that second cup... :-)
>
>
>>
>> On 1/09/2015 8:51 AM, Daniel D. Daugherty wrote:
>>> Greetings,
>>>
>>> I've updated the "fix" for this bug based on code review comments
>>> received in round 0.
>>>
>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>
>>> Webrev URL:
>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/
>>>
>>> The easiest way to re-review is to download the two patch files
>>> and view them in your favorite file merge tool:
>>>
>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/hotspot.patch
>>>
>>>
>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/1-jdk9-hs-rt/hotspot.patch
>>>
>>>
>>>
>>> Testing: Aurora Adhoc RT-SVC nightly batch (in process)
>>>          Aurora Adhoc vm.tmtools batch (in process)
>>>          Kim's repro sequence for JDK-8049304
>>>          Kim's repro sequence for JDK-8129978
>>>          JPRT -testset hotspot
>>>
>>> Changes between round 0 and round 1:
>>>
>>> - add an 'is_safe' parameter to the OM_PERFDATA_OP macro;
>>
>> The sense of safepoint-safe is a bit confusing to me. If the thread
>> is in a safepoint-safe state then it seems safepoint-safe==false??
>
> I went back and forth on the naming of this parameter. When 'is_safe'
> is true, the calling thread can safely and directly check has_PerfData
> because we are not executing in parallel with a safepoint, so we can't
> be racing with a normal exit. In a normal exit, has_PerfData is only
> changed (and the memory freed) at a safepoint.
>
> When 'is_safe' is false, then the code path we're on could be executing
> in parallel with a safepoint so we need to be more careful with accessing
> the has_PerfData flag...
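>
> Roughly, the two flavors of call site look like this (illustrative
> sketch only, not lines lifted from the patch):
>
>   // calling context known not to run in parallel with a safepoint:
>   OM_PERFDATA_OP(Inflations, inc(), true /* is_safe */);
>
>   // calling context that can race with the shutdown safepoint:
>   OM_PERFDATA_OP(FutileWakeups, inc(), false /* is_safe */);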
>
> If you have a better name for the parameter...
>
>
>>
>>> safepoint-safe callers can access the _has_PerfData flag directly;
>>> non-safepoint-safe callers use a load-acquire to fetch the
>>> current _has_PerfData flag value
>>
>> load-acquire makes no difference to the "currentness" of
>> _has_PerfData and has no effect on the size of the "race window". So
>> I don't see any point to this is-safe distinction.
>
> Hmmm... I think I got this idea when I re-read orderAccess.hpp where
> it talks about release-store and load-acquire pairs. I originally
> had the transition of _has_PerfData from '1' -> '0' done by a
> release_store() so I was using a load_acquire() as part of the
> pair. I've since switched to a direct setting of "_has_PerfData = 0"
> followed by the big hammer fence(), but I think the load_acquire()
> still pairs with that fence().
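>
> In sketch form (the writer runs once, in PerfDataManager::destroy()
> at shutdown; the reader is any non-safepoint-safe caller):
>
>   // writer:
>   _has_PerfData = 0;     // plain store to the volatile flag
>   OrderAccess::fence();  // full barrier; stronger than release_store()
>
>   // reader:
>   if (OrderAccess::load_acquire(&_has_PerfData) != 0) {
>     // the race window is still there, but the acquire keeps later
>     // PerfData loads from floating above this check
>   }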
I rewrote the comment above the macro:
$ hg diff src/share/vm/runtime/objectMonitor.hpp
diff -r efc17f03e5d4 src/share/vm/runtime/objectMonitor.hpp
--- a/src/share/vm/runtime/objectMonitor.hpp  Thu Aug 20 10:58:57 2015 -0700
+++ b/src/share/vm/runtime/objectMonitor.hpp  Tue Sep 01 09:54:53 2015 -0700
@@ -177,6 +177,33 @@ class ObjectMonitor {
  public:
   static void Initialize();
+
+  // Only perform a PerfData operation if the PerfData object has been
+  // allocated and if the PerfDataManager has not freed the PerfData
+  // objects which can happen at normal VM shutdown. If is_safe == true
+  // then the calling context does not execute in parallel with a
+  // safepoint and we can simply query the flag. Otherwise, the calling
+  // context may be executing in parallel with a safepoint so we use
+  // OrderAccess::load_acquire() to pair with the code that may be
+  // setting the has_PerfData flag to false.
+  //
+  #define OM_PERFDATA_OP(f, op_str, is_safe)                        \
+    do {                                                            \
+      if (ObjectMonitor::_sync_ ## f != NULL) {                     \
+        bool do_the_op = false;                                     \
+        if (is_safe) {                                              \
+          if (PerfDataManager::has_PerfData()) {                    \
+            do_the_op = true;                                       \
+          }                                                         \
+        } else if (PerfDataManager::has_PerfData_with_acquire()) {  \
+          do_the_op = true;                                         \
+        }                                                           \
+        if (do_the_op) {                                            \
+          ObjectMonitor::_sync_ ## f->op_str;                       \
+        }                                                           \
+      }                                                             \
+    } while (0)
+
   static PerfCounter * _sync_ContendedLockAttempts;
   static PerfCounter * _sync_FutileWakeups;
   static PerfCounter * _sync_Parks;
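FWIW, has_PerfData_with_acquire() itself is just a thin accessor along
these lines (sketch, not a paste from the webrev):

  static jint has_PerfData_with_acquire() {
    // pairs with the fence() after "_has_PerfData = 0" in destroy()
    return OrderAccess::load_acquire(&_has_PerfData);
  }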
Hopefully you'll like this version better?
Dan
>
>
>>
>>> - change PerfDataManager::destroy() to simply set _has_PerfData
>>> to zero (field is volatile) and then use a fence() to prevent
>>> any reordering of operations in any direction; it's only done
>>> once during VM shutdown so...
>>
>> See previous email :)
>
> Yup. I did and I'd still like to keep it for the reasons stated
> in my reply to the previous e-mail. You were good with this code
> as a release_store() of '0' so I kind of expected you to be OK
> with the stronger fence().
>
>
>>
>>> - change perfMemory_exit() to only call PerfDataManager::destroy()
>>> when called at a safepoint and when the StatSampler is not
>>> running; this means when the VM is aborting, we no longer have
>>> a race between the original crash report and this code path.
>>
>> Sounds reasonable.
>
> Thanks. I hope it makes Kim happy also.
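>
> The guard ends up along these lines (sketch; the exact condition
> spelling is in the webrev):
>
>   if (SafepointSynchronize::is_at_safepoint() &&
>       !StatSampler::is_active()) {
>     PerfDataManager::destroy();
>   }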
>
>
>>
>> I have a few specific comments/nits:
>>
>> src/share/vm/runtime/perfData.cpp
>>
>> 43 volatile jint PerfDataManager::_has_PerfData = false;
>>
>> Nit: jint should be initialized with 0 not false.
>
> Will fix; oops...that made it past round 0...
>
>
>>
>> 287 // manipulation before we free the memory. The two alternatives
>> 288 // are 1) leak the PerfData memory or 2) do some form of ordered
>> 289 // access before every PerfData operation.
>>
>> No amount of ordered access will fix the race - only synchronization
>> of some form.
>
> I thought I had changed that comment. Will fix.
>
>
>>
>> 315 void PerfDataManager::add_item(PerfData* p, bool sampled) {
>> 316
>> 317 MutexLocker ml(PerfDataManager_lock);
>> 318
>> 319 if (_all == NULL) {
>> 320 _all = new PerfDataList(100);
>> 321 OrderAccess::release_store(&_has_PerfData, 1);
>> 322 }
>>
>> I can't see any purpose for a release_store here. You have one-time
>> initialization code guarded with a mutex. A release_store adds
>> nothing here - other than future head-scratching from maintainers of
>> this code.
>
> Agreed. Not sure how I missed the fact that we were in the
> locked block. oops...that made it past round 0...
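>
> Round 2 will presumably just do the plain store under the lock
> (sketch):
>
>   MutexLocker ml(PerfDataManager_lock);
>   if (_all == NULL) {
>     _all = new PerfDataList(100);
>     _has_PerfData = 1;  // PerfDataManager_lock already orders this
>   }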
>
> Thanks again for the review. Will generate another webrev...
>
> Dan
>
>
>>
>> Thanks,
>> David
>>
>>> I believe that I've addressed all comments from round 0.
>>>
>>> Thanks, in advance, for any comments, questions or suggestions.
>>>
>>> Dan
>>>
>>>
>>> On 8/25/15 3:08 PM, Daniel D. Daugherty wrote:
>>>> Greetings,
>>>>
>>>> I have a "fix" for a long standing race between JVM shutdown and the
>>>> JVM statistics subsystem:
>>>>
>>>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>>>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>>>
>>>> Webrev URL:
>>>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
>>>>
>>>> Testing: Aurora Adhoc RT-SVC nightly batch
>>>>          Aurora Adhoc vm.tmtools batch
>>>>          Kim's repro sequence for JDK-8049304
>>>>          Kim's repro sequence for JDK-8129978
>>>>          JPRT -testset hotspot
>>>>
>>>> This "fix":
>>>>
>>>> - adds a volatile flag to record whether PerfDataManager is holding
>>>> data (PerfData objects)
>>>> - adds PerfDataManager::has_PerfData() to return the flag
>>>> - changes the Java monitor subsystem's use of PerfData to
>>>> check both allocation of the monitor subsystem specific
>>>> PerfData object and the new PerfDataManager::has_PerfData()
>>>> return value
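>>>>
>>>> i.e., the round 0 check amounts to (sketch, using one of the
>>>> monitor subsystem counters as an example):
>>>>
>>>>   if (ObjectMonitor::_sync_FutileWakeups != NULL &&
>>>>       PerfDataManager::has_PerfData()) {
>>>>     ObjectMonitor::_sync_FutileWakeups->inc();
>>>>   }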
>>>>
>>>> If the global 'UsePerfData' option is false, the system works as
>>>> it did before. If 'UsePerfData' is true (the default on non-embedded
>>>> systems), the Java monitor subsystem will allocate a number of
>>>> PerfData objects to record information. The objects will record
>>>> information about the Java monitor subsystem until the JVM shuts down.
>>>>
>>>> When the JVM starts to shutdown, the new PerfDataManager flag will
>>>> change to false and the Java monitor subsystem will stop using the
>>>> PerfData objects. This is the new behavior. As noted in the comments
>>>> I added to the code, the race is still present; I'm just changing
>>>> the order and the timing to reduce the likelihood of the crash.
>>>>
>>>> Thanks, in advance, for any comments, questions or suggestions.
>>>>
>>>> Dan
>>>>
>>>>
>>>>
>>>
>
>
>