RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Thu Aug 27 15:51:19 UTC 2015

Hi David!

Thanks for chiming in on this thread!

Replies embedded below as usual...

On 8/26/15 10:26 PM, David Holmes wrote:
> Hi Dan,
>
> On 26/08/2015 7:08 AM, Daniel D. Daugherty wrote:
>> Greetings,
>>
>> I have a "fix" for a long standing race between JVM shutdown and the
>> JVM statistics subsystem:
>>
>> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
>> https://bugs.openjdk.java.net/browse/JDK-8049304
>>
>> Webrev URL: 
>> http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
>>
>> Testing: Aurora Adhoc RT-SVC nightly batch
>>           Aurora Adhoc vm.tmtools batch
>>           Kim's repro sequence for JDK-8049304
>>           Kim's repro sequence for JDK-8129978
>>           JPRT -testset hotspot
>>
>> This "fix":
>>
>> - adds a volatile flag to record whether PerfDataManager is holding
>>    data (PerfData objects)
>> - adds PerfDataManager::has_PerfData() to return the flag
>> - changes the Java monitor subsystem's use of PerfData to
>>    check both allocation of the monitor subsystem specific
>>    PerfData object and the new PerfDataManager::has_PerfData()
>>    return value
>>
>> If the global 'UsePerfData' option is false, the system works as
>> it did before. If 'UsePerfData' is true (the default on non-embedded
>> systems), the Java monitor subsystem will allocate a number of
>> PerfData objects to record information. The objects will record
>> information about Java monitor subsystem until the JVM shuts down.
>>
>> When the JVM starts to shutdown, the new PerfDataManager flag will
>> change to false and the Java monitor subsystem will stop using the
>> PerfData objects. This is the new behavior. As noted in the comments
>> I added to the code, the race is still present; I'm just changing
>> the order and the timing to reduce the likelihood of the crash.
>
> Right. To sum up: the basic problem is that the PerfData objects are 
> deallocated at the safepoint established for VM termination, but those 
> objects can actually be used by threads that are in a safepoint-safe 
> state: in particular within the low-level synchronization code.
>
> As you say this fix narrows the window where a crash can occur, but 
> can not close it. If a thread is descheduled after the check of 
> hasPerfData it can still access the PerfData object when it resumes, 
> which may be after the object was deallocated. There's no true fix 
> here without introducing synchronization (which would have to be even 
> lower-level to avoid reentrant use of the same code we're fixing!) and 
> the overhead of that would be prohibitive for these perf counters.
>
> In response to Kim's concern about other code that uses PerfData 
> objects I think you would have to examine those uses to see which, if 
> any, can occur from either a non-JavaThread, or from within the code 
> where a thread is considered safepoint-safe. I'm inclined to agree 
> that given we have not seen issues with such code, either it does not 
> exist or is extremely unlikely to hit this issue. Given the "fix" is 
> itself only narrowing the window it doesn't seem necessary to address 
> code that already has a narrower window.
>
> That all said "leaking" the PerfData objects seems no less unpleasant 
> a "fix". There are so many obstacles in the way of being able to 
> unload and re-load the JVM that I do not think this makes the position 
> measurably worse. In fact I can imagine that if we were to allow for 
> such behaviour we would need to be able to terminate threads and 
> reclaim all their resources (like Monitor instances), at which point 
> it would also become easy to deallocate shared memory like PerfData 
> objects.

Here's what I wrote in the bug report before I started this review cycle:

Daniel Daugherty added a comment - 2015-08-21 20:40
Continued investigating VM shutdown race:

JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
JDK-8129978 SIGSEGV when parsing command line options

    - Thanks to Kim for providing easy reproduction instructions for
      both bugs; I've tweaked the repro code a bit
    - The "correct" solution is to add a locking/memory ordering
      mechanism to ensure that PerfData is only used when valid.
      The locking/memory ordering would slow down the PerfData
      mechanism for every update. Ouch!
    - The "fast but safe" solution is to leak the PerfData memory
      and not clean them up at VM shutdown. We're trying to clean
      up the code base so the idea of intentionally leaking memory
      makes me cringe.
    - The solution I'm investigating is between "fast but safe"
      and "correct". I'm adding a PerfDataManager.has_PerfData()
      function that returns true when PerfDataManager is holding
      PerfData objects and false when none have been allocated
      or when they have been freed at VM shutdown. The flag
      holding the state is volatile and I use release_store()
      to change it so that publication is visible more quickly.
      On the VM shutdown path, I also do a 1ms sleep after setting
      the flag and before freeing the memory.
    - The idea is that the 1ms sleep will give any threads that
      saw PerfDataManager.has_PerfData() == true a chance to do
      their operation on the PerfData object before VM shutdown
      thread frees it.

So I think we're all roughly on the same page here:

1) We don't like the current system because we keep getting
    these shutdown race crashes. Of course, a new one came
    in early this AM:

    JDK-8134566 java/lang/invoke/LFCaching/LFMultiThreadCachingTest.java
                crashes in monitor synchronization code
    https://bugs.openjdk.java.net/browse/JDK-8134566

2) We don't like the "correct" solution because it would slow down
    the performance counters and possibly skew the very data we are
    trying to gather. Kim has also pointed out that adding more
    locking in a subsystem used by higher level locking is risky.

3) We don't like the "fast but safe" solution of leaking the PerfData
    memory. We try to make ourselves feel better about this by saying
    there are plenty of other leaks in the VM... slippery slope?

4) We don't like the proposed solution because the race still exists
    and we could continue to see failures like these. Only they would
    be more rare and possibly harder to spot.

5) Off-thread Kim and I have been talking about adding logic to the
    signal handler filters to detect a SIGSEGV that comes from use of
    a now freed PerfData object. We're mulling on the idea, but have
    not determined if it is even possible or an acceptable idea...

Hopefully, the above accurately sums up our options...

> I'll leave it up to you which way to go. As it stands this is Reviewed.

Thanks!

Here's my proposed plan:

1) I'd like to move forward with this change in order to reduce the
    occurrences of this crasher. Yes, I'm getting tired of seeing and
    analyzing them.

2) If we see PerfData crashes in the future with non-monitor subsystem
    PerfData usage, then we look at adding has_PerfData() calls to that
    subsystem.

3) If we see PerfData crashes in the future in the monitor subsystem,
    then that indicates that the theoretical race is real or I missed
    protecting a PerfData usage with has_PerfData(). If the race is
    real, then we examine these alternatives:

    - leak the PerfData objects on the JVM shutdown path, i.e.,
      switch to the "fast but safe" solution
    - add signal handler support to make PerfData SIGSEGVs benign

What do folks think?

Dan

>
> Thanks,
> David
>
>> Thanks, in advance, for any comments, questions or suggestions.
>>
>> Dan
>>