RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Thu Aug 27 04:26:03 UTC 2015

Hi Dan,

On 26/08/2015 7:08 AM, Daniel D. Daugherty wrote:
> Greetings,
>
> I have a "fix" for a long standing race between JVM shutdown and the
> JVM statistics subsystem:
>
> JDK-8049304 race between VM_Exit and _sync_FutileWakeups->inc()
> https://bugs.openjdk.java.net/browse/JDK-8049304
>
> Webrev URL: http://cr.openjdk.java.net/~dcubed/8049304-webrev/0-jdk9-hs-rt/
>
> Testing: Aurora Adhoc RT-SVC nightly batch
>           Aurora Adhoc vm.tmtools batch
>           Kim's repro sequence for JDK-8049304
>           Kim's repro sequence for JDK-8129978
>           JPRT -testset hotspot
>
> This "fix":
>
> - adds a volatile flag to record whether PerfDataManager is holding
>    data (PerfData objects)
> - adds PerfDataManager::has_PerfData() to return the flag
> - changes the Java monitor subsystem's use of PerfData to
>    check both allocation of the monitor subsystem specific
>    PerfData object and the new PerfDataManager::has_PerfData()
>    return value
>
> If the global 'UsePerfData' option is false, the system works as
> it did before. If 'UsePerfData' is true (the default on non-embedded
> systems), the Java monitor subsystem will allocate a number of
> PerfData objects to record information. The objects will record
> information about Java monitor subsystem until the JVM shuts down.
>
> When the JVM starts to shutdown, the new PerfDataManager flag will
> change to false and the Java monitor subsystem will stop using the
> PerfData objects. This is the new behavior. As noted in the comments
> I added to the code, the race is still present; I'm just changing
> the order and the timing to reduce the likelihood of the crash.

Right. To sum up: the basic problem is that the PerfData objects are 
deallocated at the safepoint established for VM termination, but those 
objects can actually be used by threads that are in a safepoint-safe 
state: in particular within the low-level synchronization code.

As you say this fix narrows the window where a crash can occur, but can 
not close it. If a thread is descheduled after the check of hasPerfData 
it can still access the PerfData object when it resumes, which may be 
after the object was deallocated. There's no true fix here without 
introducing synchronization (which would have to be even lower-level to 
avoid reentrant use of the same code we're fixing!) and the overhead of 
that would be prohibitive for these perf counters.

In response to Kim's concern about other code that uses PerfData objects 
I think you would have to examine those uses to see which, if any, can 
occur from either a non-JavaThread, or from within the code where a 
thread is considered safepoint-safe. I'm inclined to agree that given we 
have not seen issues with such code, either it does not exist or is 
extremely unlikely to hit this issue. Given the "fix" is itself only 
narrowing the window it doesn't seem necessary to address code that 
already has a narrower window.

That all said "leaking" the PerfData objects seems no less unpleasant a 
"fix". There are so many obstacles in the way of being able to unload 
and re-load the JVM that I do not think this makes the position 
measurably worse. In fact I can imagine that if we were to allow for 
such behaviour we would need to be able to terminate threads and reclaim 
all their resources (like Monitor instances), at which point it would 
also become easy to deallocate shared memory like PerfData objects.

I'll leave it up to you which way to go. As it stands this is Reviewed.

Thanks,
David

> Thanks, in advance, for any comments, questions or suggestions.
>
> Dan
>