RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()

Fri Aug 28 15:52:33 UTC 2015

On 8/27/15 6:16 PM, Kim Barrett wrote:
> On Aug 27, 2015, at 5:42 PM, Daniel D. Daugherty <daniel.daugherty at oracle.com> wrote:
>> Sorry for starting another e-mail thread fork in an already complicated
>> review...
> OK, that was fascinating.  No, really, I mean it.

:-)

> It made me realize that we've been arguing and talking past each other
> in part because we're really dealing with two distinct though closely
> related bugs here.
>
> I've been primarily thinking about the case where we're calling
> vm_abort / os::abort, where the we presently delete the PerfData
> memory even though there can be arbitrary other threads running.  This
> was the case in JDK-8129978, which is how I got involved here in the
> first place.  In that bug we were in vm_exit_during_initialization and
> had called perfMemory_exit when some thread attempted to inflate a
> monitor (which is not one of the conflicting cases discussed by Dan).

Right. All abnormal exit cases would be considered conflicting
in what I wrote yesterday.

> The problem Dan has been looking at, JDK-8049304, is about a "normal"
> VM shutdown.  In this case, the problem is that we believe it is safe
> to delete the PerfData, because we've safepointed, and yet some thread
> unexpectedly runs and attempts to touch the deleted data anyway.

Right, I was concentrating on "normal".

> I think Dan's proposed fix (mostly) avoids the specific instance of
> JDK-8129978, but doesn't solve the more general problem of abnormal
> exit deleting the PerfData while some running thread is touching some
> non-monitor-related part of that data.  My proposal to leave it to the
> OS to deal with memory cleanup on process exit would deal with this
> case.

Now I'm starting to see that you are/were focused on a different
problem. OK. Definitely should look at what we can do here. If we
can avoid crashing a second time while dealing with another crash/
abnormal exit that would be good.

> I think Dan's proposed fix (mostly) avoids problems like JDK-8049304.
> And the approach I've been talking about doesn't help at all for this
> case.  But I wonder if Dan's proposed fix can be improved.  A "futile
> wakeup" case doesn't seem to me like one which requires super-high
> performance.  Would it be ok, in the two problematic cases that Dan
> identified, to use some kind of atomic / locking protocol with the
> cleanup?  Or is the comment for the counter increment in EnterI (and
> only there) correct that it's important to avoid a lock or atomics
> here (and presumably in ReenterI too).

This comment:

    570      // Keep a tally of the # of futile wakeups.
    571      // Note that the counter is not protected by a lock or 
updated by atomics.
    572      // That is by design - we trade "lossy" counters which are 
exposed to
    573      // races during updates for a lower probe effect.

and this comment:

    732      // Keep a tally of the # of futile wakeups.
    733      // Note that the counter is not protected by a lock or 
updated by atomics.
    734      // That is by design - we trade "lossy" counters which are 
exposed to
    735      // races during updates for a lower probe effect.

are not really specific to the monitor subsystem. I think
the comments are generally true about the perf counters.

As we discussed earlier in the thread, generally updating the perf
counters with syncs or locks will cost and potentially perturb the
things we are trying to count.

So I think what you're proposing is putting a lock protocol around
the setting of the flag and then have the non-safepoint-safe uses
grab that lock while the safepoint-safe uses skip the lock because
they can rely on the safepoint protocol in the "normal" exit case.

Do I have this right?

Dan