RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()
Tom Benson
tom.benson at oracle.com
Fri Aug 28 14:45:08 UTC 2015
Hi,
One more pair of eyes on this. 8^)
On 8/27/2015 8:16 PM, Kim Barrett wrote:
> On Aug 27, 2015, at 5:42 PM, Daniel D. Daugherty <daniel.daugherty at oracle.com> wrote:
>> Sorry for starting another e-mail thread fork in an already complicated
>> review...
> OK, that was fascinating. No, really, I mean it.
>
> It made me realize that we've been arguing and talking past each other
> in part because we're really dealing with two distinct though closely
> related bugs here.
>
> I've been primarily thinking about the case where we're calling
> vm_abort / os::abort, where we presently delete the PerfData
> memory even though there can be arbitrary other threads running. This
> was the case in JDK-8129978, which is how I got involved here in the
> first place. In that bug we were in vm_exit_during_initialization and
> had called perfMemory_exit when some thread attempted to inflate a
> monitor (which is not one of the conflicting cases discussed by Dan).
>
> The problem Dan has been looking at, JDK-8049304, is about a "normal"
> VM shutdown. In this case, the problem is that we believe it is safe
> to delete the PerfData, because we've safepointed, and yet some thread
> unexpectedly runs and attempts to touch the deleted data anyway.
>
> I think Dan's proposed fix (mostly) avoids the specific instance of
> JDK-8129978, but doesn't solve the more general problem of abnormal
> exit deleting the PerfData while some running thread is touching some
> non-monitor-related part of that data. My proposal to leave it to the
> OS to deal with memory cleanup on process exit would deal with this
> case.
>
> I think Dan's proposed fix (mostly) avoids problems like JDK-8049304.
> And the approach I've been talking about doesn't help at all for this
> case. But I wonder if Dan's proposed fix can be improved. A "futile
> wakeup" case doesn't seem to me like one which requires super-high
> performance. Would it be ok, in the two problematic cases that Dan
> identified, to use some kind of atomic / locking protocol with the
> cleanup? Or is the comment for the counter increment in EnterI (and
> only there) correct that it's important to avoid a lock or atomics
> here (and presumably in ReenterI too)?
>
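For reference, here is roughly what I picture for the kind of atomic
protocol Kim mentions. This is only a sketch in std::atomic terms, not
HotSpot code; the names (g_perfdata_users, perfdata_try_enter, etc.)
are made up for illustration:

#include <atomic>

// >= 0: number of threads currently touching PerfData; -1: deleted.
static std::atomic<int> g_perfdata_users{0};

// A thread calls this before bumping a PerfData counter.
static bool perfdata_try_enter() {
  int v = g_perfdata_users.load(std::memory_order_relaxed);
  while (v >= 0) {
    if (g_perfdata_users.compare_exchange_weak(
            v, v + 1, std::memory_order_acquire,
            std::memory_order_relaxed)) {
      return true;        // safe to use PerfData until perfdata_exit()
    }
  }
  return false;           // cleanup already won; skip the increment
}

// Matching exit once the counter update is done.
static void perfdata_exit() {
  g_perfdata_users.fetch_sub(1, std::memory_order_release);
}

// The shutdown path calls this once, before freeing PerfData memory.
static void perfdata_disable() {
  int expected = 0;
  while (!g_perfdata_users.compare_exchange_weak(
             expected, -1, std::memory_order_acquire,
             std::memory_order_relaxed)) {
    expected = 0;         // spin until no thread is mid-update
  }
}

The cost is a CAS per counter update, which is why the question of
whether the futile-wakeup path really must avoid atomics matters.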
I notice that EnterI/ReenterI both end with OrderAccess::fence(). Can
the potential update of _sync_FutileWakeups be delayed until that point,
to take advantage of the fence and make the sync hole even smaller?
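To make that concrete, the shape I have in mind is something like the
following. This is heavily simplified stand-in code, not the real
EnterI; only the trailing fence corresponds to what is actually there:

#include <atomic>

static std::atomic<long> g_futile_wakeups{0};  // stands in for _sync_FutileWakeups
static std::atomic<bool> g_owner_free{true};   // stands in for the _owner CAS

static bool try_acquire_monitor() {
  bool expected = true;
  return g_owner_free.compare_exchange_strong(expected, false);
}

static void enter_slow_path() {
  int futile = 0;              // thread-local tally; no shared stores yet
  while (!try_acquire_monitor()) {
    // ... park, get woken ...
    ++futile;                  // woke but lost the race: a futile wakeup
  }
  // One shared update, issued right before the trailing fence, instead
  // of one store per futile wakeup scattered through the retry loop.
  if (futile > 0) {
    g_futile_wakeups.fetch_add(futile, std::memory_order_relaxed);
  }
  std::atomic_thread_fence(std::memory_order_seq_cst);  // the fence() ending EnterI
}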
You've got a release() (and a short nap!) with the store in
PerfDataManager::destroy() to try to close the window somewhat.
But I think rather than the release_store() you used, you want a store
followed by a release(). release_store() puts a fence before the store,
to ensure earlier updates are visible before the current one, no?
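In std::atomic terms, just to illustrate the barrier placement (the
real code uses OrderAccess on the volatile _has_PerfData flag):

#include <atomic>

static std::atomic<int> has_perfdata{1};  // stands in for _has_PerfData

static void release_store_style() {
  // release_store(): the barrier comes first, then the store, so
  // *earlier* writes are ordered ahead of the flag update.
  std::atomic_thread_fence(std::memory_order_release);
  has_perfdata.store(0, std::memory_order_relaxed);
}

static void store_then_release_style() {
  // store + release(): the store comes first, then the barrier, so the
  // flag update is ordered ahead of *later* writes (the delete to come).
  has_perfdata.store(0, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_release);
}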
Also, I think the comment above that release_store() could be
clarified. It is fine as is if you're familiar with this bug report and
discussion, but... I think it should explicitly say there is still a
very small window for the lack of true synchronization to cause a
failure. And perhaps that the release_store() (or store/release()) is
not half of an acquire/release pair.
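Maybe something along these lines (only a suggested wording):

// Clear _has_PerfData before freeing the PerfData memory. This
// narrows, but does NOT close, the race: a thread that sampled the
// flag just before this store can still touch freed memory. The
// barrier only pushes the flag update out promptly; it is not half
// of an acquire/release pair with the readers.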
Tom