RFR (S) 8049304: race between VM_Exit and _sync_FutileWakeups->inc()
Tom Benson
tom.benson at oracle.com
Fri Aug 28 14:45:08 UTC 2015
Hi,
One more pair of eyes on this. 8^)
On 8/27/2015 8:16 PM, Kim Barrett wrote:
> On Aug 27, 2015, at 5:42 PM, Daniel D. Daugherty <daniel.daugherty at oracle.com> wrote:
>> Sorry for starting another e-mail thread fork in an already complicated
>> review...
> OK, that was fascinating. No, really, I mean it.
>
> It made me realize that we've been arguing and talking past each other
> in part because we're really dealing with two distinct though closely
> related bugs here.
>
> I've been primarily thinking about the case where we're calling
> vm_abort / os::abort, where we presently delete the PerfData
> memory even though there can be arbitrary other threads running. This
> was the case in JDK-8129978, which is how I got involved here in the
> first place. In that bug we were in vm_exit_during_initialization and
> had called perfMemory_exit when some thread attempted to inflate a
> monitor (which is not one of the conflicting cases discussed by Dan).
>
> The problem Dan has been looking at, JDK-8049304, is about a "normal"
> VM shutdown. In this case, the problem is that we believe it is safe
> to delete the PerfData, because we've safepointed, and yet some thread
> unexpectedly runs and attempts to touch the deleted data anyway.
>
> I think Dan's proposed fix (mostly) avoids the specific instance of
> JDK-8129978, but doesn't solve the more general problem of abnormal
> exit deleting the PerfData while some running thread is touching some
> non-monitor-related part of that data. My proposal to leave it to the
> OS to deal with memory cleanup on process exit would deal with this
> case.
>
> I think Dan's proposed fix (mostly) avoids problems like JDK-8049304.
> And the approach I've been talking about doesn't help at all for this
> case. But I wonder if Dan's proposed fix can be improved. A "futile
> wakeup" case doesn't seem to me like one which requires super-high
> performance. Would it be ok, in the two problematic cases that Dan
> identified, to use some kind of atomic / locking protocol with the
> cleanup? Or is the comment for the counter increment in EnterI (and
> only there) correct that it's important to avoid a lock or atomics
> here (and presumably in ReenterI too)?
>
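For reference, here is roughly what I picture for the kind of atomic
protocol Kim mentions. This is only a sketch in std::atomic terms, not
HotSpot code; the names (g_perfdata_users, perfdata_try_enter, etc.)
are made up for illustration:

#include <atomic>

// >= 0: number of threads currently touching PerfData; -1: deleted.
static std::atomic<int> g_perfdata_users{0};

// A thread calls this before bumping a PerfData counter.
static bool perfdata_try_enter() {
  int v = g_perfdata_users.load(std::memory_order_relaxed);
  while (v >= 0) {
    if (g_perfdata_users.compare_exchange_weak(
            v, v + 1, std::memory_order_acquire,
            std::memory_order_relaxed)) {
      return true;        // safe to use PerfData until perfdata_exit()
    }
  }
  return false;           // cleanup already won; skip the increment
}

// Matching exit once the counter update is done.
static void perfdata_exit() {
  g_perfdata_users.fetch_sub(1, std::memory_order_release);
}

// The shutdown path calls this once, before freeing PerfData memory.
static void perfdata_disable() {
  int expected = 0;
  while (!g_perfdata_users.compare_exchange_weak(
             expected, -1, std::memory_order_acquire,
             std::memory_order_relaxed)) {
    expected = 0;         // spin until no thread is mid-update
  }
}

The cost is a CAS per counter update, which is why the question of
whether the futile-wakeup path really must avoid atomics matters.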
I notice that EnterI/ReenterI both end with OrderAccess::fence(). Can
the potential update of _sync_FutileWakeups be delayed until that point,
to take advantage of the fence and make the sync hole even smaller?
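To make that concrete, the shape I have in mind is something like the
following. This is heavily simplified stand-in code, not the real
EnterI; only the trailing fence corresponds to what is actually there:

#include <atomic>

static std::atomic<long> g_futile_wakeups{0};  // stands in for _sync_FutileWakeups
static std::atomic<bool> g_owner_free{true};   // stands in for the _owner CAS

static bool try_acquire_monitor() {
  bool expected = true;
  return g_owner_free.compare_exchange_strong(expected, false);
}

static void enter_slow_path() {
  int futile = 0;              // thread-local tally; no shared stores yet
  while (!try_acquire_monitor()) {
    // ... park, get woken ...
    ++futile;                  // woke but lost the race: a futile wakeup
  }
  // One shared update, issued right before the trailing fence, instead
  // of one store per futile wakeup scattered through the retry loop.
  if (futile > 0) {
    g_futile_wakeups.fetch_add(futile, std::memory_order_relaxed);
  }
  std::atomic_thread_fence(std::memory_order_seq_cst);  // the fence() ending EnterI
}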
You've got a release() (and a short nap!) with the store in
PerfDataManager::destroy() to try to close the window somewhat.
But I think rather than the release_store() you used, you want a store
followed by a release(). release_store() puts a fence before the store,
to ensure earlier updates are visible before the current one, no?
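In std::atomic terms, just to illustrate the barrier placement (the
real code uses OrderAccess on the volatile _has_PerfData flag):

#include <atomic>

static std::atomic<int> has_perfdata{1};  // stands in for _has_PerfData

static void release_store_style() {
  // release_store(): the barrier comes first, then the store, so
  // *earlier* writes are ordered ahead of the flag update.
  std::atomic_thread_fence(std::memory_order_release);
  has_perfdata.store(0, std::memory_order_relaxed);
}

static void store_then_release_style() {
  // store + release(): the store comes first, then the barrier, so the
  // flag update is ordered ahead of *later* writes (the delete to come).
  has_perfdata.store(0, std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_release);
}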
Also, I think the comment above that release_store() could be
clarified. It is fine as is if you're familiar with this bug report and
discussion, but... I think it should explicitly say there is still a
very small window for the lack of true synchronization to cause a
failure. And perhaps that the release_store() (or store/release()) is
not half of an acquire/release pair.
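Maybe something along these lines (only a suggested wording):

// Clear _has_PerfData before freeing the PerfData memory. This
// narrows, but does NOT close, the race: a thread that sampled the
// flag just before this store can still touch freed memory. The
// barrier only pushes the flag update out promptly; it is not half
// of an acquire/release pair with the readers.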
Tom