RFR: 8214556: Crash in DependencyContext::remove_dependent_nmethod still happens

Kim Barrett kim.barrett at oracle.com
Mon Dec 3 21:44:12 UTC 2018


> On Dec 3, 2018, at 10:29 AM, Erik Österlund <erik.osterlund at oracle.com> wrote:
> 
> Hi,
> 
> Unfortunately, even after 8213565, there are still crashes in DependencyContext::remove_dependent_nmethod.
> 
> The claiming mechanism I used for dependency context cleanup used the safepoint counter as a clock source for monotonically incremented cleaning epochs. What I forgot for a moment while fixing this is that VM operations may be coalesced into the same safepoint. If two GC operations run in the same safepoint, dependency context cleaning will only run for the first GC operation, not any subsequent one (the cleanup tasks will appear to have already been claimed), causing stale dependency contexts to remain with pointers to unloaded nmethods. This would cause very intermittent crashes in subsequent operations on those dependency contexts, such as the crashes that have been observed.
> 
> I fixed this by creating a stand-alone counter incremented each cleanup epoch, instead of using the safepoint counter for this.
> 
> Bug:
> https://bugs.openjdk.java.net/browse/JDK-8214556
> 
> Webrev:
> http://cr.openjdk.java.net/~eosterlund/8214556/webrev.00/
> 
> I have run this through tier1-6 on linux twice; I've also run tier1-3 on all platforms twice, and another tier1 on windows. Having said that, it's a very intermittent bug... but at least I know this is wrong, can cause these crashes, and needs to be fixed.
> 
> Thanks,
> /Erik
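
To make sure I follow the failure mode, here is a simplified standalone
model of the claiming scheme as described above (not the actual HotSpot
sources; the names and structure are only illustrative):

#include <cstdint>

static uint64_t current_cleanup_epoch = 0;  // previously: the safepoint counter

struct DependencyContextModel {
  uint64_t _last_cleanup = 0;

  // Returns true if this pass gets to clean the context.
  bool claim_cleanup() {
    if (_last_cleanup >= current_cleanup_epoch) {
      return false;             // appears already cleaned in this epoch
    }
    _last_cleanup = current_cleanup_epoch;
    return true;
  }
};

int main() {
  DependencyContextModel ctx;
  current_cleanup_epoch = 42;          // one safepoint, two coalesced GC ops
  bool first  = ctx.claim_cleanup();   // true  - first GC op cleans
  bool second = ctx.claim_cleanup();   // false - second GC op skips the cleanup
  return (first && !second) ? 0 : 1;
}

With the safepoint counter as the epoch, the second coalesced operation sees
the same epoch value and skips, leaving stale entries behind; bumping a
dedicated counter for each cleanup epoch makes it clean again, which is what
the patch does.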

The approach looks good.  A few comments:

------------------------------------------------------------------------------
src/hotspot/share/prims/methodHandles.cpp
[Removed assert]
1087 void MethodHandles::clean_dependency_context(oop call_site) {
1088   assert_locked_or_safepoint(CodeCache_lock);

How is this assert removal related to the problem at hand?  Is this
something that was missed in JDK-8213565?  It looks right, but I'm
having trouble seeing that in the context of this change.

------------------------------------------------------------------------------
src/hotspot/share/code/dependencyContext.hpp
81   volatile uint64_t*       _last_cleanup_addr;
...
93   static uint64_t                _cleaning_epoch_monotonic;
94   static volatile uint64_t       _cleaning_epoch;

Do these really need to be uint64_t?  It seems like a 64-bit volatile
(with atomic accesses) could be costly or problematic for a 32-bit
platform.  (And the Atomic documentation says verification is
platform. (And the Atomic documentation says verification is
required.) Maybe uintx? Though even just uint seems like it might be
adequate. Oh, but a value of zero has special meaning. So if the
global monotonic counter does wrap (which seems entirely possible)
then an extra increment would be needed. Alternatively, start it at 1
and increment by 2?

(The _monotonic suffix should perhaps also be reconsidered if such a
change is made.  "latest"?)
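
For what it's worth, a tiny standalone sketch of that last idea, using plain
std::atomic rather than the VM's Atomic class (the names here are made up,
not taken from the webrev):

#include <atomic>
#include <cstdint>

// 0 is reserved to mean "not cleaning"; starting at 1 and adding 2 keeps the
// counter odd, so it can never become 0 again even after a 32-bit wrap-around.
static std::atomic<uint32_t> cleaning_epoch_latest{1};

static uint32_t begin_cleaning_epoch() {
  return cleaning_epoch_latest.fetch_add(2) + 2;  // returns the new (odd) epoch
}

Since every epoch value stays odd, the special 0 value remains unambiguous
even with a narrower counter type.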

------------------------------------------------------------------------------ 
src/hotspot/share/code/dependencyContext.cpp
319 // and also makes subsequent releases of nmethodBuckets case immediate

[pre-existing]

s/case/cause/ ?

Also, the following sentence in that comment isn't clear to me:
"It is admitted to end the cleanup in a concurrent phase."

I think what is meant is that cleaning_end may occur in a concurrent
phase, but the comment's wording seems to me at least an awkward way
to say that.

------------------------------------------------------------------------------


