RFR: 8214556: Crash in DependencyContext::remove_dependent_nmethod still happens
Erik Österlund
erik.osterlund at oracle.com
Mon Dec 3 22:13:48 UTC 2018
Hi Kim,
Thank you for reviewing this.
New full webrev:
http://cr.openjdk.java.net/~eosterlund/8214556/webrev.01/
Incremental:
http://cr.openjdk.java.net/~eosterlund/8214556/webrev.00_01/
On 2018-12-03 22:44, Kim Barrett wrote:
>> On Dec 3, 2018, at 10:29 AM, Erik Österlund <erik.osterlund at oracle.com> wrote:
>>
>> Hi,
>>
>> Unfortunately, even after 8213565, there are still crashes in DependencyContext::remove_dependent_nmethod.
>>
>> The claiming mechanism I used for dependency context cleanup used the safepoint counter as a clock source for monotonically incremented cleaning epochs. What I forgot for a moment while fixing this is that VM operations may be coalesced into the same safepoint. If two GC operations run in the same safepoint, dependency context cleaning will only run for the first GC operation, not for any subsequent one (the cleanup tasks will appear to have already been claimed), causing stale dependency contexts to remain with pointers to unloaded nmethods. This would cause very intermittent crashes in subsequent operations on those dependency contexts, such as the crashes that have been observed.
>>
>> I fixed this by creating a stand-alone counter incremented each cleanup epoch, instead of using the safepoint counter for this.
>>
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8214556
>>
>> Webrev:
>> http://cr.openjdk.java.net/~eosterlund/8214556/webrev.00/
>>
>> I have run this through tier1-6 on Linux twice; I've also run tier1-3 on all platforms twice, and another tier1 on Windows. Having said that, it's a very intermittent bug... but at least I know this is wrong, can cause these crashes, and needs to be fixed.
>>
>> Thanks,
>> /Erik
>
> The approach looks good. A few comments:
>
> ------------------------------------------------------------------------------
> src/hotspot/share/prims/methodHandles.cpp
> [Removed assert]
> 1087 void MethodHandles::clean_dependency_context(oop call_site) {
> 1088 assert_locked_or_safepoint(CodeCache_lock);
>
> How is this assert removal related to the problem at hand? Is this
> something that was missed in JDK-8213565? It looks right, but I'm
> having trouble seeing that in the context of this change.
Yes, it is something that was missed in JDK-8213565.
> ------------------------------------------------------------------------------
> src/hotspot/share/code/dependencyContext.hpp
> 81 volatile uint64_t* _last_cleanup_addr;
> ...
> 93 static uint64_t _cleaning_epoch_monotonic;
> 94 static volatile uint64_t _cleaning_epoch;
>
> Do these really need to be uint64_t? It seems like a 64-bit volatile
> (with atomic accesses) could be costly or problematic for a 32-bit
> platform. (And the Atomic documentation says verification is
> required.) Maybe uintx? Though even just uint seems like it might be
> adequate. Oh, but a value of zero has special meaning. So if the
> global monotonic counter does wrap (which seems entirely possible)
> then an extra increment would be needed. Alternatively, start it at 1
> and increment by 2?
>
> (The _monotonic suffix should perhaps also be reconsidered if such a
> change is made. "latest"?)
I would like to keep this counter uint64_t. If wrapping happens, we are
in a lot of trouble. All platforms support atomic load/store on
uint64_t, including 32-bit platforms. It's other atomic operations, such
as add-and-fetch, that are not necessarily supported on 32-bit
platforms, and I only need loads and stores.
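To illustrate what I mean, here is a minimal standalone sketch of the
scheme, written against std::atomic instead of our Atomic:: wrappers;
the names only loosely mirror the webrev and the details are illustrative,
not lifted from it:

  #include <atomic>
  #include <cstdint>

  // Global cleanup epoch state, loosely mirroring _cleaning_epoch_monotonic
  // and _cleaning_epoch. Zero means "no cleanup epoch in progress".
  static uint64_t cleaning_epoch_monotonic = 0; // only written in cleaning_start()
  static std::atomic<uint64_t> cleaning_epoch{0};

  struct DependencyContextSketch {
    // Last epoch in which this context was cleaned.
    std::atomic<uint64_t> last_cleanup{0};

    // Returns true for exactly one claimer per context and cleanup epoch.
    // Note: the two global counters above only need atomic loads and
    // stores; the per-context claim in this sketch uses a CAS.
    bool claim_cleanup() {
      uint64_t epoch = cleaning_epoch.load(std::memory_order_acquire);
      uint64_t last = last_cleanup.load(std::memory_order_relaxed);
      if (last >= epoch) {
        return false; // no epoch in progress (epoch == 0), or already claimed
      }
      return last_cleanup.compare_exchange_strong(last, epoch);
    }
  };

  // Runs in a safepoint. Because this counter is bumped once per cleanup
  // epoch rather than once per safepoint, two GC operations coalesced into
  // the same safepoint get two distinct epochs. A 64-bit counter bumped
  // once per GC cannot realistically wrap.
  void cleaning_start() {
    cleaning_epoch_monotonic++;
    cleaning_epoch.store(cleaning_epoch_monotonic, std::memory_order_release);
  }

  // May run later, in a concurrent phase.
  void cleaning_end() {
    cleaning_epoch.store(0, std::memory_order_release);
  }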
> ------------------------------------------------------------------------------
> src/hotspot/share/code/dependencyContext.cpp
> 319 // and also makes subsequent releases of nmethodBuckets case immediate
>
> [pre-existing]
>
> s/case/cause/ ?
Fixed.
> Also, the following sentence in that comment isn't clear to me:
> "It is admitted to end the cleanup in a concurrent phase."
>
> I think what is meant is that cleaning_end may occur in a concurrent
> phase, but the comment's wording seems to me at least an awkward way
> to say that.
Yes, you understood it correctly. I tried to make it clearer in the
new webrev.
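Concretely, the point of resetting the epoch shows up in the release
path. Continuing the sketch from above (the purge list mechanics here
are illustrative, not lifted from the webrev):

  // While a cleanup epoch is active, cleaning threads may still be
  // traversing the bucket lists, so released buckets are deferred onto a
  // purge list. Once cleaning_end() has reset the epoch to zero, releases
  // cause immediate deletion again.
  struct nmethodBucketSketch {
    nmethodBucketSketch* purge_list_next = nullptr;
  };

  static std::atomic<nmethodBucketSketch*> purge_list{nullptr};

  void release(nmethodBucketSketch* b) {
    if (cleaning_epoch.load(std::memory_order_acquire) == 0) {
      delete b; // no cleanup in progress: safe to delete right away
    } else {
      nmethodBucketSketch* head = purge_list.load(std::memory_order_relaxed);
      do {
        b->purge_list_next = head; // lock-free push onto the purge list
      } while (!purge_list.compare_exchange_weak(head, b));
    }
  }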
Thanks,
/Erik
> ------------------------------------------------------------------------------
>