RFR (S) 8173361: various crashes in JvmtiExport::post_compiled_method_load

Fri Nov 22 19:15:50 UTC 2019

Dan, Thank you for reviewing this!

On 11/22/19 12:49 PM, Daniel D. Daugherty wrote:
> Hi Coleen,
>
> Sorry for the delay in getting back to this re-review.
>
>
> On 11/21/19 9:12 AM, coleen.phillimore at oracle.com wrote:
>>
>> Please review a new version of this change that keeps the nmethod 
>> from being unloaded, after it is added to the deferred event queue:
>>
>> http://cr.openjdk.java.net/~coleenp/2019/8173361.03/webrev/index.html
>
> src/hotspot/share/code/nmethod.cpp
>     No comments.
>
> src/hotspot/share/oops/instanceKlass.cpp
>     No comments.
>
> src/hotspot/share/prims/jvmtiExport.cpp
>     No comments.
>
> src/hotspot/share/prims/jvmtiImpl.cpp
>     Nice solution with the new oops_do() and nmethods_do() functions!
Erik's insistance!
>
>     old L988: void JvmtiDeferredEventQueue::enqueue(const 
> JvmtiDeferredEvent& event) {
>     new L998: void JvmtiDeferredEventQueue::enqueue(JvmtiDeferredEvent 
> event) {
>         Not sure why this was changed.
>
>         Update: Looks like Serguei raised the issue and Coleen has already
>         resolved it.

Yes.
>
> src/hotspot/share/prims/jvmtiImpl.hpp
>     old L494:     QueueNode(const JvmtiDeferredEvent& event)
>     new L498:     QueueNode(JvmtiDeferredEvent& event)
>         Why was this changed?
>
>         Update: Not clear if this was covered by Coleen's reply to 
> Serguei.
>
>     old L497:     const JvmtiDeferredEvent& event() const { return 
> _event; }
>     new L501:     JvmtiDeferredEvent& event() { return _event; }
>         Why was this changed?
>
>         Update: Coleen's reply to Serguei explained this. Perhaps add:
>                   // Not const because of oops_do() and nmethods_do().
>
>     old L509:   static void enqueue(const JvmtiDeferredEvent& event) 
> NOT_JVMTI_RETURN;
>     new L513:   static void enqueue(JvmtiDeferredEvent event) 
> NOT_JVMTI_RETURN;
>         Why was this changed?
>
>         Update: Looks like Serguei raised the issue and Coleen has already
>         resolved it.

Yes, I fixed these.
>
> src/hotspot/share/runtime/mutexLocker.cpp
>     This change is going to require some testing to make sure we don't
>     have any new deadlock scenarios.

Luckily, I've previously added an implicit NoSafepointVerifier to locks 
that are _allow_vm_block = true, like this one.

+ def(JmethodIdCreation_lock , PaddedMutex , leaf, true, 
_safepoint_check_never); // used for creating jmethodIDs.

which prevents one class of deadlock. If we take out another lock with a 
higher rank, we'll get the ranking assert.

This lock prevents insertion into an array, and has little outside calls.

I'm running tests in tier 1-6 but any code that travels through this 
should get these assertion checks, rather than deadlocking.

>
> src/hotspot/share/runtime/serviceThread.cpp
>     L50 - nit - why the extra blank line?

To separate static data member definitions from functions.  I removed it.
>
> src/hotspot/share/runtime/serviceThread.hpp
>     Thanks for cleaning up the static:
>
>       ServiceThread::is_service_thread(Thread* thread)
>
>     stuff. Having it be different than the other threads was
>     a bit jarring.
>
> src/hotspot/share/runtime/thread.hpp
>     No comments.
>
> Thumbs up. My only comments are nits so I don't need to see a
> new webrev if you decide to fix them.

So it turns out that in stress testing my fix 
forhttps://bugs.openjdk.java.net/browse/JDK-8212160

Because I was in the area and thought this was a duplicate of that bug 
(it is not).   I found that calling oops_do and nmethods_do the 
ServiceThread  needs to hold the Service_lock, because other threads can 
be adding things to the global queue while the sweeper thread is calling 
this in a handshake.

I am now retesting this change with the changes above, and with the 
Service_lock.   So far my stress tests for JDK-81212160 and the stress 
test for this bug pass, but I'm going to run through all the tiers 1-6 
over the weekend.

Please have a look at the changes in the meantime.

http://cr.openjdk.java.net/~coleenp/2019/8173361.04.incr/webrev
http://cr.openjdk.java.net/~coleenp/2019/8173361.04/webrev

Thanks,
Coleen
>
> Dan
>
>>
>> Ran the test that failed 100 times without failure, tier1 on Oracle 
>> supported platforms, and tier2-3 including jvmti and jdi tests locally.
>>
>> See bug for more details about the crash.
>>
>> https://bugs.openjdk.java.net/browse/JDK-8173361
>>
>> Thanks,
>> Coleen
>>
>> On 11/18/19 10:09 PM, coleen.phillimore at oracle.com wrote:
>>>
>>> Hi Serguei,
>>>
>>> Sorry for not sending an update.  I talked to Erik and am working on 
>>> a version that keeps the nmethod from being unloaded while it's in 
>>> the deferred event queue, with a version that the GC people will 
>>> like, and I like.  I'm testing it out now.
>>>
>>> Thanks!
>>> Coleen
>>>
>>>
>>> On 11/18/19 10:03 PM, serguei.spitsyn at oracle.com wrote:
>>>> Hi Coleen,
>>>>
>>>> Sorry for the latency, I had to investigate it a little bit.
>>>> I still have some doubt your fix is right thing to do.
>>>>
>>>>
>>>> On 11/16/19 04:55, coleen.phillimore at oracle.com wrote:
>>>>>
>>>>>
>>>>> On 11/15/19 11:17 PM, serguei.spitsyn at oracle.com wrote:
>>>>>> Hi Coleen,
>>>>>>
>>>>>> On 11/15/19 2:12 PM, coleen.phillimore at oracle.com wrote:
>>>>>>>
>>>>>>> Hi, I've been working on answers to these questions, so I'll 
>>>>>>> start with this one.
>>>>>>>
>>>>>>> The nmethodLocker keeps the nmethod from being reclaimed 
>>>>>>> (made_zombie or memory released) by the sweeper, but the nmethod 
>>>>>>> could be unloaded.  Unloading the nmethod clears the Method* 
>>>>>>> _method field.
>>>>>>
>>>>>> Yes, I see it is done in the nmethod::make_unloaded().
>>>>>>
>>>>>>> The post_compiled_method_load event needs the _method field to 
>>>>>>> look at things like inlining and ScopeDesc fields.   If the 
>>>>>>> nmethod is unloaded, some of the oops are dead.  There are 
>>>>>>> "holder" oops that correspond to the metadata in the nmethod.  
>>>>>>> If these oops are dead, causing the nmethod to get unloaded, 
>>>>>>> then the metadata may not be valid.
>>>>>>>
>>>>>>> So my change 02 looks for a NULL nmethod._method field to tell 
>>>>>>> whether we can post information about the nmethod.
>>>>>>>
>>>>>>> There's code in nmethod.cpp like:
>>>>>>>
>>>>>>> jmethodID nmethod::get_and_cache_jmethod_id() {
>>>>>>>   if (_jmethod_id == NULL) {
>>>>>>>     // Cache the jmethod_id since it can no longer be looked up 
>>>>>>> once the
>>>>>>>     // method itself has been marked for unloading.
>>>>>>>     _jmethod_id = method()->jmethod_id();
>>>>>>>   }
>>>>>>>   return _jmethod_id;
>>>>>>> }
>>>>>>>
>>>>>>> Which was added when post_method_load and unload were turned 
>>>>>>> into deferred events.
>>>>>>
>>>>>> Could we cache the jmethodID in the 
>>>>>> JvmtiDeferredEvent::compiled_method_load_event
>>>>>> similarly as we do in the 
>>>>>> JvmtiDeferredEvent::compiled_method_unload_event?
>>>>>> This would help to get rid of the dependency on the nmethod::_method.
>>>>>> Do we depend on any other nmethod fields?
>>>>>
>>>>> Yes, there are other nmethod metadata that we rely on to print 
>>>>> inline information, and this function 
>>>>> JvmtiCodeBlobEvents::build_jvmti_addr_location_map because it uses 
>>>>> the ScopeDesc data in the nmethod.
>>>>
>>>> One possible approach is to prepare and cache all this information
>>>> in the nmethod::post_compiled_method_load_event() before the
>>>> JvmtiDeferredEvent::compiled_method_load_event() is called.
>>>> The event parameters are:
>>>> typedef struct {
>>>>      const void* start_address;
>>>>      jlocation location;
>>>> } jvmtiAddrLocationMap;
>>>> CompiledMethodLoad(jvmtiEnv *jvmti_env,
>>>>              jmethodID method,
>>>>              jint code_size,
>>>>              const void* code_addr,
>>>>              jint map_length,
>>>>              const jvmtiAddrLocationMap* map,
>>>>              const void* compile_info)
>>>> Some of these addresses above could be not accessible when an event 
>>>> is posted.
>>>> Not sure yet if it is Okay.
>>>> The question is if this kind of refactoring is worth and right 
>>>> thing to do.
>>>>
>>>>>
>>>>> We do cache the jmethodID but that's not good enough.  See my last 
>>>>> comment in the bug report.  The jmethodID can point to an unloaded 
>>>>> method.
>>>>
>>>> This looks like it is done a little bit late.
>>>> It'd better to do it before the event is deferred (see above).
>>>>
>>>>> I tried a version of keeping the nmethod alive, but the GC folks 
>>>>> will hate it.  And it doesn't work and I hate it.
>>>>
>>>> From serviceability point of view this is the best and most 
>>>> consistent approach.
>>>> I seems to me, it was initially designed this way.
>>>> The downside is it adds some extra complexity to the GC.
>>>>
>>>>> My version 01 is the best, with the caveat that maybe it should 
>>>>> check for _method == NULL instead of nmethod->is_alive().  I have 
>>>>> to talk to Erik to see if there's a race with concurrent class 
>>>>> unloading.
>>>>>
>>>>> Any application that depends on a compiled method loading event on 
>>>>> a class that could be unloaded is a buggy application.  
>>>>> Applications should not rely on when the JIT compiler decides to 
>>>>> compile a method!  This happens to us for a stress test.  Most 
>>>>> applications will get most of their compiled method loading events 
>>>>> as they normally do.
>>>>
>>>> It is not an application that relies on the compiled method loading 
>>>> event.
>>>> It is about profiling tools to be able to get correct information 
>>>> about what is going on with compilations.
>>>> My concern is that if we skip such compiled method load events then 
>>>> profilers have no way
>>>> to find out there many unneeded compilations that are thrown away 
>>>> without any real use.
>>>> Also, it is not clear what happens with the subsequent compiled 
>>>> method unload events.
>>>> Are they going to be skipped as well or they can appear and confuse 
>>>> profilers?
>>>>
>>>>
>>>> Thanks,
>>>> Serguei
>>>>>
>>>>> Thanks,
>>>>> Coleen
>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Serguei
>>>>>>
>>>>>>> I put more debugging in the bug to show this crash was from an 
>>>>>>> unloaded nmethod.
>>>>>>>
>>>>>>> Coleen
>>>>>>>
>>>>>>>
>>>>>>> On 11/15/19 4:45 PM, serguei.spitsyn at oracle.com wrote:
>>>>>>>> Hi Coleen,
>>>>>>>>
>>>>>>>> I have some questions.
>>>>>>>>
>>>>>>>> Both the compiler method load and unload are posted as deferred 
>>>>>>>> events.
>>>>>>>> Both events keep the nmethod alive until the ServiceThread 
>>>>>>>> processes the event.
>>>>>>>>
>>>>>>>> The implementation is:
>>>>>>>>
>>>>>>>> JvmtiDeferredEvent 
>>>>>>>> JvmtiDeferredEvent::compiled_method_load_event(nmethod* nm) {
>>>>>>>>   . . .
>>>>>>>>   // Keep the nmethod alive until the ServiceThread can process
>>>>>>>>   // this deferred event.
>>>>>>>>   nmethodLocker::lock_nmethod(nm);
>>>>>>>>   return event;
>>>>>>>> }
>>>>>>>>
>>>>>>>> JvmtiDeferredEvent 
>>>>>>>> JvmtiDeferredEvent::compiled_method_unload_event(nmethod* nm, 
>>>>>>>> jmethodID id, const void* code) {
>>>>>>>>   . . .
>>>>>>>>   // Keep the nmethod alive until the ServiceThread can process
>>>>>>>>   // this deferred event. This will keep the memory for the
>>>>>>>>   // generated code from being reused too early. We pass
>>>>>>>>   // zombie_ok == true here so that our nmethod that was just
>>>>>>>>   // made into a zombie can be locked.
>>>>>>>>   nmethodLocker::lock_nmethod(nm, true /* zombie_ok */);
>>>>>>>>   return event;
>>>>>>>> }
>>>>>>>>
>>>>>>>> void JvmtiDeferredEvent::post() {
>>>>>>>> assert(ServiceThread::is_service_thread(Thread::current()),
>>>>>>>>          "Service thread must post enqueued events");
>>>>>>>>   switch(_type) {
>>>>>>>>     case TYPE_COMPILED_METHOD_LOAD: {
>>>>>>>>       nmethod* nm = _event_data.compiled_method_load;
>>>>>>>>       JvmtiExport::post_compiled_method_load(nm);
>>>>>>>>       // done with the deferred event so unlock the nmethod
>>>>>>>>       nmethodLocker::unlock_nmethod(nm);
>>>>>>>>       break;
>>>>>>>>     }
>>>>>>>>     case TYPE_COMPILED_METHOD_UNLOAD: {
>>>>>>>>       nmethod* nm = _event_data.compiled_method_unload.nm;
>>>>>>>>       JvmtiExport::post_compiled_method_unload(
>>>>>>>> _event_data.compiled_method_unload.method_id,
>>>>>>>> _event_data.compiled_method_unload.code_begin);
>>>>>>>>       // done with the deferred event so unlock the nmethod
>>>>>>>>       nmethodLocker::unlock_nmethod(nm);
>>>>>>>>       break;
>>>>>>>>     }
>>>>>>>>     . . .
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> Then I wonder how is it possible for the nmethod to be not 
>>>>>>>> alive here?:
>>>>>>>> 2168 void JvmtiExport::post_compiled_method_load(nmethod *nm) {
>>>>>>>> . . .
>>>>>>>> 2173 // It's not safe to look at metadata for unloaded methods.
>>>>>>>> 2174 if (!nm->is_alive()) {
>>>>>>>> 2175 return;
>>>>>>>> 2176 }
>>>>>>>> At least, it lokks like something else is broken.
>>>>>>>> Do I miss something important here?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Serguei
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/14/19 5:15 PM, coleen.phillimore at oracle.com wrote:
>>>>>>>>> Summary: Don't post information which uses metadata from 
>>>>>>>>> unloaded nmethods
>>>>>>>>>
>>>>>>>>> Tested tier1-3 and 100 times with test that failed (reproduced 
>>>>>>>>> failure without the fix).
>>>>>>>>>
>>>>>>>>> open webrev at 
>>>>>>>>> http://cr.openjdk.java.net/~coleenp/2019/8173361.01/webrev
>>>>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8173361
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Coleen
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/serviceability-dev/attachments/20191122/6622dd64/attachment-0001.html>