RFR (S) 8173361: various crashes in JvmtiExport::post_compiled_method_load

Mon Nov 25 14:37:45 UTC 2019

Hi Coleen,

Still good BTW!

Thanks,
/Erik

On 2019-11-25 14:47, coleen.phillimore at oracle.com wrote:
> Thanks for the code review, Serguei!
> Coleen
>
> On 11/22/19 6:34 PM, serguei.spitsyn at oracle.com wrote:
>> Hi Coleen,
>>
>> +1
>>
>> Thanks,
>> Serguei
>>
>>
>> On 11/22/19 14:53, Daniel D. Daugherty wrote:
>>>> http://cr.openjdk.java.net/~coleenp/2019/8173361.04.incr/webrev
>>>
>>> src/hotspot/share/prims/jvmtiImpl.cpp
>>>     No comments.
>>>
>>> src/hotspot/share/prims/jvmtiImpl.hpp
>>>     No comments.
>>>
>>> src/hotspot/share/runtime/serviceThread.cpp
>>>     No comments.
>>>
>>> Thumbs up.
>>>
>>> Dan
>>>
>>>
>>> On 11/22/19 2:15 PM, coleen.phillimore at oracle.com wrote:
>>>>
>>>> Dan, Thank you for reviewing this!
>>>>
>>>> On 11/22/19 12:49 PM, Daniel D. Daugherty wrote:
>>>>> Hi Coleen,
>>>>>
>>>>> Sorry for the delay in getting back to this re-review.
>>>>>
>>>>>
>>>>> On 11/21/19 9:12 AM, coleen.phillimore at oracle.com wrote:
>>>>>>
>>>>>> Please review a new version of this change that keeps the nmethod 
>>>>>> from being unloaded, after it is added to the deferred event queue:
>>>>>>
>>>>>> http://cr.openjdk.java.net/~coleenp/2019/8173361.03/webrev/index.html
>>>>>
>>>>> src/hotspot/share/code/nmethod.cpp
>>>>>     No comments.
>>>>>
>>>>> src/hotspot/share/oops/instanceKlass.cpp
>>>>>     No comments.
>>>>>
>>>>> src/hotspot/share/prims/jvmtiExport.cpp
>>>>>     No comments.
>>>>>
>>>>> src/hotspot/share/prims/jvmtiImpl.cpp
>>>>>     Nice solution with the new oops_do() and nmethods_do() functions!
>>>> Erik's insistance!
>>>>>
>>>>>     old L988: void JvmtiDeferredEventQueue::enqueue(const 
>>>>> JvmtiDeferredEvent& event) {
>>>>>     new L998: void 
>>>>> JvmtiDeferredEventQueue::enqueue(JvmtiDeferredEvent event) {
>>>>>         Not sure why this was changed.
>>>>>
>>>>>         Update: Looks like Serguei raised the issue and Coleen has 
>>>>> already
>>>>>         resolved it.
>>>>
>>>> Yes.
>>>>>
>>>>> src/hotspot/share/prims/jvmtiImpl.hpp
>>>>>     old L494:     QueueNode(const JvmtiDeferredEvent& event)
>>>>>     new L498:     QueueNode(JvmtiDeferredEvent& event)
>>>>>         Why was this changed?
>>>>>
>>>>>         Update: Not clear if this was covered by Coleen's reply to 
>>>>> Serguei.
>>>>>
>>>>>     old L497:     const JvmtiDeferredEvent& event() const { return 
>>>>> _event; }
>>>>>     new L501:     JvmtiDeferredEvent& event() { return _event; }
>>>>>         Why was this changed?
>>>>>
>>>>>         Update: Coleen's reply to Serguei explained this. Perhaps add:
>>>>>                   // Not const because of oops_do() and nmethods_do().
>>>>>
>>>>>     old L509:   static void enqueue(const JvmtiDeferredEvent& 
>>>>> event) NOT_JVMTI_RETURN;
>>>>>     new L513:   static void enqueue(JvmtiDeferredEvent event) 
>>>>> NOT_JVMTI_RETURN;
>>>>>         Why was this changed?
>>>>>
>>>>>         Update: Looks like Serguei raised the issue and Coleen has 
>>>>> already
>>>>>         resolved it.
>>>>
>>>> Yes, I fixed these.
>>>>>
>>>>> src/hotspot/share/runtime/mutexLocker.cpp
>>>>>     This change is going to require some testing to make sure we don't
>>>>>     have any new deadlock scenarios.
>>>>
>>>> Luckily, I've previously added an implicit NoSafepointVerifier to 
>>>> locks that are _allow_vm_block = true, like this one.
>>>> + def(JmethodIdCreation_lock , PaddedMutex , leaf, true, 
>>>> _safepoint_check_never); // used for creating jmethodIDs.
>>>> which prevents one class of deadlock. If we take out another lock 
>>>> with a higher rank, we'll get the ranking assert.
>>>>
>>>> This lock prevents insertion into an array, and has little outside 
>>>> calls.
>>>>
>>>> I'm running tests in tier 1-6 but any code that travels through 
>>>> this should get these assertion checks, rather than deadlocking.
>>>>
>>>>>
>>>>> src/hotspot/share/runtime/serviceThread.cpp
>>>>>     L50 - nit - why the extra blank line?
>>>>
>>>> To separate static data member definitions from functions. I 
>>>> removed it.
>>>>>
>>>>> src/hotspot/share/runtime/serviceThread.hpp
>>>>>     Thanks for cleaning up the static:
>>>>>
>>>>>       ServiceThread::is_service_thread(Thread* thread)
>>>>>
>>>>>     stuff. Having it be different than the other threads was
>>>>>     a bit jarring.
>>>>>
>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>     No comments.
>>>>>
>>>>> Thumbs up. My only comments are nits so I don't need to see a
>>>>> new webrev if you decide to fix them.
>>>>
>>>> So it turns out that in stress testing my fix 
>>>> forhttps://bugs.openjdk.java.net/browse/JDK-8212160
>>>>
>>>> Because I was in the area and thought this was a duplicate of that 
>>>> bug (it is not).   I found that calling oops_do and nmethods_do the 
>>>> ServiceThread  needs to hold the Service_lock, because other 
>>>> threads can be adding things to the global queue while the sweeper 
>>>> thread is calling this in a handshake.
>>>>
>>>> I am now retesting this change with the changes above, and with the 
>>>> Service_lock.   So far my stress tests for JDK-81212160 and the 
>>>> stress test for this bug pass, but I'm going to run through all the 
>>>> tiers 1-6 over the weekend.
>>>>
>>>> Please have a look at the changes in the meantime.
>>>>
>>>> http://cr.openjdk.java.net/~coleenp/2019/8173361.04.incr/webrev
>>>> http://cr.openjdk.java.net/~coleenp/2019/8173361.04/webrev
>>>>
>>>> Thanks,
>>>> Coleen
>>>>>
>>>>> Dan
>>>>>
>>>>>>
>>>>>> Ran the test that failed 100 times without failure, tier1 on 
>>>>>> Oracle supported platforms, and tier2-3 including jvmti and jdi 
>>>>>> tests locally.
>>>>>>
>>>>>> See bug for more details about the crash.
>>>>>>
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8173361
>>>>>>
>>>>>> Thanks,
>>>>>> Coleen
>>>>>>
>>>>>> On 11/18/19 10:09 PM, coleen.phillimore at oracle.com wrote:
>>>>>>>
>>>>>>> Hi Serguei,
>>>>>>>
>>>>>>> Sorry for not sending an update.  I talked to Erik and am 
>>>>>>> working on a version that keeps the nmethod from being unloaded 
>>>>>>> while it's in the deferred event queue, with a version that the 
>>>>>>> GC people will like, and I like.  I'm testing it out now.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Coleen
>>>>>>>
>>>>>>>
>>>>>>> On 11/18/19 10:03 PM, serguei.spitsyn at oracle.com wrote:
>>>>>>>> Hi Coleen,
>>>>>>>>
>>>>>>>> Sorry for the latency, I had to investigate it a little bit.
>>>>>>>> I still have some doubt your fix is right thing to do.
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/16/19 04:55, coleen.phillimore at oracle.com wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 11/15/19 11:17 PM, serguei.spitsyn at oracle.com wrote:
>>>>>>>>>> Hi Coleen,
>>>>>>>>>>
>>>>>>>>>> On 11/15/19 2:12 PM, coleen.phillimore at oracle.com wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi, I've been working on answers to these questions, so I'll 
>>>>>>>>>>> start with this one.
>>>>>>>>>>>
>>>>>>>>>>> The nmethodLocker keeps the nmethod from being reclaimed 
>>>>>>>>>>> (made_zombie or memory released) by the sweeper, but the 
>>>>>>>>>>> nmethod could be unloaded.  Unloading the nmethod clears the 
>>>>>>>>>>> Method* _method field.
>>>>>>>>>>
>>>>>>>>>> Yes, I see it is done in the nmethod::make_unloaded().
>>>>>>>>>>
>>>>>>>>>>> The post_compiled_method_load event needs the _method field 
>>>>>>>>>>> to look at things like inlining and ScopeDesc fields.   If 
>>>>>>>>>>> the nmethod is unloaded, some of the oops are dead.  There 
>>>>>>>>>>> are "holder" oops that correspond to the metadata in the 
>>>>>>>>>>> nmethod.  If these oops are dead, causing the nmethod to get 
>>>>>>>>>>> unloaded, then the metadata may not be valid.
>>>>>>>>>>>
>>>>>>>>>>> So my change 02 looks for a NULL nmethod._method field to 
>>>>>>>>>>> tell whether we can post information about the nmethod.
>>>>>>>>>>>
>>>>>>>>>>> There's code in nmethod.cpp like:
>>>>>>>>>>>
>>>>>>>>>>> jmethodID nmethod::get_and_cache_jmethod_id() {
>>>>>>>>>>>   if (_jmethod_id == NULL) {
>>>>>>>>>>>     // Cache the jmethod_id since it can no longer be looked 
>>>>>>>>>>> up once the
>>>>>>>>>>>     // method itself has been marked for unloading.
>>>>>>>>>>>     _jmethod_id = method()->jmethod_id();
>>>>>>>>>>>   }
>>>>>>>>>>>   return _jmethod_id;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Which was added when post_method_load and unload were turned 
>>>>>>>>>>> into deferred events.
>>>>>>>>>>
>>>>>>>>>> Could we cache the jmethodID in the 
>>>>>>>>>> JvmtiDeferredEvent::compiled_method_load_event
>>>>>>>>>> similarly as we do in the 
>>>>>>>>>> JvmtiDeferredEvent::compiled_method_unload_event?
>>>>>>>>>> This would help to get rid of the dependency on the 
>>>>>>>>>> nmethod::_method.
>>>>>>>>>> Do we depend on any other nmethod fields?
>>>>>>>>>
>>>>>>>>> Yes, there are other nmethod metadata that we rely on to print 
>>>>>>>>> inline information, and this function 
>>>>>>>>> JvmtiCodeBlobEvents::build_jvmti_addr_location_map because it 
>>>>>>>>> uses the ScopeDesc data in the nmethod.
>>>>>>>>
>>>>>>>> One possible approach is to prepare and cache all this information
>>>>>>>> in the nmethod::post_compiled_method_load_event() before the
>>>>>>>> JvmtiDeferredEvent::compiled_method_load_event() is called.
>>>>>>>> The event parameters are:
>>>>>>>> typedef struct {
>>>>>>>>      const void* start_address;
>>>>>>>>      jlocation location;
>>>>>>>> } jvmtiAddrLocationMap;
>>>>>>>> CompiledMethodLoad(jvmtiEnv *jvmti_env,
>>>>>>>>              jmethodID method,
>>>>>>>>              jint code_size,
>>>>>>>>              const void* code_addr,
>>>>>>>>              jint map_length,
>>>>>>>>              const jvmtiAddrLocationMap* map,
>>>>>>>>              const void* compile_info)
>>>>>>>> Some of these addresses above could be not accessible when an 
>>>>>>>> event is posted.
>>>>>>>> Not sure yet if it is Okay.
>>>>>>>> The question is if this kind of refactoring is worth and right 
>>>>>>>> thing to do.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> We do cache the jmethodID but that's not good enough.  See my 
>>>>>>>>> last comment in the bug report. The jmethodID can point to an 
>>>>>>>>> unloaded method.
>>>>>>>>
>>>>>>>> This looks like it is done a little bit late.
>>>>>>>> It'd better to do it before the event is deferred (see above).
>>>>>>>>
>>>>>>>>> I tried a version of keeping the nmethod alive, but the GC 
>>>>>>>>> folks will hate it.  And it doesn't work and I hate it.
>>>>>>>>
>>>>>>>> From serviceability point of view this is the best and most 
>>>>>>>> consistent approach.
>>>>>>>> I seems to me, it was initially designed this way.
>>>>>>>> The downside is it adds some extra complexity to the GC.
>>>>>>>>
>>>>>>>>> My version 01 is the best, with the caveat that maybe it 
>>>>>>>>> should check for _method == NULL instead of 
>>>>>>>>> nmethod->is_alive().  I have to talk to Erik to see if there's 
>>>>>>>>> a race with concurrent class unloading.
>>>>>>>>>
>>>>>>>>> Any application that depends on a compiled method loading 
>>>>>>>>> event on a class that could be unloaded is a buggy 
>>>>>>>>> application.  Applications should not rely on when the JIT 
>>>>>>>>> compiler decides to compile a method!  This happens to us for 
>>>>>>>>> a stress test. Most applications will get most of their 
>>>>>>>>> compiled method loading events as they normally do.
>>>>>>>>
>>>>>>>> It is not an application that relies on the compiled method 
>>>>>>>> loading event.
>>>>>>>> It is about profiling tools to be able to get correct 
>>>>>>>> information about what is going on with compilations.
>>>>>>>> My concern is that if we skip such compiled method load events 
>>>>>>>> then profilers have no way
>>>>>>>> to find out there many unneeded compilations that are thrown 
>>>>>>>> away without any real use.
>>>>>>>> Also, it is not clear what happens with the subsequent compiled 
>>>>>>>> method unload events.
>>>>>>>> Are they going to be skipped as well or they can appear and 
>>>>>>>> confuse profilers?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Serguei
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Coleen
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Serguei
>>>>>>>>>>
>>>>>>>>>>> I put more debugging in the bug to show this crash was from 
>>>>>>>>>>> an unloaded nmethod.
>>>>>>>>>>>
>>>>>>>>>>> Coleen
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 11/15/19 4:45 PM, serguei.spitsyn at oracle.com wrote:
>>>>>>>>>>>> Hi Coleen,
>>>>>>>>>>>>
>>>>>>>>>>>> I have some questions.
>>>>>>>>>>>>
>>>>>>>>>>>> Both the compiler method load and unload are posted as 
>>>>>>>>>>>> deferred events.
>>>>>>>>>>>> Both events keep the nmethod alive until the ServiceThread 
>>>>>>>>>>>> processes the event.
>>>>>>>>>>>>
>>>>>>>>>>>> The implementation is:
>>>>>>>>>>>>
>>>>>>>>>>>> JvmtiDeferredEvent 
>>>>>>>>>>>> JvmtiDeferredEvent::compiled_method_load_event(nmethod* nm) {
>>>>>>>>>>>>   . . .
>>>>>>>>>>>>   // Keep the nmethod alive until the ServiceThread can process
>>>>>>>>>>>>   // this deferred event.
>>>>>>>>>>>>   nmethodLocker::lock_nmethod(nm);
>>>>>>>>>>>>   return event;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> JvmtiDeferredEvent 
>>>>>>>>>>>> JvmtiDeferredEvent::compiled_method_unload_event(nmethod* 
>>>>>>>>>>>> nm, jmethodID id, const void* code) {
>>>>>>>>>>>>   . . .
>>>>>>>>>>>>   // Keep the nmethod alive until the ServiceThread can process
>>>>>>>>>>>>   // this deferred event. This will keep the memory for the
>>>>>>>>>>>>   // generated code from being reused too early. We pass
>>>>>>>>>>>>   // zombie_ok == true here so that our nmethod that was just
>>>>>>>>>>>>   // made into a zombie can be locked.
>>>>>>>>>>>>   nmethodLocker::lock_nmethod(nm, true /* zombie_ok */);
>>>>>>>>>>>>   return event;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> void JvmtiDeferredEvent::post() {
>>>>>>>>>>>> assert(ServiceThread::is_service_thread(Thread::current()),
>>>>>>>>>>>>          "Service thread must post enqueued events");
>>>>>>>>>>>>   switch(_type) {
>>>>>>>>>>>>     case TYPE_COMPILED_METHOD_LOAD: {
>>>>>>>>>>>>       nmethod* nm = _event_data.compiled_method_load;
>>>>>>>>>>>> JvmtiExport::post_compiled_method_load(nm);
>>>>>>>>>>>>       // done with the deferred event so unlock the nmethod
>>>>>>>>>>>>       nmethodLocker::unlock_nmethod(nm);
>>>>>>>>>>>>       break;
>>>>>>>>>>>>     }
>>>>>>>>>>>>     case TYPE_COMPILED_METHOD_UNLOAD: {
>>>>>>>>>>>>       nmethod* nm = _event_data.compiled_method_unload.nm;
>>>>>>>>>>>> JvmtiExport::post_compiled_method_unload(
>>>>>>>>>>>> _event_data.compiled_method_unload.method_id,
>>>>>>>>>>>> _event_data.compiled_method_unload.code_begin);
>>>>>>>>>>>>       // done with the deferred event so unlock the nmethod
>>>>>>>>>>>>       nmethodLocker::unlock_nmethod(nm);
>>>>>>>>>>>>       break;
>>>>>>>>>>>>     }
>>>>>>>>>>>>     . . .
>>>>>>>>>>>>   }
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> Then I wonder how is it possible for the nmethod to be not 
>>>>>>>>>>>> alive here?:
>>>>>>>>>>>> 2168 void JvmtiExport::post_compiled_method_load(nmethod *nm) {
>>>>>>>>>>>> . . .
>>>>>>>>>>>> 2173 // It's not safe to look at metadata for unloaded methods.
>>>>>>>>>>>> 2174 if (!nm->is_alive()) {
>>>>>>>>>>>> 2175 return;
>>>>>>>>>>>> 2176 }
>>>>>>>>>>>> At least, it lokks like something else is broken.
>>>>>>>>>>>> Do I miss something important here?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Serguei
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/14/19 5:15 PM, coleen.phillimore at oracle.com wrote:
>>>>>>>>>>>>> Summary: Don't post information which uses metadata from 
>>>>>>>>>>>>> unloaded nmethods
>>>>>>>>>>>>>
>>>>>>>>>>>>> Tested tier1-3 and 100 times with test that failed 
>>>>>>>>>>>>> (reproduced failure without the fix).
>>>>>>>>>>>>>
>>>>>>>>>>>>> open webrev at 
>>>>>>>>>>>>> http://cr.openjdk.java.net/~coleenp/2019/8173361.01/webrev
>>>>>>>>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8173361
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Coleen
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/serviceability-dev/attachments/20191125/1e76bd37/attachment-0001.html>