RFR (S) 8173361: various crashes in JvmtiExport::post_compiled_method_load

Mon Dec 2 13:42:09 UTC 2019

Thanks Erik!
Coleen

On 11/25/19 9:37 AM, Erik Österlund wrote:
> Hi Coleen,
>
> Still good BTW!
>
> Thanks,
> /Erik
>
> On 2019-11-25 14:47, coleen.phillimore at oracle.com wrote:
>> Thanks for the code review, Serguei!
>> Coleen
>>
>> On 11/22/19 6:34 PM, serguei.spitsyn at oracle.com wrote:
>>> Hi Coleen,
>>>
>>> +1
>>>
>>> Thanks,
>>> Serguei
>>>
>>>
>>> On 11/22/19 14:53, Daniel D. Daugherty wrote:
>>>>> http://cr.openjdk.java.net/~coleenp/2019/8173361.04.incr/webrev
>>>>
>>>> src/hotspot/share/prims/jvmtiImpl.cpp
>>>>     No comments.
>>>>
>>>> src/hotspot/share/prims/jvmtiImpl.hpp
>>>>     No comments.
>>>>
>>>> src/hotspot/share/runtime/serviceThread.cpp
>>>>     No comments.
>>>>
>>>> Thumbs up.
>>>>
>>>> Dan
>>>>
>>>>
>>>> On 11/22/19 2:15 PM, coleen.phillimore at oracle.com wrote:
>>>>>
>>>>> Dan, Thank you for reviewing this!
>>>>>
>>>>> On 11/22/19 12:49 PM, Daniel D. Daugherty wrote:
>>>>>> Hi Coleen,
>>>>>>
>>>>>> Sorry for the delay in getting back to this re-review.
>>>>>>
>>>>>>
>>>>>> On 11/21/19 9:12 AM, coleen.phillimore at oracle.com wrote:
>>>>>>>
>>>>>>> Please review a new version of this change that keeps the 
>>>>>>> nmethod from being unloaded, after it is added to the deferred 
>>>>>>> event queue:
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~coleenp/2019/8173361.03/webrev/index.html
>>>>>>
>>>>>> src/hotspot/share/code/nmethod.cpp
>>>>>>     No comments.
>>>>>>
>>>>>> src/hotspot/share/oops/instanceKlass.cpp
>>>>>>     No comments.
>>>>>>
>>>>>> src/hotspot/share/prims/jvmtiExport.cpp
>>>>>>     No comments.
>>>>>>
>>>>>> src/hotspot/share/prims/jvmtiImpl.cpp
>>>>>>     Nice solution with the new oops_do() and nmethods_do() functions!
>>>>> Erik's insistance!
>>>>>>
>>>>>>     old L988: void JvmtiDeferredEventQueue::enqueue(const 
>>>>>> JvmtiDeferredEvent& event) {
>>>>>>     new L998: void 
>>>>>> JvmtiDeferredEventQueue::enqueue(JvmtiDeferredEvent event) {
>>>>>>         Not sure why this was changed.
>>>>>>
>>>>>>         Update: Looks like Serguei raised the issue and Coleen 
>>>>>> has already
>>>>>>         resolved it.
>>>>>
>>>>> Yes.
>>>>>>
>>>>>> src/hotspot/share/prims/jvmtiImpl.hpp
>>>>>>     old L494:     QueueNode(const JvmtiDeferredEvent& event)
>>>>>>     new L498:     QueueNode(JvmtiDeferredEvent& event)
>>>>>>         Why was this changed?
>>>>>>
>>>>>>         Update: Not clear if this was covered by Coleen's reply 
>>>>>> to Serguei.
>>>>>>
>>>>>>     old L497:     const JvmtiDeferredEvent& event() const { 
>>>>>> return _event; }
>>>>>>     new L501:     JvmtiDeferredEvent& event() { return _event; }
>>>>>>         Why was this changed?
>>>>>>
>>>>>>         Update: Coleen's reply to Serguei explained this. Perhaps 
>>>>>> add:
>>>>>>                   // Not const because of oops_do() and 
>>>>>> nmethods_do().
>>>>>>
>>>>>>     old L509:   static void enqueue(const JvmtiDeferredEvent& 
>>>>>> event) NOT_JVMTI_RETURN;
>>>>>>     new L513:   static void enqueue(JvmtiDeferredEvent event) 
>>>>>> NOT_JVMTI_RETURN;
>>>>>>         Why was this changed?
>>>>>>
>>>>>>         Update: Looks like Serguei raised the issue and Coleen 
>>>>>> has already
>>>>>>         resolved it.
>>>>>
>>>>> Yes, I fixed these.
>>>>>>
>>>>>> src/hotspot/share/runtime/mutexLocker.cpp
>>>>>>     This change is going to require some testing to make sure we 
>>>>>> don't
>>>>>>     have any new deadlock scenarios.
>>>>>
>>>>> Luckily, I've previously added an implicit NoSafepointVerifier to 
>>>>> locks that are _allow_vm_block = true, like this one.
>>>>> + def(JmethodIdCreation_lock , PaddedMutex , leaf, true, 
>>>>> _safepoint_check_never); // used for creating jmethodIDs.
>>>>> which prevents one class of deadlock. If we take out another lock 
>>>>> with a higher rank, we'll get the ranking assert.
>>>>>
>>>>> This lock prevents insertion into an array, and has little outside 
>>>>> calls.
>>>>>
>>>>> I'm running tests in tier 1-6 but any code that travels through 
>>>>> this should get these assertion checks, rather than deadlocking.
>>>>>
>>>>>>
>>>>>> src/hotspot/share/runtime/serviceThread.cpp
>>>>>>     L50 - nit - why the extra blank line?
>>>>>
>>>>> To separate static data member definitions from functions.  I 
>>>>> removed it.
>>>>>>
>>>>>> src/hotspot/share/runtime/serviceThread.hpp
>>>>>>     Thanks for cleaning up the static:
>>>>>>
>>>>>>       ServiceThread::is_service_thread(Thread* thread)
>>>>>>
>>>>>>     stuff. Having it be different than the other threads was
>>>>>>     a bit jarring.
>>>>>>
>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>     No comments.
>>>>>>
>>>>>> Thumbs up. My only comments are nits so I don't need to see a
>>>>>> new webrev if you decide to fix them.
>>>>>
>>>>> So it turns out that in stress testing my fix 
>>>>> forhttps://bugs.openjdk.java.net/browse/JDK-8212160
>>>>>
>>>>> Because I was in the area and thought this was a duplicate of that 
>>>>> bug (it is not).   I found that calling oops_do and nmethods_do 
>>>>> the ServiceThread  needs to hold the Service_lock, because other 
>>>>> threads can be adding things to the global queue while the sweeper 
>>>>> thread is calling this in a handshake.
>>>>>
>>>>> I am now retesting this change with the changes above, and with 
>>>>> the Service_lock.   So far my stress tests for JDK-81212160 and 
>>>>> the stress test for this bug pass, but I'm going to run through 
>>>>> all the tiers 1-6 over the weekend.
>>>>>
>>>>> Please have a look at the changes in the meantime.
>>>>>
>>>>> http://cr.openjdk.java.net/~coleenp/2019/8173361.04.incr/webrev
>>>>> http://cr.openjdk.java.net/~coleenp/2019/8173361.04/webrev
>>>>>
>>>>> Thanks,
>>>>> Coleen
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>>>
>>>>>>> Ran the test that failed 100 times without failure, tier1 on 
>>>>>>> Oracle supported platforms, and tier2-3 including jvmti and jdi 
>>>>>>> tests locally.
>>>>>>>
>>>>>>> See bug for more details about the crash.
>>>>>>>
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8173361
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Coleen
>>>>>>>
>>>>>>> On 11/18/19 10:09 PM, coleen.phillimore at oracle.com wrote:
>>>>>>>>
>>>>>>>> Hi Serguei,
>>>>>>>>
>>>>>>>> Sorry for not sending an update.  I talked to Erik and am 
>>>>>>>> working on a version that keeps the nmethod from being unloaded 
>>>>>>>> while it's in the deferred event queue, with a version that the 
>>>>>>>> GC people will like, and I like.  I'm testing it out now.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Coleen
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11/18/19 10:03 PM, serguei.spitsyn at oracle.com wrote:
>>>>>>>>> Hi Coleen,
>>>>>>>>>
>>>>>>>>> Sorry for the latency, I had to investigate it a little bit.
>>>>>>>>> I still have some doubt your fix is right thing to do.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 11/16/19 04:55, coleen.phillimore at oracle.com wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 11/15/19 11:17 PM, serguei.spitsyn at oracle.com wrote:
>>>>>>>>>>> Hi Coleen,
>>>>>>>>>>>
>>>>>>>>>>> On 11/15/19 2:12 PM, coleen.phillimore at oracle.com wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi, I've been working on answers to these questions, so 
>>>>>>>>>>>> I'll start with this one.
>>>>>>>>>>>>
>>>>>>>>>>>> The nmethodLocker keeps the nmethod from being reclaimed 
>>>>>>>>>>>> (made_zombie or memory released) by the sweeper, but the 
>>>>>>>>>>>> nmethod could be unloaded.  Unloading the nmethod clears 
>>>>>>>>>>>> the Method* _method field.
>>>>>>>>>>>
>>>>>>>>>>> Yes, I see it is done in the nmethod::make_unloaded().
>>>>>>>>>>>
>>>>>>>>>>>> The post_compiled_method_load event needs the _method field 
>>>>>>>>>>>> to look at things like inlining and ScopeDesc fields.   If 
>>>>>>>>>>>> the nmethod is unloaded, some of the oops are dead.  There 
>>>>>>>>>>>> are "holder" oops that correspond to the metadata in the 
>>>>>>>>>>>> nmethod. If these oops are dead, causing the nmethod to get 
>>>>>>>>>>>> unloaded, then the metadata may not be valid.
>>>>>>>>>>>>
>>>>>>>>>>>> So my change 02 looks for a NULL nmethod._method field to 
>>>>>>>>>>>> tell whether we can post information about the nmethod.
>>>>>>>>>>>>
>>>>>>>>>>>> There's code in nmethod.cpp like:
>>>>>>>>>>>>
>>>>>>>>>>>> jmethodID nmethod::get_and_cache_jmethod_id() {
>>>>>>>>>>>>   if (_jmethod_id == NULL) {
>>>>>>>>>>>>     // Cache the jmethod_id since it can no longer be 
>>>>>>>>>>>> looked up once the
>>>>>>>>>>>>     // method itself has been marked for unloading.
>>>>>>>>>>>>     _jmethod_id = method()->jmethod_id();
>>>>>>>>>>>>   }
>>>>>>>>>>>>   return _jmethod_id;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> Which was added when post_method_load and unload were 
>>>>>>>>>>>> turned into deferred events.
>>>>>>>>>>>
>>>>>>>>>>> Could we cache the jmethodID in the 
>>>>>>>>>>> JvmtiDeferredEvent::compiled_method_load_event
>>>>>>>>>>> similarly as we do in the 
>>>>>>>>>>> JvmtiDeferredEvent::compiled_method_unload_event?
>>>>>>>>>>> This would help to get rid of the dependency on the 
>>>>>>>>>>> nmethod::_method.
>>>>>>>>>>> Do we depend on any other nmethod fields?
>>>>>>>>>>
>>>>>>>>>> Yes, there are other nmethod metadata that we rely on to 
>>>>>>>>>> print inline information, and this function 
>>>>>>>>>> JvmtiCodeBlobEvents::build_jvmti_addr_location_map because it 
>>>>>>>>>> uses the ScopeDesc data in the nmethod.
>>>>>>>>>
>>>>>>>>> One possible approach is to prepare and cache all this information
>>>>>>>>> in the nmethod::post_compiled_method_load_event() before the
>>>>>>>>> JvmtiDeferredEvent::compiled_method_load_event() is called.
>>>>>>>>> The event parameters are:
>>>>>>>>> typedef struct {
>>>>>>>>>      const void* start_address;
>>>>>>>>>      jlocation location;
>>>>>>>>> } jvmtiAddrLocationMap;
>>>>>>>>> CompiledMethodLoad(jvmtiEnv *jvmti_env,
>>>>>>>>>              jmethodID method,
>>>>>>>>>              jint code_size,
>>>>>>>>>              const void* code_addr,
>>>>>>>>>              jint map_length,
>>>>>>>>>              const jvmtiAddrLocationMap* map,
>>>>>>>>>              const void* compile_info)
>>>>>>>>> Some of these addresses above could be not accessible when an 
>>>>>>>>> event is posted.
>>>>>>>>> Not sure yet if it is Okay.
>>>>>>>>> The question is if this kind of refactoring is worth and right 
>>>>>>>>> thing to do.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> We do cache the jmethodID but that's not good enough.  See my 
>>>>>>>>>> last comment in the bug report. The jmethodID can point to an 
>>>>>>>>>> unloaded method.
>>>>>>>>>
>>>>>>>>> This looks like it is done a little bit late.
>>>>>>>>> It'd better to do it before the event is deferred (see above).
>>>>>>>>>
>>>>>>>>>> I tried a version of keeping the nmethod alive, but the GC 
>>>>>>>>>> folks will hate it.  And it doesn't work and I hate it.
>>>>>>>>>
>>>>>>>>> From serviceability point of view this is the best and most 
>>>>>>>>> consistent approach.
>>>>>>>>> I seems to me, it was initially designed this way.
>>>>>>>>> The downside is it adds some extra complexity to the GC.
>>>>>>>>>
>>>>>>>>>> My version 01 is the best, with the caveat that maybe it 
>>>>>>>>>> should check for _method == NULL instead of 
>>>>>>>>>> nmethod->is_alive().  I have to talk to Erik to see if 
>>>>>>>>>> there's a race with concurrent class unloading.
>>>>>>>>>>
>>>>>>>>>> Any application that depends on a compiled method loading 
>>>>>>>>>> event on a class that could be unloaded is a buggy 
>>>>>>>>>> application.  Applications should not rely on when the JIT 
>>>>>>>>>> compiler decides to compile a method!  This happens to us for 
>>>>>>>>>> a stress test.  Most applications will get most of their 
>>>>>>>>>> compiled method loading events as they normally do.
>>>>>>>>>
>>>>>>>>> It is not an application that relies on the compiled method 
>>>>>>>>> loading event.
>>>>>>>>> It is about profiling tools to be able to get correct 
>>>>>>>>> information about what is going on with compilations.
>>>>>>>>> My concern is that if we skip such compiled method load events 
>>>>>>>>> then profilers have no way
>>>>>>>>> to find out there many unneeded compilations that are thrown 
>>>>>>>>> away without any real use.
>>>>>>>>> Also, it is not clear what happens with the subsequent 
>>>>>>>>> compiled method unload events.
>>>>>>>>> Are they going to be skipped as well or they can appear and 
>>>>>>>>> confuse profilers?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Serguei
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Coleen
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Serguei
>>>>>>>>>>>
>>>>>>>>>>>> I put more debugging in the bug to show this crash was from 
>>>>>>>>>>>> an unloaded nmethod.
>>>>>>>>>>>>
>>>>>>>>>>>> Coleen
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/15/19 4:45 PM, serguei.spitsyn at oracle.com wrote:
>>>>>>>>>>>>> Hi Coleen,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have some questions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Both the compiler method load and unload are posted as 
>>>>>>>>>>>>> deferred events.
>>>>>>>>>>>>> Both events keep the nmethod alive until the ServiceThread 
>>>>>>>>>>>>> processes the event.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The implementation is:
>>>>>>>>>>>>>
>>>>>>>>>>>>> JvmtiDeferredEvent 
>>>>>>>>>>>>> JvmtiDeferredEvent::compiled_method_load_event(nmethod* nm) {
>>>>>>>>>>>>>   . . .
>>>>>>>>>>>>>   // Keep the nmethod alive until the ServiceThread can 
>>>>>>>>>>>>> process
>>>>>>>>>>>>>   // this deferred event.
>>>>>>>>>>>>>   nmethodLocker::lock_nmethod(nm);
>>>>>>>>>>>>>   return event;
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> JvmtiDeferredEvent 
>>>>>>>>>>>>> JvmtiDeferredEvent::compiled_method_unload_event(nmethod* 
>>>>>>>>>>>>> nm, jmethodID id, const void* code) {
>>>>>>>>>>>>>   . . .
>>>>>>>>>>>>>   // Keep the nmethod alive until the ServiceThread can 
>>>>>>>>>>>>> process
>>>>>>>>>>>>>   // this deferred event. This will keep the memory for the
>>>>>>>>>>>>>   // generated code from being reused too early. We pass
>>>>>>>>>>>>>   // zombie_ok == true here so that our nmethod that was just
>>>>>>>>>>>>>   // made into a zombie can be locked.
>>>>>>>>>>>>>   nmethodLocker::lock_nmethod(nm, true /* zombie_ok */);
>>>>>>>>>>>>>   return event;
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> void JvmtiDeferredEvent::post() {
>>>>>>>>>>>>> assert(ServiceThread::is_service_thread(Thread::current()),
>>>>>>>>>>>>>          "Service thread must post enqueued events");
>>>>>>>>>>>>>   switch(_type) {
>>>>>>>>>>>>>     case TYPE_COMPILED_METHOD_LOAD: {
>>>>>>>>>>>>>       nmethod* nm = _event_data.compiled_method_load;
>>>>>>>>>>>>> JvmtiExport::post_compiled_method_load(nm);
>>>>>>>>>>>>>       // done with the deferred event so unlock the nmethod
>>>>>>>>>>>>>       nmethodLocker::unlock_nmethod(nm);
>>>>>>>>>>>>>       break;
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>     case TYPE_COMPILED_METHOD_UNLOAD: {
>>>>>>>>>>>>>       nmethod* nm = _event_data.compiled_method_unload.nm;
>>>>>>>>>>>>> JvmtiExport::post_compiled_method_unload(
>>>>>>>>>>>>> _event_data.compiled_method_unload.method_id,
>>>>>>>>>>>>> _event_data.compiled_method_unload.code_begin);
>>>>>>>>>>>>>       // done with the deferred event so unlock the nmethod
>>>>>>>>>>>>>       nmethodLocker::unlock_nmethod(nm);
>>>>>>>>>>>>>       break;
>>>>>>>>>>>>>     }
>>>>>>>>>>>>>     . . .
>>>>>>>>>>>>>   }
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then I wonder how is it possible for the nmethod to be not 
>>>>>>>>>>>>> alive here?:
>>>>>>>>>>>>> 2168 void JvmtiExport::post_compiled_method_load(nmethod *nm) {
>>>>>>>>>>>>> . . .
>>>>>>>>>>>>> 2173 // It's not safe to look at metadata for unloaded 
>>>>>>>>>>>>> methods.
>>>>>>>>>>>>> 2174 if (!nm->is_alive()) {
>>>>>>>>>>>>> 2175 return;
>>>>>>>>>>>>> 2176 }
>>>>>>>>>>>>> At least, it lokks like something else is broken.
>>>>>>>>>>>>> Do I miss something important here?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Serguei
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 11/14/19 5:15 PM, coleen.phillimore at oracle.com wrote:
>>>>>>>>>>>>>> Summary: Don't post information which uses metadata from 
>>>>>>>>>>>>>> unloaded nmethods
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Tested tier1-3 and 100 times with test that failed 
>>>>>>>>>>>>>> (reproduced failure without the fix).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> open webrev at 
>>>>>>>>>>>>>> http://cr.openjdk.java.net/~coleenp/2019/8173361.01/webrev
>>>>>>>>>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8173361
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Coleen
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/serviceability-dev/attachments/20191202/a06566d1/attachment-0001.html>