RFR (S) 8173361: various crashes in JvmtiExport::post_compiled_method_load
coleen.phillimore at oracle.com
coleen.phillimore at oracle.com
Fri Nov 22 23:16:13 UTC 2019
Thanks for the re-review, Dan!
Coleen
On 11/22/19 5:53 PM, Daniel D. Daugherty wrote:
>> http://cr.openjdk.java.net/~coleenp/2019/8173361.04.incr/webrev
>
> src/hotspot/share/prims/jvmtiImpl.cpp
> No comments.
>
> src/hotspot/share/prims/jvmtiImpl.hpp
> No comments.
>
> src/hotspot/share/runtime/serviceThread.cpp
> No comments.
>
> Thumbs up.
>
> Dan
>
>
> On 11/22/19 2:15 PM, coleen.phillimore at oracle.com wrote:
>>
>> Dan, Thank you for reviewing this!
>>
>> On 11/22/19 12:49 PM, Daniel D. Daugherty wrote:
>>> Hi Coleen,
>>>
>>> Sorry for the delay in getting back to this re-review.
>>>
>>>
>>> On 11/21/19 9:12 AM, coleen.phillimore at oracle.com wrote:
>>>>
>>>> Please review a new version of this change that keeps the nmethod
>>>> from being unloaded, after it is added to the deferred event queue:
>>>>
>>>> http://cr.openjdk.java.net/~coleenp/2019/8173361.03/webrev/index.html
>>>
>>> src/hotspot/share/code/nmethod.cpp
>>> No comments.
>>>
>>> src/hotspot/share/oops/instanceKlass.cpp
>>> No comments.
>>>
>>> src/hotspot/share/prims/jvmtiExport.cpp
>>> No comments.
>>>
>>> src/hotspot/share/prims/jvmtiImpl.cpp
>>> Nice solution with the new oops_do() and nmethods_do() functions!
>> Erik's insistance!
>>>
>>> old L988: void JvmtiDeferredEventQueue::enqueue(const
>>> JvmtiDeferredEvent& event) {
>>> new L998: void
>>> JvmtiDeferredEventQueue::enqueue(JvmtiDeferredEvent event) {
>>> Not sure why this was changed.
>>>
>>> Update: Looks like Serguei raised the issue and Coleen has
>>> already
>>> resolved it.
>>
>> Yes.
>>>
>>> src/hotspot/share/prims/jvmtiImpl.hpp
>>> old L494: QueueNode(const JvmtiDeferredEvent& event)
>>> new L498: QueueNode(JvmtiDeferredEvent& event)
>>> Why was this changed?
>>>
>>> Update: Not clear if this was covered by Coleen's reply to
>>> Serguei.
>>>
>>> old L497: const JvmtiDeferredEvent& event() const { return
>>> _event; }
>>> new L501: JvmtiDeferredEvent& event() { return _event; }
>>> Why was this changed?
>>>
>>> Update: Coleen's reply to Serguei explained this. Perhaps add:
>>> // Not const because of oops_do() and nmethods_do().
>>>
>>> old L509: static void enqueue(const JvmtiDeferredEvent& event)
>>> NOT_JVMTI_RETURN;
>>> new L513: static void enqueue(JvmtiDeferredEvent event)
>>> NOT_JVMTI_RETURN;
>>> Why was this changed?
>>>
>>> Update: Looks like Serguei raised the issue and Coleen has
>>> already
>>> resolved it.
>>
>> Yes, I fixed these.
>>>
>>> src/hotspot/share/runtime/mutexLocker.cpp
>>> This change is going to require some testing to make sure we don't
>>> have any new deadlock scenarios.
>>
>> Luckily, I've previously added an implicit NoSafepointVerifier to
>> locks that are _allow_vm_block = true, like this one.
>> + def(JmethodIdCreation_lock , PaddedMutex , leaf, true,
>> _safepoint_check_never); // used for creating jmethodIDs.
>> which prevents one class of deadlock. If we take out another lock
>> with a higher rank, we'll get the ranking assert.
>>
>> This lock prevents insertion into an array, and has little outside calls.
>>
>> I'm running tests in tier 1-6 but any code that travels through this
>> should get these assertion checks, rather than deadlocking.
>>
>>>
>>> src/hotspot/share/runtime/serviceThread.cpp
>>> L50 - nit - why the extra blank line?
>>
>> To separate static data member definitions from functions. I removed it.
>>>
>>> src/hotspot/share/runtime/serviceThread.hpp
>>> Thanks for cleaning up the static:
>>>
>>> ServiceThread::is_service_thread(Thread* thread)
>>>
>>> stuff. Having it be different than the other threads was
>>> a bit jarring.
>>>
>>> src/hotspot/share/runtime/thread.hpp
>>> No comments.
>>>
>>> Thumbs up. My only comments are nits so I don't need to see a
>>> new webrev if you decide to fix them.
>>
>> So it turns out that in stress testing my fix
>> forhttps://bugs.openjdk.java.net/browse/JDK-8212160
>>
>> Because I was in the area and thought this was a duplicate of that
>> bug (it is not). I found that calling oops_do and nmethods_do the
>> ServiceThread needs to hold the Service_lock, because other threads
>> can be adding things to the global queue while the sweeper thread is
>> calling this in a handshake.
>>
>> I am now retesting this change with the changes above, and with the
>> Service_lock. So far my stress tests for JDK-81212160 and the
>> stress test for this bug pass, but I'm going to run through all the
>> tiers 1-6 over the weekend.
>>
>> Please have a look at the changes in the meantime.
>>
>> http://cr.openjdk.java.net/~coleenp/2019/8173361.04.incr/webrev
>> http://cr.openjdk.java.net/~coleenp/2019/8173361.04/webrev
>>
>> Thanks,
>> Coleen
>>>
>>> Dan
>>>
>>>>
>>>> Ran the test that failed 100 times without failure, tier1 on Oracle
>>>> supported platforms, and tier2-3 including jvmti and jdi tests locally.
>>>>
>>>> See bug for more details about the crash.
>>>>
>>>> https://bugs.openjdk.java.net/browse/JDK-8173361
>>>>
>>>> Thanks,
>>>> Coleen
>>>>
>>>> On 11/18/19 10:09 PM, coleen.phillimore at oracle.com wrote:
>>>>>
>>>>> Hi Serguei,
>>>>>
>>>>> Sorry for not sending an update. I talked to Erik and am working
>>>>> on a version that keeps the nmethod from being unloaded while it's
>>>>> in the deferred event queue, with a version that the GC people
>>>>> will like, and I like. I'm testing it out now.
>>>>>
>>>>> Thanks!
>>>>> Coleen
>>>>>
>>>>>
>>>>> On 11/18/19 10:03 PM, serguei.spitsyn at oracle.com wrote:
>>>>>> Hi Coleen,
>>>>>>
>>>>>> Sorry for the latency, I had to investigate it a little bit.
>>>>>> I still have some doubt your fix is right thing to do.
>>>>>>
>>>>>>
>>>>>> On 11/16/19 04:55, coleen.phillimore at oracle.com wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 11/15/19 11:17 PM, serguei.spitsyn at oracle.com wrote:
>>>>>>>> Hi Coleen,
>>>>>>>>
>>>>>>>> On 11/15/19 2:12 PM, coleen.phillimore at oracle.com wrote:
>>>>>>>>>
>>>>>>>>> Hi, I've been working on answers to these questions, so I'll
>>>>>>>>> start with this one.
>>>>>>>>>
>>>>>>>>> The nmethodLocker keeps the nmethod from being reclaimed
>>>>>>>>> (made_zombie or memory released) by the sweeper, but the
>>>>>>>>> nmethod could be unloaded. Unloading the nmethod clears the
>>>>>>>>> Method* _method field.
>>>>>>>>
>>>>>>>> Yes, I see it is done in the nmethod::make_unloaded().
>>>>>>>>
>>>>>>>>> The post_compiled_method_load event needs the _method field to
>>>>>>>>> look at things like inlining and ScopeDesc fields. If the
>>>>>>>>> nmethod is unloaded, some of the oops are dead. There are
>>>>>>>>> "holder" oops that correspond to the metadata in the nmethod.
>>>>>>>>> If these oops are dead, causing the nmethod to get unloaded,
>>>>>>>>> then the metadata may not be valid.
>>>>>>>>>
>>>>>>>>> So my change 02 looks for a NULL nmethod._method field to tell
>>>>>>>>> whether we can post information about the nmethod.
>>>>>>>>>
>>>>>>>>> There's code in nmethod.cpp like:
>>>>>>>>>
>>>>>>>>> jmethodID nmethod::get_and_cache_jmethod_id() {
>>>>>>>>> if (_jmethod_id == NULL) {
>>>>>>>>> // Cache the jmethod_id since it can no longer be looked
>>>>>>>>> up once the
>>>>>>>>> // method itself has been marked for unloading.
>>>>>>>>> _jmethod_id = method()->jmethod_id();
>>>>>>>>> }
>>>>>>>>> return _jmethod_id;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Which was added when post_method_load and unload were turned
>>>>>>>>> into deferred events.
>>>>>>>>
>>>>>>>> Could we cache the jmethodID in the
>>>>>>>> JvmtiDeferredEvent::compiled_method_load_event
>>>>>>>> similarly as we do in the
>>>>>>>> JvmtiDeferredEvent::compiled_method_unload_event?
>>>>>>>> This would help to get rid of the dependency on the
>>>>>>>> nmethod::_method.
>>>>>>>> Do we depend on any other nmethod fields?
>>>>>>>
>>>>>>> Yes, there are other nmethod metadata that we rely on to print
>>>>>>> inline information, and this function
>>>>>>> JvmtiCodeBlobEvents::build_jvmti_addr_location_map because it
>>>>>>> uses the ScopeDesc data in the nmethod.
>>>>>>
>>>>>> One possible approach is to prepare and cache all this information
>>>>>> in the nmethod::post_compiled_method_load_event() before the
>>>>>> JvmtiDeferredEvent::compiled_method_load_event() is called.
>>>>>> The event parameters are:
>>>>>> typedef struct {
>>>>>> const void* start_address;
>>>>>> jlocation location;
>>>>>> } jvmtiAddrLocationMap;
>>>>>> CompiledMethodLoad(jvmtiEnv *jvmti_env,
>>>>>> jmethodID method,
>>>>>> jint code_size,
>>>>>> const void* code_addr,
>>>>>> jint map_length,
>>>>>> const jvmtiAddrLocationMap* map,
>>>>>> const void* compile_info)
>>>>>> Some of these addresses above could be not accessible when an
>>>>>> event is posted.
>>>>>> Not sure yet if it is Okay.
>>>>>> The question is if this kind of refactoring is worth and right
>>>>>> thing to do.
>>>>>>
>>>>>>>
>>>>>>> We do cache the jmethodID but that's not good enough. See my
>>>>>>> last comment in the bug report. The jmethodID can point to an
>>>>>>> unloaded method.
>>>>>>
>>>>>> This looks like it is done a little bit late.
>>>>>> It'd better to do it before the event is deferred (see above).
>>>>>>
>>>>>>> I tried a version of keeping the nmethod alive, but the GC folks
>>>>>>> will hate it. And it doesn't work and I hate it.
>>>>>>
>>>>>> From serviceability point of view this is the best and most
>>>>>> consistent approach.
>>>>>> I seems to me, it was initially designed this way.
>>>>>> The downside is it adds some extra complexity to the GC.
>>>>>>
>>>>>>> My version 01 is the best, with the caveat that maybe it should
>>>>>>> check for _method == NULL instead of nmethod->is_alive(). I
>>>>>>> have to talk to Erik to see if there's a race with concurrent
>>>>>>> class unloading.
>>>>>>>
>>>>>>> Any application that depends on a compiled method loading event
>>>>>>> on a class that could be unloaded is a buggy application.
>>>>>>> Applications should not rely on when the JIT compiler decides to
>>>>>>> compile a method! This happens to us for a stress test. Most
>>>>>>> applications will get most of their compiled method loading
>>>>>>> events as they normally do.
>>>>>>
>>>>>> It is not an application that relies on the compiled method
>>>>>> loading event.
>>>>>> It is about profiling tools to be able to get correct information
>>>>>> about what is going on with compilations.
>>>>>> My concern is that if we skip such compiled method load events
>>>>>> then profilers have no way
>>>>>> to find out there many unneeded compilations that are thrown away
>>>>>> without any real use.
>>>>>> Also, it is not clear what happens with the subsequent compiled
>>>>>> method unload events.
>>>>>> Are they going to be skipped as well or they can appear and
>>>>>> confuse profilers?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Serguei
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Coleen
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Serguei
>>>>>>>>
>>>>>>>>> I put more debugging in the bug to show this crash was from an
>>>>>>>>> unloaded nmethod.
>>>>>>>>>
>>>>>>>>> Coleen
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 11/15/19 4:45 PM, serguei.spitsyn at oracle.com wrote:
>>>>>>>>>> Hi Coleen,
>>>>>>>>>>
>>>>>>>>>> I have some questions.
>>>>>>>>>>
>>>>>>>>>> Both the compiler method load and unload are posted as
>>>>>>>>>> deferred events.
>>>>>>>>>> Both events keep the nmethod alive until the ServiceThread
>>>>>>>>>> processes the event.
>>>>>>>>>>
>>>>>>>>>> The implementation is:
>>>>>>>>>>
>>>>>>>>>> JvmtiDeferredEvent
>>>>>>>>>> JvmtiDeferredEvent::compiled_method_load_event(nmethod* nm) {
>>>>>>>>>> . . .
>>>>>>>>>> // Keep the nmethod alive until the ServiceThread can process
>>>>>>>>>> // this deferred event.
>>>>>>>>>> nmethodLocker::lock_nmethod(nm);
>>>>>>>>>> return event;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> JvmtiDeferredEvent
>>>>>>>>>> JvmtiDeferredEvent::compiled_method_unload_event(nmethod* nm,
>>>>>>>>>> jmethodID id, const void* code) {
>>>>>>>>>> . . .
>>>>>>>>>> // Keep the nmethod alive until the ServiceThread can process
>>>>>>>>>> // this deferred event. This will keep the memory for the
>>>>>>>>>> // generated code from being reused too early. We pass
>>>>>>>>>> // zombie_ok == true here so that our nmethod that was just
>>>>>>>>>> // made into a zombie can be locked.
>>>>>>>>>> nmethodLocker::lock_nmethod(nm, true /* zombie_ok */);
>>>>>>>>>> return event;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> void JvmtiDeferredEvent::post() {
>>>>>>>>>> assert(ServiceThread::is_service_thread(Thread::current()),
>>>>>>>>>> "Service thread must post enqueued events");
>>>>>>>>>> switch(_type) {
>>>>>>>>>> case TYPE_COMPILED_METHOD_LOAD: {
>>>>>>>>>> nmethod* nm = _event_data.compiled_method_load;
>>>>>>>>>> JvmtiExport::post_compiled_method_load(nm);
>>>>>>>>>> // done with the deferred event so unlock the nmethod
>>>>>>>>>> nmethodLocker::unlock_nmethod(nm);
>>>>>>>>>> break;
>>>>>>>>>> }
>>>>>>>>>> case TYPE_COMPILED_METHOD_UNLOAD: {
>>>>>>>>>> nmethod* nm = _event_data.compiled_method_unload.nm;
>>>>>>>>>> JvmtiExport::post_compiled_method_unload(
>>>>>>>>>> _event_data.compiled_method_unload.method_id,
>>>>>>>>>> _event_data.compiled_method_unload.code_begin);
>>>>>>>>>> // done with the deferred event so unlock the nmethod
>>>>>>>>>> nmethodLocker::unlock_nmethod(nm);
>>>>>>>>>> break;
>>>>>>>>>> }
>>>>>>>>>> . . .
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Then I wonder how is it possible for the nmethod to be not
>>>>>>>>>> alive here?:
>>>>>>>>>> 2168 void JvmtiExport::post_compiled_method_load(nmethod *nm) {
>>>>>>>>>> . . .
>>>>>>>>>> 2173 // It's not safe to look at metadata for unloaded methods.
>>>>>>>>>> 2174 if (!nm->is_alive()) {
>>>>>>>>>> 2175 return;
>>>>>>>>>> 2176 }
>>>>>>>>>> At least, it lokks like something else is broken.
>>>>>>>>>> Do I miss something important here?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Serguei
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 11/14/19 5:15 PM, coleen.phillimore at oracle.com wrote:
>>>>>>>>>>> Summary: Don't post information which uses metadata from
>>>>>>>>>>> unloaded nmethods
>>>>>>>>>>>
>>>>>>>>>>> Tested tier1-3 and 100 times with test that failed
>>>>>>>>>>> (reproduced failure without the fix).
>>>>>>>>>>>
>>>>>>>>>>> open webrev at
>>>>>>>>>>> http://cr.openjdk.java.net/~coleenp/2019/8173361.01/webrev
>>>>>>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8173361
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Coleen
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/serviceability-dev/attachments/20191122/b2ab2f76/attachment-0001.html>
More information about the serviceability-dev
mailing list