RFR (M) 8212160: JVMTI agent crashes with "assert(_value != 0LL) failed: resolving NULL _value"
serguei.spitsyn at oracle.com
Wed Dec 4 19:17:09 UTC 2019
On 12/4/19 04:21, coleen.phillimore at oracle.com wrote:
>
>
> On 12/3/19 11:39 PM, David Holmes wrote:
>>
>>
>> On 3/12/2019 11:35 pm, coleen.phillimore at oracle.com wrote:
>>>
>>>
>>> On 12/3/19 8:31 AM, David Holmes wrote:
>>>> On 3/12/2019 11:08 pm, coleen.phillimore at oracle.com wrote:
>>>>>
>>>>>
>>>>> On 12/2/19 11:52 PM, David Holmes wrote:
>>>>>> Hi Coleen,
>>>>>>
>>>>>> On 3/12/2019 12:43 am, coleen.phillimore at oracle.com wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 11/26/19 7:03 PM, David Holmes wrote:
>>>>>>>> (adding runtime as well)
>>>>>>>>
>>>>>>>> Hi Coleen,
>>>>>>>>
>>>>>>>> On 27/11/2019 12:22 am, coleen.phillimore at oracle.com wrote:
>>>>>>>>> Summary: Add local deferred event list to thread to post
>>>>>>>>> events outside CodeCache_lock.
>>>>>>>>>
>>>>>>>>> This patch builds on the patch for JDK-8173361. With this
>>>>>>>>> patch, I made the JvmtiDeferredEventQueue an instance class
>>>>>>>>> (not AllStatic) so that each thread can have one. The CodeBlob
>>>>>>>>> event code, which used to drop the CodeCache_lock and race with
>>>>>>>>> the sweeper thread, now adds the events it wants to post to its
>>>>>>>>> thread-local list and processes that list outside the lock. The
>>>>>>>>> list is walked in GC and by the sweeper to keep the nmethods
>>>>>>>>> from being unloaded and zombied, respectively.
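
For illustration, here is a minimal standalone sketch of the flow
described above. The lock, the queue type, and the function names are
stand-ins, not the actual patch code:

    // Sketch: queue events while holding the lock, post them after
    // dropping it, on the same thread that generated them.
    #include <iostream>
    #include <mutex>
    #include <queue>

    std::mutex code_cache_lock;             // stands in for CodeCache_lock
    thread_local std::queue<int> deferred;  // stands in for the per-JavaThread
                                            // JvmtiDeferredEventQueue

    void post_event(int e) { std::cout << "post event " << e << "\n"; }

    void generate_compiled_method_load_events() {
        {
            std::lock_guard<std::mutex> guard(code_cache_lock);
            // While the lock is held we must not call out to the agent,
            // so the events are queued on the current thread instead.
            deferred.push(1);
            deferred.push(2);
        }
        // Lock dropped: the same thread posts its own queued events
        // rather than handing them to the ServiceThread.
        while (!deferred.empty()) {
            post_event(deferred.front());
            deferred.pop();
        }
    }

    int main() { generate_compiled_method_load_events(); }
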
>>>>>>>>
>>>>>>>> Sorry, I don't understand why we would want/need a deferred
>>>>>>>> event queue for every JavaThread? Isn't this only relevant for
>>>>>>>> non-JavaThreads that need to have the ServiceThread process the
>>>>>>>> deferred event?
>>>>>>>
>>>>>>> I thought I'd written this in the bug but I had only discussed
>>>>>>> this with Erik. I've added a comment to the bug to explain why
>>>>>>> I added the per-JavaThread queue. In order to process these
>>>>>>> events after the CodeCache_lock is dropped, I have to queue them
>>>>>>> somewhere safe. The ServiceThread queue is safe, *but* the
>>>>>>> ServiceThread can't keep up with the events, especially from
>>>>>>> this test case. So the test case gets a native OOM.
>>>>>>>
>>>>>>> So I've added the safe queue as a field to each JavaThread
>>>>>>> because multiple JavaThreads could be posting these events at
>>>>>>> the same time, and there didn't seem to be a better safe place
>>>>>>> to cache them, without adding another layer of queuing code.
>>>>>>
>>>>>> I think I'm getting the picture now. At the time the events are
>>>>>> generated we can't post them directly because the current thread
>>>>>> is inside compiler code. Hence the events must be deferred. Using
>>>>>> the ServiceThread to handle the deferred events is one way to
>>>>>> deal with this - but it can't keep up in this scenario. So
>>>>>> instead we store the events in the current thread and when the
>>>>>> current thread returns to code where it is safe to post the
>>>>>> events, it does so itself. Is that generally correct?
>>>>>
>>>>> Yes.
>>>>>>
>>>>>> I admit I'm not keen on adding this additional field per-thread
>>>>>> just for a temporary usage. Some kind of stack allocated helper
>>>>>> would be preferable, but would need to be passed through the call
>>>>>> chain so that the events could be added to it.
>>>>>
>>>>> Right, and the GC and nmethods_do have to find it somehow. It
>>>>> wasn't my first choice of where to put it either, because there
>>>>> are already too many things in JavaThread. Might be time for a
>>>>> future cleanup of Thread.
>>>>
>>>> I see.
>>>>
>>>>>>
>>>>>> Also I'm not clear why we aggressively delete the
>>>>>> _jvmti_event_queue after posting the events. I'd be worried about
>>>>>> the overhead we are introducing for creating and deleting this
>>>>>> queue. When the JvmtiDeferredEventQueue data structure was
>>>>>> intended only for use by the ServiceThread its dynamic node
>>>>>> allocation may have made more sense. But now that seems like a
>>>>>> liability to me - if JvmtiDeferredEvents could be linked directly
>>>>>> we wouldn't need dynamic nodes, nor dynamic per-thread queues
>>>>>> (just a per-thread pointer).
>>>>>
>>>>> I'm not following. The queue is for multiple events that might be
>>>>> posted while holding the CodeCache_lock, so they need to be kept
>>>>> in order and linked together. While we take them off and post
>>>>> them, if the callback safepoints (it may call back into the JVM),
>>>>> we don't want GC or nmethods_do to walk an event that has already
>>>>> been posted. So a queue seems to make sense.
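
A minimal sketch of that dequeue-before-post discipline, with stand-in
types and names rather than the HotSpot code:

    // Sketch: each event is removed from the queue before its callback
    // runs, so a walk of the queue during the callback (which may
    // safepoint) only sees events that still need to be posted.
    #include <deque>
    #include <iostream>

    struct Event { int id; };
    std::deque<Event> queue;   // stands in for the thread-local event queue

    void oops_do_walk() {
        // What a GC or nmethods_do walk sees mid-posting: only the
        // events that have not been posted yet.
        for (const Event& e : queue)
            std::cout << "  keep alive for event " << e.id << "\n";
    }

    void post_all() {
        while (!queue.empty()) {
            Event e = queue.front();
            queue.pop_front();   // removed from the queue first ...
            oops_do_walk();      // ... so a safepoint here skips 'e'
            std::cout << "posted event " << e.id << "\n";
        }
    }

    int main() {
        queue.push_back(Event{1});
        queue.push_back(Event{2});
        post_all();
    }
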
>>>>
>>>> Yes but you can make a queue just by having each event have a _next
>>>> pointer, rather than dynamically creating nodes to hold the event.
>>>> Each event is its own queue node implicitly.
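
Roughly, the intrusive variant being suggested might look like this
(made-up names; a sketch rather than a proposed patch):

    // Sketch: each event carries its own _next link, so no separate
    // queue nodes need to be allocated or freed.
    #include <iostream>

    struct DeferredEvent {
        int payload;
        DeferredEvent* _next = nullptr;   // the event is its own queue node
    };

    struct IntrusiveQueue {
        DeferredEvent* _head = nullptr;
        DeferredEvent* _tail = nullptr;

        void enqueue(DeferredEvent* e) {
            if (_tail == nullptr) { _head = e; } else { _tail->_next = e; }
            _tail = e;
        }

        DeferredEvent* dequeue() {
            DeferredEvent* e = _head;
            if (e != nullptr) {
                _head = e->_next;
                if (_head == nullptr) _tail = nullptr;  // drains back to NULL
                e->_next = nullptr;
            }
            return e;
        }
    };

    int main() {
        IntrusiveQueue q;
        DeferredEvent a{1}, b{2};
        q.enqueue(&a);
        q.enqueue(&b);
        while (DeferredEvent* e = q.dequeue())
            std::cout << e->payload << "\n";
    }

As David notes below, once the last event is dequeued the head pointer
is NULL again, with no explicit deletion needed.
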
>>>>
>>>>> One thing that I experimented with was to have the ServiceThread
>>>>> take ownership of the queue, moving it onto its own thread-local
>>>>> queue and posting all the events; that could be a future
>>>>> enhancement. It didn't help my OOM situation.
>>>>
>>>> Your OOM situation seems to be a basic case of overwhelming the
>>>> ServiceThread. A single ServiceThread will always have a limit on
>>>> how many events it can handle. Maybe this test is being
>>>> unrealistic in its expectations of the current design?
>>>
>>> I think the JVMTI API where you can generate a COMPILED_METHOD_LOAD
>>> event for every method in the code cache is going to be overwhelming
>>> unless it waits for the events to be posted.
>>
>> Taking things off the service thread would seem to be a good thing
>> then :)
>>
>>>>
>>>>> Deleting the queue after all the events are posted means that
>>>>> JavaThread::oops_do and nmethods_do need only a null check to deal
>>>>> with this JVMTI wart.
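
The shape being described is roughly this, with hypothetical field and
method names:

    // Sketch: when no events are pending the pointer is NULL, so the
    // GC and sweeper walks pay only a null check.
    struct OopClosure;   // opaque here, as in HotSpot

    struct JvmtiDeferredEventQueue {
        void oops_do(OopClosure* f) { (void)f; /* walk queued events */ }
    };

    struct JavaThread {
        // NULL except while this thread has deferred events pending.
        JvmtiDeferredEventQueue* _jvmti_event_queue = nullptr;

        void oops_do(OopClosure* f) {
            // ... other thread roots ...
            if (_jvmti_event_queue != nullptr) {  // common case skips this
                _jvmti_event_queue->oops_do(f);
            }
        }
    };

    int main() {
        JavaThread t;
        t.oops_do(nullptr);   // no queue: just the null check
    }
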
>>>>
>>>> If the nodes are not dynamically allocated you don't need to
>>>> delete anything; you just set the queue-head pointer to NULL -
>>>> actually it will already be NULL once the last event has been
>>>> processed.
>>>
>>> I could revisit the data structure as a future RFE. The goal was to
>>> reuse code that's already there, and I don't think there's a
>>> significant difference in performance. I did some measurements of
>>> the stress case and the times were equivalent, actually slightly
>>> better with the new code.
>>
>> Okay.
>
> Is this a code review then? I think Serguei promised to review the
> code too.
Yes, I'm close to sending my review.
Sorry for the latency.
Thanks,
Serguei
>
> thanks,
> Coleen
>>
>> Thanks,
>> David
>>
>>>
>>> Thanks,
>>> Coleen
>>>>
>>>> David
>>>> -----
>>>>
>>>>> Thanks,
>>>>> Coleen
>>>>>>
>>>>>> Just some thoughts.
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>>> I did write comments to this effect here:
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~coleenp/2019/8212160.01/webrev/src/hotspot/share/prims/jvmtiCodeBlobEvents.cpp.udiff.html
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Coleen
>>>>>>>
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>>> Also, the jmethod_id field in nmethod was only used as a
>>>>>>>>> boolean, so don't create a jmethod_id until it is needed for
>>>>>>>>> post_compiled_method_unload.
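
A rough sketch of that lazy-creation idea, with stand-in types and
names rather than the real nmethod code:

    // Sketch: keep a flag instead of an eagerly created jmethod_id, and
    // only create the id when the unload event is actually posted.
    #include <iostream>

    typedef void* jmethodID;          // stand-in for the JNI type

    jmethodID make_jmethod_id() {     // stand-in for creating the real id
        static int dummy;
        return &dummy;
    }

    struct nmethod {
        bool _load_reported = false;  // was: a jmethodID used as a boolean

        void post_compiled_method_load() {
            _load_reported = true;
            // ... post the load event ...
        }

        void post_compiled_method_unload() {
            if (!_load_reported) return;        // no load event was posted
            jmethodID mid = make_jmethod_id();  // created only when needed
            std::cout << "unload event, jmethodID " << mid << "\n";
        }
    };

    int main() {
        nmethod nm;
        nm.post_compiled_method_load();
        nm.post_compiled_method_unload();
    }
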
>>>>>>>>>
>>>>>>>>> Ran hs tier1-8 on linux-x64-debug and the stress test that
>>>>>>>>> crashed in the original bug report.
>>>>>>>>>
>>>>>>>>> open webrev at
>>>>>>>>> http://cr.openjdk.java.net/~coleenp/2019/8212160.01/webrev
>>>>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8212160
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Coleen
>>>>>>>
>>>>>
>>>
>