Stack walking performance issue

Mon Mar 25 09:33:46 UTC 2019

On 2019-03-25 10:32, Amir Hadadi wrote:
> The ResolvedMethod count output is from G1 + OpenJDK 11.

OK. Good to know.

Thanks,
StefanK

> ________________________________
> From: Stefan Karlsson <stefan.karlsson at oracle.com>
> Sent: Monday, March 25, 2019 10:53 AM
> To: Amir Hadadi; zgc-dev at openjdk.java.net
> Subject: Re: Stack walking performance issue
> 
> Hi Amir,
> 
> On 2019-03-25 09:38, Amir Hadadi wrote:
>> Hi Stefan,
>> I ran some of the tests you suggested.
> 
> Thanks for doing these tests.
> 
>> 1) G1 with -XX:-ClassUnloading shows the same issue as ZGC, top stack traces lead to java_lang_invoke_ResolvedMethodName::find_resolved_method(methodHandle const&, Thread*).
> 
> I think G1 will fill up the table, and at some point plateau when
> all/most methods have been recorded in the table. For ZGC, with the bug
> in JDK 11, we'll never reach a steady state and will forever keep on
> increasing the hash table.
> 
>> 2) With class unloading enabled, I see the following being printed:
>>
>> service.gc.0:[24304.545s][debug][membername,table] ResolvedMethod entries counted 94768 removed 92352
>> service.gc.0:[36944.749s][debug][membername,table] ResolvedMethod entries counted 64041 removed 61616
>> service.gc.0:[36944.749s][debug][membername,table] ResolvedMethod entries counted 64041 removed 61616
>> service.gc.3:[42177.904s][debug][membername,table] ResolvedMethod entries counted 19346 removed 16930
>> service.gc.1:[47231.731s][debug][membername,table] ResolvedMethod entries counted 20110 removed 17694
>> service.gc.1:[47231.731s][debug][membername,table] ResolvedMethod entries counted 20110 removed 17694
>> service.gc.3:[52267.616s][debug][membername,table] ResolvedMethod entries counted 27426 removed 25010
>> service.gc.1:[57078.463s][debug][membername,table] ResolvedMethod entries counted 30835 removed 28419
>> service.gc.3:[61907.835s][debug][membername,table] ResolvedMethod entries counted 32403 removed 29987
>> service.gc.2:[66916.223s][debug][membername,table] ResolvedMethod entries counted 34545 removed 32129
>> service.gc.2:[71786.212s][debug][membername,table] ResolvedMethod entries counted 39112 removed 36696
>> service.gc.2:[76619.460s][debug][membername,table] ResolvedMethod entries counted 34305 removed 31889
> 
> Is this output from G1 or ZGC + JDK 12?
> 
> This output shows that the table is too small. I've created a
> hotspot/runtime bug report for this issue:
> https://bugs.openjdk.java.net/browse/JDK-8221393
> 
> Thanks,
> StefanK
> 
> 
>>
>> ________________________________
>> From: Stefan Karlsson <stefan.karlsson at oracle.com>
>> Sent: Monday, March 18, 2019 1:33 PM
>> To: Amir Hadadi; zgc-dev at openjdk.java.net
>> Subject: Re: Stack walking performance issue
>>
>> On 2019-03-18 10:46, Amir Hadadi wrote:
>>> Thanks Stefan and Alexey!
>>> As the app is not loading more and more classes, I was not expecting not having class unloading to be an issue.
>>> Can you please explain what's the ResolvedMethodTable and what can cause it to fill up?
>>
>> I'm not too familiar with that code, but it looks like it creates an
>> entry in the table for each different java stack frame it encounters in
>> the stack walking code.
>>
>> There are comments about the ResolvedMethodName here:
>> http://hg.openjdk.java.net/jdk/jdk/file/5e1480a38a43/src/java.base/share/classes/java/lang/invoke/MemberName.java#l46
>>
>>> Can it keep filling up even when classes are not being constantly loaded?
>>
>> A GC that performs class unloading will actually do more cleaning work
>> than unloading classes. Cleaning ResolvedMethodTable is one. Cleaning
>> out unused entries in the StringTable is another.
>>
>> The ResolvedMethodTable holds "weak" pointers to ResolvedMethodNames
>> java objects. If a ResolvedNMethodName is not strongly reachable, the GC
>> will remove that entry (if class unloading is enabled). The
>> ResolvedMethodName is installed in MemberName java objects (see code
>> above). The stack frame code installs the MemberName object in
>> StackFrameInfo.memberName here:
>>
>> void java_lang_StackFrameInfo::set_method_and_bci(Handle stackFrame,
>> const methodHandle& method, int bci, TRAPS) {
>> ...
>>      // Create and install ResolvedMethodName in ResolvedMethodTable
>>      CallInfo info(method(), ik, CHECK);
>>      // Installs ResolveMethodName in MemberName
>>      MethodHandles::init_method_MemberName(mname, info);
>>
>> And from your stack trace, the StackFrameInfo objects gets fetched here:
>>
>>     java_lang_StackFrameInfo::set_method_and_bci(Handle, methodHandle
>> const&, int, Thread*)
>>     JavaFrameStream::fill_frame(int, objArrayHandle, methodHandle const&,
>> Thread*)
>>     StackWalk::fill_in_frames(long, BaseFrameStream&, int, int,
>> objArrayHandle, int&, Thread*)
>>     StackWalk::fetchNextBatch(Handle, long, long, int, int,
>> objArrayHandle, Thread*)
>>     JVM_MoreStackWalk
>>     java.lang.StackStreamFactory$AbstractStackWalker.fetchStackFrames
>>
>> I guess that these StackFrameInfos are used further down in that code,
>> and when they're not needed any more they will be dropped from the
>> object graph. When that happens there will be no other path to the
>> MemberName, and no path to the ResolvedMethodName, and the next class
>> unloading GC will clear the entry in the ResolvedMethodTable.
>>
>> In ZGC (in JDK 11), we actually do clean out the "weak" pointers in the
>> ResolvedMethodTable, but we do not remove the entries. Therefore, if you
>> do a lot of stack walking and throw away the StackFrameInfos, the
>> ResolvedMethodTable will fill up with cleared entries pointing to NULL.
>>
>> So, in short, using the stack walking API with the JDK 11 version of
>> ZGC, has the performance problem you've reported.
>>
>>>
>>> I will run the experiments you suggested and report back.
>>
>> Thanks!
>> StefanK
>>
>>> ________________________________
>>> From: Stefan Karlsson <stefan.karlsson at oracle.com>
>>> Sent: Monday, March 18, 2019 10:54 AM
>>> To: Amir Hadadi; zgc-dev at openjdk.java.net
>>> Subject: Re: Stack walking performance issue
>>>
>>> Hi Amir,
>>>
>>> Thanks for testing ZGC and reporting back this issue.
>>>
>>> The stack trace indicates that the JVM is performing lookups in the
>>> ResolvedMethodTable:
>>>
>>>      ResolvedMethodTable::lookup(int, unsigned int, Method*)
>>>      java_lang_invoke_ResolvedMethodName::find_resolved_method(methodHandle
>>> const&, Thread*)
>>> CallInfo::CallInfo(Method*, Klass*, Thread*)
>>>
>>> This table has a small, fixed size:
>>>
>>>       enum Constants {
>>>         _table_size  = 1007
>>>       };
>>>
>>> and it gets cleaned out during class unloading.
>>>
>>> ZGC doesn't have class unloading in JDK 11. In JDK 12, which is about to
>>> go GA tomorrow [1], we've implemented concurrent class unloading for
>>> ZGC. So, that version will clean out the ResolvedMethodTable.
>>>
>>> It would be good to verify that this is indeed the problem. Two things
>>> that could be tested:
>>>
>>> 1) Run with G1 and class unloading disabled with -XX:-ClassUnloading. If
>>> this is showing the same problem as ZGC, then we're on the right track.
>>>
>>> 2) Try with ZGC on JDK 12.
>>>
>>> When running with class unloading, you could also try to add
>>> membername+table=debug to the -Xlog flag. For example:
>>>
>>>      -Xlog:gc*,membername+table=debug
>>>
>>> This will print the following line every time the GC cleans the
>>> ResolvedMemberTable during class unloading:
>>>
>>>       log_debug(membername, table) ("ResolvedMethod entries counted %d
>>> removed %d", _oops_counted, _oops_removed);
>>>
>>> This will show us how full this table is, and if it gets cleaned a lot.
>>> If it's heavily populated even with class unloading, we probably need to
>>> change it's implementation to use the new ConcurrentHashTable.
>>>
>>> Thanks,
>>> StefanK
>>>
>>> [1] https://openjdk.java.net/projects/jdk/12/
>>>
>>> On 2019-03-17 15:43, Amir Hadadi wrote:
>>>> We've encountered the following performance issue which happens on an instance deployed in docker with ubuntu 16.04, Linux kernel 4.4.0-92-generic and OpenJDK 11.0.2.
>>>> The issue shows up with zgc but does not show up with G1.
>>>>
>>>> During a period of 10 days after deployment, CPU usage goes up steadily at a rate of ~10% per day. Eventually we have to restart all instances.
>>>> I profiled our app after 4 days of uptime using async-profiler and found that the following stack is the most frequent stack: https://gist.github.com/amirhadadi/48b6f84e3b2412124e817a50608e6ddd
>>>> I tried restarting the instance and waited 10 minutes before profiling, and stack walking shows up much less in sampling: https://gist.github.com/amirhadadi/0c43b087b9bfd995119a97cbf3557d21
>>>> This is how the stack walk looks when profiling an instance deployed with g1 after 3 days: https://gist.github.com/amirhadadi/224c33a19bfd9ea8dcc264cefc641496
>>>>
>>>> Please help me figure this one out.
>>>>
>>>>
>>>>
>>>>
>>>>