Stack walking performance issue

Amir Hadadi amirhadadi at hotmail.com
Mon Mar 25 09:32:30 UTC 2019


The ResolvedMethod count output is from G1 + OpenJDK 11.
________________________________
From: Stefan Karlsson <stefan.karlsson at oracle.com>
Sent: Monday, March 25, 2019 10:53 AM
To: Amir Hadadi; zgc-dev at openjdk.java.net
Subject: Re: Stack walking performance issue

Hi Amir,

On 2019-03-25 09:38, Amir Hadadi wrote:
> Hi Stefan,
> I ran some of the tests you suggested.

Thanks for doing these tests.

> 1) G1 with -XX:-ClassUnloading shows the same issue as ZGC, top stack traces lead to java_lang_invoke_ResolvedMethodName::find_resolved_method(methodHandle const&, Thread*).

I think G1 will fill up the table, and at some point plateau when
all/most methods have been recorded in the table. For ZGC, with the bug
in JDK 11, we'll never reach a steady state and will forever keep on
increasing the hash table.

> 2) With class unloading enabled, I see the following being printed:
>
> service.gc.0:[24304.545s][debug][membername,table] ResolvedMethod entries counted 94768 removed 92352
> service.gc.0:[36944.749s][debug][membername,table] ResolvedMethod entries counted 64041 removed 61616
> service.gc.0:[36944.749s][debug][membername,table] ResolvedMethod entries counted 64041 removed 61616
> service.gc.3:[42177.904s][debug][membername,table] ResolvedMethod entries counted 19346 removed 16930
> service.gc.1:[47231.731s][debug][membername,table] ResolvedMethod entries counted 20110 removed 17694
> service.gc.1:[47231.731s][debug][membername,table] ResolvedMethod entries counted 20110 removed 17694
> service.gc.3:[52267.616s][debug][membername,table] ResolvedMethod entries counted 27426 removed 25010
> service.gc.1:[57078.463s][debug][membername,table] ResolvedMethod entries counted 30835 removed 28419
> service.gc.3:[61907.835s][debug][membername,table] ResolvedMethod entries counted 32403 removed 29987
> service.gc.2:[66916.223s][debug][membername,table] ResolvedMethod entries counted 34545 removed 32129
> service.gc.2:[71786.212s][debug][membername,table] ResolvedMethod entries counted 39112 removed 36696
> service.gc.2:[76619.460s][debug][membername,table] ResolvedMethod entries counted 34305 removed 31889

Is this output from G1 or ZGC + JDK 12?

This output shows that the table is too small. I've created a
hotspot/runtime bug report for this issue:
https://bugs.openjdk.java.net/browse/JDK-8221393

Thanks,
StefanK


>
> ________________________________
> From: Stefan Karlsson <stefan.karlsson at oracle.com>
> Sent: Monday, March 18, 2019 1:33 PM
> To: Amir Hadadi; zgc-dev at openjdk.java.net
> Subject: Re: Stack walking performance issue
>
> On 2019-03-18 10:46, Amir Hadadi wrote:
>> Thanks Stefan and Alexey!
>> As the app is not loading more and more classes, I was not expecting not having class unloading to be an issue.
>> Can you please explain what's the ResolvedMethodTable and what can cause it to fill up?
>
> I'm not too familiar with that code, but it looks like it creates an
> entry in the table for each different java stack frame it encounters in
> the stack walking code.
>
> There are comments about the ResolvedMethodName here:
> http://hg.openjdk.java.net/jdk/jdk/file/5e1480a38a43/src/java.base/share/classes/java/lang/invoke/MemberName.java#l46
>
>> Can it keep filling up even when classes are not being constantly loaded?
>
> A GC that performs class unloading will actually do more cleaning work
> than unloading classes. Cleaning ResolvedMethodTable is one. Cleaning
> out unused entries in the StringTable is another.
>
> The ResolvedMethodTable holds "weak" pointers to ResolvedMethodNames
> java objects. If a ResolvedNMethodName is not strongly reachable, the GC
> will remove that entry (if class unloading is enabled). The
> ResolvedMethodName is installed in MemberName java objects (see code
> above). The stack frame code installs the MemberName object in
> StackFrameInfo.memberName here:
>
> void java_lang_StackFrameInfo::set_method_and_bci(Handle stackFrame,
> const methodHandle& method, int bci, TRAPS) {
> ...
>     // Create and install ResolvedMethodName in ResolvedMethodTable
>     CallInfo info(method(), ik, CHECK);
>     // Installs ResolveMethodName in MemberName
>     MethodHandles::init_method_MemberName(mname, info);
>
> And from your stack trace, the StackFrameInfo objects gets fetched here:
>
>    java_lang_StackFrameInfo::set_method_and_bci(Handle, methodHandle
> const&, int, Thread*)
>    JavaFrameStream::fill_frame(int, objArrayHandle, methodHandle const&,
> Thread*)
>    StackWalk::fill_in_frames(long, BaseFrameStream&, int, int,
> objArrayHandle, int&, Thread*)
>    StackWalk::fetchNextBatch(Handle, long, long, int, int,
> objArrayHandle, Thread*)
>    JVM_MoreStackWalk
>    java.lang.StackStreamFactory$AbstractStackWalker.fetchStackFrames
>
> I guess that these StackFrameInfos are used further down in that code,
> and when they're not needed any more they will be dropped from the
> object graph. When that happens there will be no other path to the
> MemberName, and no path to the ResolvedMethodName, and the next class
> unloading GC will clear the entry in the ResolvedMethodTable.
>
> In ZGC (in JDK 11), we actually do clean out the "weak" pointers in the
> ResolvedMethodTable, but we do not remove the entries. Therefore, if you
> do a lot of stack walking and throw away the StackFrameInfos, the
> ResolvedMethodTable will fill up with cleared entries pointing to NULL.
>
> So, in short, using the stack walking API with the JDK 11 version of
> ZGC, has the performance problem you've reported.
>
>>
>> I will run the experiments you suggested and report back.
>
> Thanks!
> StefanK
>
>> ________________________________
>> From: Stefan Karlsson <stefan.karlsson at oracle.com>
>> Sent: Monday, March 18, 2019 10:54 AM
>> To: Amir Hadadi; zgc-dev at openjdk.java.net
>> Subject: Re: Stack walking performance issue
>>
>> Hi Amir,
>>
>> Thanks for testing ZGC and reporting back this issue.
>>
>> The stack trace indicates that the JVM is performing lookups in the
>> ResolvedMethodTable:
>>
>>     ResolvedMethodTable::lookup(int, unsigned int, Method*)
>>     java_lang_invoke_ResolvedMethodName::find_resolved_method(methodHandle
>> const&, Thread*)
>> CallInfo::CallInfo(Method*, Klass*, Thread*)
>>
>> This table has a small, fixed size:
>>
>>      enum Constants {
>>        _table_size  = 1007
>>      };
>>
>> and it gets cleaned out during class unloading.
>>
>> ZGC doesn't have class unloading in JDK 11. In JDK 12, which is about to
>> go GA tomorrow [1], we've implemented concurrent class unloading for
>> ZGC. So, that version will clean out the ResolvedMethodTable.
>>
>> It would be good to verify that this is indeed the problem. Two things
>> that could be tested:
>>
>> 1) Run with G1 and class unloading disabled with -XX:-ClassUnloading. If
>> this is showing the same problem as ZGC, then we're on the right track.
>>
>> 2) Try with ZGC on JDK 12.
>>
>> When running with class unloading, you could also try to add
>> membername+table=debug to the -Xlog flag. For example:
>>
>>     -Xlog:gc*,membername+table=debug
>>
>> This will print the following line every time the GC cleans the
>> ResolvedMemberTable during class unloading:
>>
>>      log_debug(membername, table) ("ResolvedMethod entries counted %d
>> removed %d", _oops_counted, _oops_removed);
>>
>> This will show us how full this table is, and if it gets cleaned a lot.
>> If it's heavily populated even with class unloading, we probably need to
>> change it's implementation to use the new ConcurrentHashTable.
>>
>> Thanks,
>> StefanK
>>
>> [1] https://openjdk.java.net/projects/jdk/12/
>>
>> On 2019-03-17 15:43, Amir Hadadi wrote:
>>> We've encountered the following performance issue which happens on an instance deployed in docker with ubuntu 16.04, Linux kernel 4.4.0-92-generic and OpenJDK 11.0.2.
>>> The issue shows up with zgc but does not show up with G1.
>>>
>>> During a period of 10 days after deployment, CPU usage goes up steadily at a rate of ~10% per day. Eventually we have to restart all instances.
>>> I profiled our app after 4 days of uptime using async-profiler and found that the following stack is the most frequent stack: https://gist.github.com/amirhadadi/48b6f84e3b2412124e817a50608e6ddd
>>> I tried restarting the instance and waited 10 minutes before profiling, and stack walking shows up much less in sampling: https://gist.github.com/amirhadadi/0c43b087b9bfd995119a97cbf3557d21
>>> This is how the stack walk looks when profiling an instance deployed with g1 after 3 days: https://gist.github.com/amirhadadi/224c33a19bfd9ea8dcc264cefc641496
>>>
>>> Please help me figure this one out.
>>>
>>>
>>>
>>>
>>>


More information about the zgc-dev mailing list