Stack walking performance issue

Mon Mar 18 08:54:52 UTC 2019

Hi Amir,

Thanks for testing ZGC and reporting back this issue.

The stack trace indicates that the JVM is performing lookups in the 
ResolvedMethodTable:

  ResolvedMethodTable::lookup(int, unsigned int, Method*)
  java_lang_invoke_ResolvedMethodName::find_resolved_method(methodHandle 
const&, Thread*)
CallInfo::CallInfo(Method*, Klass*, Thread*)

This table has a small, fixed size:

   enum Constants {
     _table_size  = 1007
   };

and it gets cleaned out during class unloading.

ZGC doesn't have class unloading in JDK 11. In JDK 12, which is about to 
go GA tomorrow [1], we've implemented concurrent class unloading for 
ZGC. So, that version will clean out the ResolvedMethodTable.

It would be good to verify that this is indeed the problem. Two things 
that could be tested:

1) Run with G1 and class unloading disabled with -XX:-ClassUnloading. If 
this is showing the same problem as ZGC, then we're on the right track.

2) Try with ZGC on JDK 12.

When running with class unloading, you could also try to add 
membername+table=debug to the -Xlog flag. For example:

  -Xlog:gc*,membername+table=debug

This will print the following line every time the GC cleans the 
ResolvedMemberTable during class unloading:

   log_debug(membername, table) ("ResolvedMethod entries counted %d 
removed %d", _oops_counted, _oops_removed);

This will show us how full this table is, and if it gets cleaned a lot. 
If it's heavily populated even with class unloading, we probably need to 
change it's implementation to use the new ConcurrentHashTable.

Thanks,
StefanK

[1] https://openjdk.java.net/projects/jdk/12/

On 2019-03-17 15:43, Amir Hadadi wrote:
> We've encountered the following performance issue which happens on an instance deployed in docker with ubuntu 16.04, Linux kernel 4.4.0-92-generic and OpenJDK 11.0.2.
> The issue shows up with zgc but does not show up with G1.
> 
> During a period of 10 days after deployment, CPU usage goes up steadily at a rate of ~10% per day. Eventually we have to restart all instances.
> I profiled our app after 4 days of uptime using async-profiler and found that the following stack is the most frequent stack: https://gist.github.com/amirhadadi/48b6f84e3b2412124e817a50608e6ddd
> I tried restarting the instance and waited 10 minutes before profiling, and stack walking shows up much less in sampling: https://gist.github.com/amirhadadi/0c43b087b9bfd995119a97cbf3557d21
> This is how the stack walk looks when profiling an instance deployed with g1 after 3 days: https://gist.github.com/amirhadadi/224c33a19bfd9ea8dcc264cefc641496
> 
> Please help me figure this one out.
> 
> 
> 
> 
>