Unexplained Jumps In Response Time "Fixed" By Disabling Tiered Compilation

James Lent jlent at tivo.com
Wed May 25 14:33:25 UTC 2016


A few weeks ago I was assigned a production bug related to a sudden jump
in response times for a service built on Solr-Lucene.  A restart of the
service would restore the normal response times for about 4 to 5 hours
and then the response times would suddenly jump up again (often to a
different level).  I was able to recreate this issue dozens of times in
a dev environment using a complete clone of the production environment
(EC2 node types, configuration, etc).  In the dev case the only traffic
was a background warming query that ran once per minute. There was also
some background core replication work.  The warming query was
expensive.  It typically took about 2 seconds to run.  After 4 to 5
hours it would suddenly jump to somewhere between 2.5 to 5 seconds and
then hold steady at that new value for as long as I let it run (as much
as 12 hours since I would sometimes restart the service at the end of
the day and let it run overnight).  In production this query would
suddenly jump (along with the rest of the traffic) to a higher value and
then hold steady at that new value for as long as one week before a
restart temporarily fixed it.

Two services running on two different nodes would typically jump to a
different steady state value for the warming query.  I tried attaching
JProfiler, but as soon as the tool instrumented the code (and before
actually running it), the transformation of the classes would "fix" the
problem.  When I ran JProfiler in sampling mode, it just looked like
"everything" was running slow.  I tried manually instrumenting the code
with only limited success (hot-swapping in timing code at a high level).

I pursued many avenues of investigation (application logs, heap, linux
page cache) until I found:

http://stackoverflow.com/questions/14876447/what-could-cause-global-tomcat-jvm-slowdown

This post suggested bumping up Code Cache size:

   -XX:ReservedCodeCacheSize=256m

We are running:

   openjdk version "1.8.0_91"
   OpenJDK Runtime Environment (build 1.8.0_91-b14)
   OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)

and by default (according to jconsole) we already have a very large Code
Cache size (240m) and bumping it up to 256m had no impact on the issue.
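
For anyone who wants to confirm the defaults without jconsole, the
effective flag values can also be printed from the command line (a
diagnostic one-liner, not part of our setup; the grep pattern is only
illustrative):

```shell
# Print the JVM's final flag values and filter for code cache settings.
# On JDK 8 with tiered compilation, ReservedCodeCacheSize defaults to 240 MB.
java -XX:+PrintFlagsFinal -version | grep -i CodeCacheSize
```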

I monitored the Code Cache usage and noted that the problem seemed to
occur when Code Cache usage reached about 70m.  This may not be important.
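
For anyone wanting to do the same monitoring programmatically rather
than through jconsole, a minimal sketch using the standard
MemoryPoolMXBean API (the class name is mine; on JDK 8 the relevant
pool is literally named "Code Cache", while later JDKs split it into
segmented "CodeHeap" pools):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class CodeCacheMonitor {
    // Sums used bytes across all JIT code memory pools
    // ("Code Cache" on JDK 8; "CodeHeap ..." pools on JDK 9+).
    static long codeCacheUsedBytes() {
        long used = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Code")) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null) {
                    used += usage.getUsed();
                }
            }
        }
        return used;
    }

    public static void main(String[] args) {
        System.out.printf("JIT code cache in use: %d KB%n",
                codeCacheUsedBytes() / 1024);
    }
}
```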

I also played with increasing the initial Code Cache size, but got very
unstable results.  The initial (pre-jump) steady state for the warming
query varied a lot from restart to restart (as if the "jump" occurred
immediately):

   -XX:InitialCodeCacheSize=128m
   -XX:ReservedCodeCacheSize=256m

Run  Time(ms)
  1      2197
  2      2584
  3      2365
  4      2145
  5      3202
  6      5152
  7      2218
  8      4680
I then tried turning off Tiered Compilation (without changing the
default Code Cache size) in dev and the problem went away.  Of course it
is hard to prove a negative.  We are now testing in production with good
results (over 40 hours of good performance).
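
For the record, the flag that disables tiered compilation (it is on by
default in JDK 8); shown here against -version only as an illustration:

```shell
# Disable tiered compilation: the JVM then uses only the C2 (server)
# compiler, skipping the C1 tier and the extra profiled code that
# tiered mode places in the code cache.
java -XX:-TieredCompilation -version
```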

Any thoughts?  I realize this is not much to go on, but I thought I
would report my issue in case you get other reports like this and
start to see a pattern.






More information about the hotspot-dev mailing list