Strange Long ParNew GC Pauses (Sample Code Included)

Thu Apr 30 13:16:27 UTC 2015

Recently we started seeing long ParNewGC pauses when starting up Kafka
that were causing session timeouts:

[2015-04-24 13:26:23,244] INFO 0 successfully elected as leader
(kafka.server.ZookeeperLeaderElector)
2.111: [GC (Allocation Failure) 2.111: [ParNew:
136320K->10236K(153344K), 0.0235777 secs] 648320K->522236K(2080128K),
0.0237092 secs] [Times: user=0.03 sys=0.01, real=0.02 secs]
2.599: [GC (Allocation Failure) 2.599: [ParNew: 146556K->3201K(153344K),
9.1514626 secs] 658556K->519191K(2080128K), 9.1515757 secs] [Times:
user=18.25 sys=0.01, real=9.15 secs]
[2015-04-24 13:26:33,443] INFO New leader is 0
(kafka.server.ZookeeperLeaderElector$LeaderChangeListener)

After much investigation I found that the trigger was the allocation of
a 500M static object early in the Kafka startup code.  It makes no sense
(to me) that allocating a single large static object in the old
generation would affect ParNew collections, but testing here indicates
that it does.  I have filed an Oracle bug report, but it is still under
investigation.

I have reproduced the problem with a simple application on several Linux
platforms including an EC2 instance (and last night a Win 7 laptop) and
the following JREs:

OpenJDK: 6, 7, and 8
Oracle: 7 and 8

Oracle 6 does not seem to have an issue.  All its ParNewGC times are small.

Here is the simple program that demonstrates the issue (at least
everywhere I have tried it):

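import java.util.ArrayList;

public class LongParNewPause {

     // Large object allocated once and kept reachable for the life of the JVM.
     static byte[] bigStaticObject;

     public static void main(String[] args) throws Exception {
         // Optional arguments: big object size (bytes), small object size
         // (bytes), and N, where every Nth small object is retained.
         int bigObjSize    = args.length > 0 ? Integer.parseInt(args[0]) : 500 * 1024 * 1024;
         int littleObjSize = args.length > 1 ? Integer.parseInt(args[1]) : 100;
         int saveFraction  = args.length > 2 ? Integer.parseInt(args[2]) : 10;

         bigStaticObject = new byte[bigObjSize];

         ArrayList<byte[]> holder = new ArrayList<byte[]>();

         // Allocate small objects forever, retaining one in every
         // saveFraction.  Runs until killed or the heap is exhausted.
         int i = 0;
         while (true) {
             byte[] local = new byte[littleObjSize];
             if (i++ % saveFraction == 0) {
                 holder.add(local);
             }
         }
     }
}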

I run it with the following options:

     -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmx2G -Xms2G
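
In case it helps anyone trying to reproduce this without scraping logs,
here is a minimal sketch that reads cumulative collector counts and
times from inside the JVM using the standard java.lang.management API.
The class name GcTotals is just an illustration, and the collector
names it prints depend on the JVM and the flags in use, so I make no
assumption about them here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the test program above.
class GcTotals {
    /** One line per registered collector: name, collection count, cumulative time in ms. */
    static List<String> snapshot() {
        List<String> lines = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName() + ": count=" + gc.getCollectionCount()
                + " timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : snapshot()) {
            System.out.println(line);
        }
    }
}
```

Calling snapshot() periodically from a background thread would show the
cumulative time jump when a long pause occurs, although the thread
itself is stopped for the duration of the pause.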

Note that I have not seen the issue with 1G heaps.  4G heaps also
exhibit the issue (as do heaps as small as 1.2G).  Bigger heaps seem to
aggravate the problem.

Here is some sample output:

0.321: [GC (Allocation Failure) 0.321: [ParNew:
272640K->27329K(306688K), 0.0140537 secs] 784640K->539329K(2063104K),
0.0141584 secs] [Times: user=0.05 sys=0.02, real=0.02 secs]
0.368: [GC (Allocation Failure) 0.368: [ParNew:
299969K->34048K(306688K), 0.7655383 secs] 811969K->572321K(2063104K),
0.7656172 secs] [Times: user=2.89 sys=0.02, real=0.77 secs]
1.165: [GC (Allocation Failure) 1.165: [ParNew:
306688K->34048K(306688K), 13.8395969 secs] 844961K->599389K(2063104K),
13.8396650 secs] [Times: user=54.38 sys=0.05, real=13.84 secs]
15.036: [GC (Allocation Failure) 15.036: [ParNew:
306688K->34048K(306688K), 0.0287254 secs] 872029K->628028K(2063104K),
0.0287876 secs] [Times: user=0.08 sys=0.01, real=0.03 secs]
15.096: [GC (Allocation Failure) 15.096: [ParNew:
306688K->34048K(306688K), 0.0340727 secs] 900668K->657717K(2063104K),
0.0341386 secs] [Times: user=0.09 sys=0.00, real=0.03 secs]
...
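
To pick the long pauses out of a larger log, something like the sketch
below may be useful.  It assumes each GC record has first been joined
onto a single line (the records above are wrapped by my mail client),
and it only understands the ParNew format shown here.  The class name
and the 1-second threshold in main are arbitrary choices for
illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the ParNew record format shown above.
class ParNewLogLine {
    // Captures the pause reported inside the [ParNew: ...] section.
    private static final Pattern PAUSE =
        Pattern.compile("\\[ParNew: \\d+K->\\d+K\\(\\d+K\\), ([0-9.]+) secs\\]");
    // Captures user, sys and real times from the trailing [Times: ...] section.
    private static final Pattern TIMES =
        Pattern.compile("user=([0-9.]+) sys=([0-9.]+), real=([0-9.]+) secs");

    /** Returns the ParNew pause in seconds, or -1 if the record has no ParNew entry. */
    static double pauseSeconds(String record) {
        Matcher m = PAUSE.matcher(record);
        return m.find() ? Double.parseDouble(m.group(1)) : -1.0;
    }

    /** Returns user CPU time divided by wall-clock time, or -1 if unavailable. */
    static double userToRealRatio(String record) {
        Matcher m = TIMES.matcher(record);
        if (!m.find()) {
            return -1.0;
        }
        double user = Double.parseDouble(m.group(1));
        double real = Double.parseDouble(m.group(3));
        return real > 0 ? user / real : -1.0;
    }

    public static void main(String[] args) {
        String record = "1.165: [GC (Allocation Failure) 1.165: [ParNew: "
            + "306688K->34048K(306688K), 13.8395969 secs] 844961K->599389K(2063104K), "
            + "13.8396650 secs] [Times: user=54.38 sys=0.05, real=13.84 secs]";
        double pause = pauseSeconds(record);
        System.out.println("pause seconds: " + pause);
        System.out.println("user/real:     " + userToRealRatio(record));
        System.out.println("long pause:    " + (pause > 1.0));
    }
}
```

The user/real ratio is a rough indication of how many GC threads were
busy for the whole pause.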

Even stranger is the fact that the problem seems to be limited to
objects in the range of about 480M to 512M.  Specifically (in bytes):

  [ 503316465 , 536870384 ]

Values outside this range appear to be OK.  Playing with the other input
parameters (size of the small objects and the fraction that are
retained) can modify the system behavior:

- Increase the length of the pauses - I have seen pauses of over a
minute in my testing.
- Cause the pause times to steadily increase over many ParNewGC cycles.
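
For anyone choosing sizes to test, the sketch below shows where the
range boundaries quoted above sit relative to 480 MiB and 512 MiB, and
checks whether a given size falls inside the range.  The bounds are
just the empirical values from my testing, and the class and method
names are made up for this example.

```java
// Hypothetical helper; the bounds are the empirical values reported above.
class SuspectRange {
    static final long MIB  = 1024L * 1024L;
    static final long LOW  = 503316465L;  // smallest size seen to trigger long pauses
    static final long HIGH = 536870384L;  // largest size seen to trigger long pauses

    /** True if an array of this many bytes falls inside the reported range. */
    static boolean inReportedRange(long bytes) {
        return bytes >= LOW && bytes <= HIGH;
    }

    public static void main(String[] args) {
        System.out.println("480 MiB - low  = " + (480 * MIB - LOW) + " bytes");
        System.out.println("512 MiB - high = " + (512 * MIB - HIGH) + " bytes");
        System.out.println("default 500 MiB in range: " + inReportedRange(500 * MIB));
    }
}
```

The lower bound is 15 bytes below 480 MiB and the upper bound is 528
bytes below 512 MiB.  The default 500 * 1024 * 1024 used by the test
program is inside the range.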

During the pauses the CPUs seem to peg at 100% (no waiting on resources).

Does anyone have any thoughts?  I would be very interested to know if
anyone else can reproduce the issue.  Other engineers here have been
able to.
