Minor GC difference Java 7 vs Java 8
Chris Hurst
christopherhurst at hotmail.com
Tue Jun 3 13:51:31 UTC 2014
Hi,
Reducing the number of parallel GC threads on the simple testbed program actually reduced the minor GC stop-the-world times (presumably multithreading is inefficient in this scenario because the task is so simple). The reverse was true for the real application, although for a single-thread reduction the difference is hard to measure: sub-millisecond, if present at all. I'd expect the difference to arise because the real application has a more complex object graph and is tenuring, with a very small amount of promotion to old. If a spike does occur with the parallel GC thread count reduced (i.e. if other process activity on the box increases), the duration of the spike appears unaffected by the reduction in parallel GC threads, which I think is what I would expect.
Any suggestions as to what value to use for WorkStealingYieldsBeforeSleep? Otherwise I'll just try doubling it (a sketch of what I have in mind follows the flag list below). If you have any additional notes or details on this change, that would be most helpful.
Chris
PS This particular application uses the following flags (we regularly test other GC permutations, including G1 and CMS; our goal is lower latency):
-Xms1536m
-Xmx1536m
-XX:+PrintCommandLineFlags
-XX:+UseParallelOldGC
-XX:+UnlockDiagnosticVMOptions
-XX:PermSize=50M
-XX:-UseAdaptiveSizePolicy
-XX:+AlwaysPreTouch
-XX:MaxTenuringThreshold=15
-XX:InitialTenuringThreshold=15
-XX:+DTraceMonitorProbes
-XX:+ExtendedDTraceProbes
-Dsun.rmi.dgc.client.gcInterval=604800000
-Dsun.rmi.dgc.server.gcInterval=604800000
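If we do try tuning the work stealing, my plan is something like the following (a sketch only: the WorkStealing* flags are experimental, so they need -XX:+UnlockExperimentalVMOptions, and 10000 is simply double the current default rather than a recommendation; cores-1 is what removed the spike on the testbed):

-XX:+UnlockExperimentalVMOptions
-XX:WorkStealingYieldsBeforeSleep=10000
-XX:ParallelGCThreads=<cores-1>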
Date: Mon, 2 Jun 2014 15:52:14 -0700
From: jon.masamitsu at oracle.com
To: christopherhurst at hotmail.com
CC: hotspot-gc-use at openjdk.java.net
Subject: Re: Minor GC difference Java 7 vs Java 8
On 06/01/2014 05:14 PM, Chris Hurst wrote:
Hi,
Thanks for the replies. I've been stuck on this issue for about a year and had raised it with Oracle support but hadn't got anywhere; last weekend, though, I managed to get a lot further with it.
I wrote a trivial program that continuously fills the young generation and releases the garbage, for use with some DTrace tests, and it showed similar spikes. From there I could pin the issue on the number of parallel GC threads: anything less than the number of cores (i.e. cores-1) removed the spike. For the test program, normal young GCs oscillated around 1ms but spiked at about 15ms (I need to check that figure). Reducing the parallel GC threads worked on the real application in a similar way.
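For reference, a minimal sketch of the kind of allocation-churn program described above (not the original test code; the chunk size and the flags in the comment are illustrative assumptions):

    // YoungGenChurn.java - fill the young generation with short-lived
    // garbage so minor GCs run constantly and copy almost nothing.
    public class YoungGenChurn {
        static volatile int guard; // keeps the JIT from eliminating the allocations

        public static void main(String[] args) {
            while (true) {
                byte[] garbage = new byte[64 * 1024]; // dies young, never promoted
                guard += garbage.length;
            }
        }
    }

    // Run with, for example:
    //   java -Xms1536m -Xmx1536m -XX:+UseParallelOldGC -verbose:gc YoungGenChurn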
We managed to identify some very minor tasks (so small they weren't showing up in some of our coarser-grained CPU monitoring) that occasionally competed for CPU. The size of the effect was surprising, but now that we understand the cause we can tune the Java 6 GC better.
The spikes were again larger than I would have expected, and they all appear to be very close in size; I wouldn't have predicted that from the cause, but that's fine ;-)
Currently the Java 7 version is still not quite as good on overall throughput when not spiking, though I will recheck these results, as our most recent tests were around tuning Java 6 with the new information. We're happy with our Java 6 GC performance.
Although we can now reduce these already rare spikes (potentially to zero), I can't guarantee 100% that they won't occur, so I would still like to understand why Java 7 appears to handle this scenario less efficiently.
Using DTrace we were mostly seeing yields, and a stack trace pointed us toward some JDK 7 changes (taskqueue.cpp) and some newer JVM options that might be related:
libc.so.1`lwp_yield+0x15
libjvm.so`__1cCosFyield6F_v_+0x257
libjvm.so`__1cWParallelTaskTerminatorFyield6M_v_+0x18
libjvm.so`__1cWParallelTaskTerminatorRoffer_termination6MpnUTerminatorTerminator__b_+0xe8
libjvm.so`__1cJStealTaskFdo_it6MpnNGCTaskManager_I_v_+0x378
libjvm.so`__1cMGCTaskThreadDrun6M_v_+0x19f
libjvm.so`java_start+0x1f2
libc.so.1`_thr_setup+0x4e
libc.so.1`_lwp_start
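(For anyone wanting to reproduce this: a sketch of the kind of DTrace one-liner that aggregates stacks like the above, probing the libc lwp_yield function seen in the trace; the exact invocation here is an assumption, adjust for your setup:

dtrace -n 'pid$target:libc.so.1:lwp_yield:entry { @[ustack()] = count(); }' -p <java-pid>
)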
uintx WorkStealingHardSpins          = 4096   {experimental}
uintx WorkStealingSleepMillis        = 1      {experimental}
uintx WorkStealingSpinToYieldRatio   = 10     {experimental}
uintx WorkStealingYieldsBeforeSleep  = 5000   {experimental}
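My reading of the flag names and the stack above (an assumption on my part, not the actual HotSpot code) is a back-off of roughly this shape, sketched here in Java; done() and the constants are stand-ins:

    // Simplified sketch of a spin/yield/sleep back-off of the kind the
    // WorkStealing* flags appear to control. Not the real taskqueue.cpp logic.
    class TerminationBackoff {
        static final int HARD_SPINS = 4096;          // cf. WorkStealingHardSpins
        static final int YIELDS_BEFORE_SLEEP = 5000; // cf. WorkStealingYieldsBeforeSleep
        static final long SLEEP_MILLIS = 1;          // cf. WorkStealingSleepMillis

        void offerTermination() throws InterruptedException {
            int spins = 0, yields = 0;
            while (!done()) {
                if (spins < HARD_SPINS) {
                    spins++;                        // busy-wait: cheap if more work arrives soon
                } else if (yields < YIELDS_BEFORE_SLEEP) {
                    yields++;
                    Thread.yield();                 // the lwp_yield calls in the DTrace stack
                } else {
                    Thread.sleep(SLEEP_MILLIS);     // deep back-off: frees the CPU, costs latency
                }
            }
        }

        boolean done() { return true; }             // stand-in for the real termination check
    }

If that shape is roughly right, the yield count decides how long a thread stays runnable before it starts sleeping, which would explain why it shows up in pause latency.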
I haven't had a chance to play with these yet, but could they be involved? E.g. is JDK 7 tuned to be friendlier to other applications at the cost of latency (spin versus yield)? Would that make sense?
Chris,
My best recollection is that there was a performance regression reported internally, and the change to 5000 was made to fix that regression. Increasing the number of yields done before a sleep made this code behave more like it did previously. Let me know if you need better information and I can see what I can dig up.
By the way, when you tuned down the number of ParallelGCThreads,
you saw little or no increase in the STW pause times?
You're using UseParallelGC?
Jon
We would like to move to Java 7 for support reasons. Also, as we are on Solaris, the extra memory overhead of Java 8 (64-bit only), even with compressed oops, gives us another latency hit.
Chris
PS -XX:+AlwaysPreTouch is on.
> Date: Thu, 29 May 2014 13:47:39 -0700
> From: Peter.B.Kessler at Oracle.COM
> To: christopherhurst at hotmail.com; hotspot-gc-use at openjdk.java.net
> Subject: Re: Minor GC difference Java 7 vs Java 8
>
> Are the -XX:+PrintGCDetails "[Times: user=0.01 sys=0.00, real=0.03 secs]" reports for the long pauses different from the short pauses? I'm hoping for some anomalous sys time, or user/real ratio, that would indicate it was something happening on the machine that is interfering with the collector. (With N parallel GC threads fully busy, user time should be roughly N times real time; a pause where user is much lower than that suggests the GC threads were off-CPU.) But you'd think that would show up as occasional 15ms blips in your message processing latency outside of when the collector goes off.
>
> Does -XX:+PrintHeapAtGC show anything anomalous about the space occupancy after the long pauses? E.g., more objects getting copied to the survivor space, or promoted to the old generation? You could infer the numbers from -XX:+PrintGCDetails output if you didn't want to deal with the volume produced by -XX:+PrintHeapAtGC.
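> For example, in an illustrative (made-up) minor GC line such as
>   [GC [PSYoungGen: 300000K->20000K(350000K)] 500000K->230000K(1500000K), 0.0200 secs]
> the young generation dropped 280000K while the whole heap dropped only 270000K, so about 10000K was promoted to the old generation in that collection.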
>
> You don't say how large or how stable your old generation size is. If you have to get new pages from the OS to expand the old generation, or give pages back to the OS because the old generation can shrink, that's extra work. You can infer this traffic from -XX:+PrintHeapAtGC output by looking at the "committed" values for the generations. E.g., in "ParOldGen total 43008K, used 226K [0xba400000, 0xbce00000, 0xe4e00000)" those three hex numbers are the start address for the generation, the end of the committed memory for that generation, and the end of the reserved memory for that generation. There's a similar report for the young generation. Running with -Xms equal to -Xmx should prevent pages from being acquired from or returned to the OS during the run.
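> (To make the arithmetic concrete: committed = 0xbce00000 - 0xba400000 = 0x2a00000 bytes = 43008K, which matches the "total" figure above, and reserved = 0xe4e00000 - 0xba400000 = 0x2aa00000 bytes = 698368K.)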
>
> Are you running with -XX:+AlwaysPreTouch? Even if you've reserved and committed the address space, the first time you touch new pages the OS wants to zero them, which takes time. That flag forces all the zeroing at initialization. If you know your page size, you should be able to see the generations (mostly the old generation) crossing a page boundary for the first time in the -XX:+PrintHeapAtGC output.
>
> Or it could be some change in the collector between JDK-6 and JDK-7.
>
> Posting some log snippets might let sharper eyes see something.
>
> ... peter
>
> On 04/30/14 07:58, Chris Hurst wrote:
> > Hi,
> >
> > Has anyone seen anything similar to this ...
> >
> > On Java 6 (a range of versions, 32-bit Solaris), using parallel old GC, non-adaptive, under a very heavy test load we see minor GCs around the 5ms mark, plus some very rare larger pauses of ~20ms, say 3 or 4 instances in 12 hours. The number of these pauses is random (though always tiny compared with the total number of GCs), and their size (~20ms) appears the same at every such point. We have a large number of minor GCs in our runs, and only a full GC at startup. These freak GCs can be bunched or spread out, and we can run for many hours without one (while still doing minor GCs).
> >
> > What's odd is that if I use Java 7 (a range of versions, 32-bit) the result is very close, but the spikes (1 or 2, arguably fewer) are now 30-40ms (depending on the run; arguably even rarer). Has anyone experienced anything similar, and why would Java 7 up to double a minor GC? GC throughput is approximately the same; arguably Java 7's throughput is slightly better, but that freak minor GC makes it unusable for us because of the latency.
> >
> > In terms of the change in spike height (20ms on Java 6 vs 40ms on Java 7), this is very reproducible, though the number of points and when they occur varies slightly. The overall GC graph and throughput are otherwise similar, as is the resultant memory dump at the end. The test load should be constant: multiple clients just doing the same thing over and over.
> >
> > Has anyone seen anything similar? I was hoping someone might know of a change in defaults, a thread timeout, or a default data-structure size change that would account for this. I was hoping the marked increase might be a giveaway to someone, as it is way off our average minor GC time.
> >
> > We have looked at GC logs, heap dumps, processor activity, background processes, the amount of disc access, safepoints, etc.; we trace the message rate into and out of the application to check for variation, compare heap dumps at the end, and so on. Nothing stands out so far.
> >
> > Chris