Minor GC difference Java 7 vs Java 8
Chris Hurst
christopherhurst at hotmail.com
Tue Jun 3 13:51:31 UTC 2014
Hi,
Reducing the number of parallel GC threads on the simple testbed program actually reduced the minor GC stop-the-world times (presumably multithreading is inefficient in this scenario because the task is so simple). The reverse was true for the real application, although for a single-thread reduction the difference is hard to measure: sub-millisecond, if present at all. I'd expect the difference to arise because the real application has a more complex object graph and is tenuring, with a very small amount of promotion to old. If a spike does occur with the parallel GC thread count reduced (i.e. if other process activity on the box increases), the duration of the spike appears unaffected by the reduction in parallel GC threads, which I think is what I would expect.
Any suggestions as to what value to use for WorkStealingYieldsBeforeSleep? Otherwise I'll just try doubling it (a sketch of what I have in mind follows the flag list below). If you have any additional notes or details on this change, that would be most helpful.
Chris
PS This particular application uses the following flags (we regularly test other GC permutations, including G1 and CMS; our goal is lower latency):
-Xms1536m
-Xmx1536m
-XX:+PrintCommandLineFlags
-XX:+UseParallelOldGC
-XX:+UnlockDiagnosticVMOptions
-XX:PermSize=50M
-XX:-UseAdaptiveSizePolicy
-XX:+AlwaysPreTouch
-XX:MaxTenuringThreshold=15
-XX:InitialTenuringThreshold=15
-XX:+DTraceMonitorProbes
-XX:+ExtendedDTraceProbes
-Dsun.rmi.dgc.client.gcInterval=604800000
-Dsun.rmi.dgc.server.gcInterval=604800000
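If we do try tuning the work stealing, my plan is something like the following (a sketch only: the WorkStealing* flags are experimental, so they need -XX:+UnlockExperimentalVMOptions, and 10000 is simply double the current default rather than a recommendation; cores-1 is what removed the spike on the testbed):

-XX:+UnlockExperimentalVMOptions
-XX:WorkStealingYieldsBeforeSleep=10000
-XX:ParallelGCThreads=<cores-1>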
Date: Mon, 2 Jun 2014 15:52:14 -0700
From: jon.masamitsu at oracle.com
To: christopherhurst at hotmail.com
CC: hotspot-gc-use at openjdk.java.net
Subject: Re: Minor GC difference Java 7 vs Java 8
On 06/01/2014 05:14 PM, Chris Hurst wrote:
Hi,
Thanks for the replies. I've been stuck on this issue for about a year and had raised it with Oracle support but hadn't got anywhere; last weekend, though, I managed to get a lot further with it.
I wrote a trivial program that continuously fills the young generation and releases the garbage, for use with some DTrace tests, and it showed similar spikes. From there I could pin the issue on the number of parallel GC threads: anything less than the number of cores (i.e. cores-1) removed the spike. For the test program, normal young GCs oscillated around 1ms but spiked at about 15ms (I need to check that figure). Reducing the parallel GC threads worked on the real application in a similar way.
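For reference, a minimal sketch of the kind of allocation-churn program described above (not the original test code; the chunk size and the flags in the comment are illustrative assumptions):

    // YoungGenChurn.java - fill the young generation with short-lived
    // garbage so minor GCs run constantly and copy almost nothing.
    public class YoungGenChurn {
        static volatile int guard; // keeps the JIT from eliminating the allocations

        public static void main(String[] args) {
            while (true) {
                byte[] garbage = new byte[64 * 1024]; // dies young, never promoted
                guard += garbage.length;
            }
        }
    }

    // Run with, for example:
    //   java -Xms1536m -Xmx1536m -XX:+UseParallelOldGC -verbose:gc YoungGenChurn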
We managed to identify some very minor tasks (so small they weren't showing up in some of our coarser-grained CPU monitoring) that occasionally competed for CPU. The size of the effect was surprising, but now that we understand the cause we can tune the Java 6 GC better.
The spikes were again larger than I would have expected, and they all appear to be very close in size; I wouldn't have predicted that from the cause, but that's fine ;-)
Currently the Java 7 version is still not quite as good on overall throughput when not spiking, though I will recheck these results, as our most recent tests were around tuning Java 6 with the new information. We're happy with our Java 6 GC performance.
Although we can now reduce these already rare spikes (potentially to zero), I can't guarantee 100% that they won't occur, so I would still like to understand why Java 7 appears to handle this scenario less efficiently.
Using DTrace we were mostly seeing yields, and a stack trace pointed us toward some JDK 7 changes (taskqueue.cpp) and some newer JVM options that might be related:
libc.so.1`lwp_yield+0x15
libjvm.so`__1cCosFyield6F_v_+0x257
libjvm.so`__1cWParallelTaskTerminatorFyield6M_v_+0x18
libjvm.so`__1cWParallelTaskTerminatorRoffer_termination6MpnUTerminatorTerminator__b_+0xe8
libjvm.so`__1cJStealTaskFdo_it6MpnNGCTaskManager_I_v_+0x378
libjvm.so`__1cMGCTaskThreadDrun6M_v_+0x19f
libjvm.so`java_start+0x1f2
libc.so.1`_thr_setup+0x4e
libc.so.1`_lwp_start
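(For anyone wanting to reproduce this: a sketch of the kind of DTrace one-liner that aggregates stacks like the above, probing the libc lwp_yield function seen in the trace; the exact invocation here is an assumption, adjust for your setup:

dtrace -n 'pid$target:libc.so.1:lwp_yield:entry { @[ustack()] = count(); }' -p <java-pid>
)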
uintx WorkStealingHardSpins          = 4096   {experimental}
uintx WorkStealingSleepMillis        = 1      {experimental}
uintx WorkStealingSpinToYieldRatio   = 10     {experimental}
uintx WorkStealingYieldsBeforeSleep  = 5000   {experimental}
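My reading of the flag names and the stack above (an assumption on my part, not the actual HotSpot code) is a back-off of roughly this shape, sketched here in Java; done() and the constants are stand-ins:

    // Simplified sketch of a spin/yield/sleep back-off of the kind the
    // WorkStealing* flags appear to control. Not the real taskqueue.cpp logic.
    class TerminationBackoff {
        static final int HARD_SPINS = 4096;          // cf. WorkStealingHardSpins
        static final int YIELDS_BEFORE_SLEEP = 5000; // cf. WorkStealingYieldsBeforeSleep
        static final long SLEEP_MILLIS = 1;          // cf. WorkStealingSleepMillis

        void offerTermination() throws InterruptedException {
            int spins = 0, yields = 0;
            while (!done()) {
                if (spins < HARD_SPINS) {
                    spins++;                        // busy-wait: cheap if more work arrives soon
                } else if (yields < YIELDS_BEFORE_SLEEP) {
                    yields++;
                    Thread.yield();                 // the lwp_yield calls in the DTrace stack
                } else {
                    Thread.sleep(SLEEP_MILLIS);     // deep back-off: frees the CPU, costs latency
                }
            }
        }

        boolean done() { return true; }             // stand-in for the real termination check
    }

If that shape is roughly right, the yield count decides how long a thread stays runnable before it starts sleeping, which would explain why it shows up in pause latency.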
I haven't had a chance to play with these yet, but could they be involved? E.g. is JDK 7 tuned to be friendlier to other applications at the cost of latency (spin versus yield)? Would that make sense?
Chris,
My best recollection is that there was a performance regression reported internally, and the change to 5000 was made to fix that regression. Increasing the number of yields done before a sleep made this code behave more like it did previously. Let me know if you need better information and I can see what I can dig up.
By the way, when you tuned down the number of ParallelGCThreads,
you saw little or no increase in the STW pause times?
You're using UseParallelGC?
Jon
We would like to move to Java 7 for support reasons. Also, as we are on Solaris, the extra memory overhead of Java 8 (64-bit only), even with compressed oops, gives us another latency hit.
Chris
PS -XX:+AlwaysPreTouch is on.
> Date: Thu, 29 May 2014 13:47:39 -0700
> From: Peter.B.Kessler at Oracle.COM
> To: christopherhurst at hotmail.com; hotspot-gc-use at openjdk.java.net
> Subject: Re: Minor GC difference Java 7 vs Java 8
>
> Are the -XX:+PrintGCDetails "[Times: user=0.01 sys=0.00, real=0.03 secs]" reports for the long pauses different from the short pauses? I'm hoping for some anomalous sys time, or user/real ratio, that would indicate it was something happening on the machine that is interfering with the collector. (With N parallel GC threads fully busy, user time should be roughly N times real time; a pause where user is much lower than that suggests the GC threads were off-CPU.) But you'd think that would show up as occasional 15ms blips in your message processing latency outside of when the collector goes off.
>
> Does -XX:+PrintHeapAtGC show anything anomalous about the space occupancy after the long pauses? E.g., more objects getting copied to the survivor space, or promoted to the old generation? You could infer the numbers from -XX:+PrintGCDetails output if you didn't want to deal with the volume produced by -XX:+PrintHeapAtGC.
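> For example, in an illustrative (made-up) minor GC line such as
>   [GC [PSYoungGen: 300000K->20000K(350000K)] 500000K->230000K(1500000K), 0.0200 secs]
> the young generation dropped 280000K while the whole heap dropped only 270000K, so about 10000K was promoted to the old generation in that collection.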
>
> You don't say how large or how stable your old generation size is. If you have to get new pages from the OS to expand the old generation, or give pages back to the OS because the old generation can shrink, that's extra work. You can infer this traffic from -XX:+PrintHeapAtGC output by looking at the "committed" values for the generations. E.g., in "ParOldGen total 43008K, used 226K [0xba400000, 0xbce00000, 0xe4e00000)" those three hex numbers are the start address for the generation, the end of the committed memory for that generation, and the end of the reserved memory for that generation. There's a similar report for the young generation. Running with -Xms equal to -Xmx should prevent pages from being acquired from or returned to the OS during the run.
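> (To make the arithmetic concrete: committed = 0xbce00000 - 0xba400000 = 0x2a00000 bytes = 43008K, which matches the "total" figure above, and reserved = 0xe4e00000 - 0xba400000 = 0x2aa00000 bytes = 698368K.)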
>
> Are you running with -XX:+AlwaysPreTouch? Even if you've reserved and committed the address space, the first time you touch new pages the OS wants to zero them, which takes time. That flag forces all the zeroing at initialization. If you know your page size, you should be able to see the generations (mostly the old generation) crossing a page boundary for the first time in the -XX:+PrintHeapAtGC output.
>
> Or it could be some change in the collector between JDK-6 and JDK-7.
>
> Posting some log snippets might let sharper eyes see something.
>
> ... peter
>
> On 04/30/14 07:58, Chris Hurst wrote:
> > Hi,
> >
> > Has anyone seen anything similar to this ...
> >
> > On Java 6 (a range of versions, 32-bit Solaris), using parallel old GC, non-adaptive, under a very heavy test load we see minor GCs around the 5ms mark, plus some very rare larger pauses of ~20ms, say 3 or 4 instances in 12 hours. The number of these pauses is random (though always tiny compared with the total number of GCs), and their size (~20ms) appears the same at every such point. We have a large number of minor GCs in our runs, and only a full GC at startup. These freak GCs can be bunched or spread out, and we can run for many hours without one (while still doing minor GCs).
> >
> > What's odd is that if I use Java 7 (a range of versions, 32-bit) the result is very close, but the spikes (1 or 2, arguably fewer) are now 30-40ms (depending on the run; arguably even rarer). Has anyone experienced anything similar, and why would Java 7 up to double a minor GC? GC throughput is approximately the same; arguably Java 7's throughput is slightly better, but that freak minor GC makes it unusable for us because of the latency.
> >
> > In terms of the change in spike height (20ms on Java 6 vs 40ms on Java 7), this is very reproducible, though the number of points and when they occur varies slightly. The overall GC graph and throughput are otherwise similar, as is the resultant memory dump at the end. The test load should be constant: multiple clients just doing the same thing over and over.
> >
> > Has anyone seen anything similar? I was hoping someone might know of a change in defaults, a thread timeout, or a default data-structure size change that would account for this. I was hoping the marked increase might be a giveaway to someone, as it is way off our average minor GC time.
> >
> > We have looked at GC logs, heap dumps, processor activity, background processes, the amount of disc access, safepoints, etc.; we trace the message rate into and out of the application to check for variation, compare heap dumps at the end, and so on. Nothing stands out so far.
> >
> > Chris