Growing GC Young Gen Times

Matt Fowles matt.fowles at gmail.com
Fri May 14 11:30:30 PDT 2010


Ramki~

File 3.

Matt

On Fri, May 14, 2010 at 2:30 PM, Matt Fowles <matt.fowles at gmail.com> wrote:
> Ramki~
>
> File 2.
>
> Matt
>
> On Fri, May 14, 2010 at 2:29 PM, Matt Fowles <matt.fowles at gmail.com> wrote:
>> Ramki~
>>
>> Attached are 3 different runs with slightly tweaked VM settings based
>> on suggestions from this list and others.
>>
>> All of them have reduced the MaxTenuringThreshold to 2.
>>
>> gc1.log reduces the young gen size to 1g and the old gen size to 7g
>> initially.  As you can see from it, the young gen collection times do not
>> improve after the CMS sweep that occurs partway through the run.
>> gc2.log adds the -XX:+UseLargePages and -XX:+AlwaysPreTouch options to
>> the settings from gc1.log.
>> gc3.log adds the -XX:+UseLargePages and -XX:+AlwaysPreTouch options to
>> a 4g young gen with the original 32g total heap.
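>>
>> Roughly, the per-run flag deltas were as follows (the exact -Xms/-Xmx
>> values are my shorthand for the sizes described above, so treat them as
>> approximate):
>>
>>   gc1: -Xms8g -Xmx8g -Xmn1g -XX:MaxTenuringThreshold=2
>>   gc2: the gc1 settings plus -XX:+UseLargePages -XX:+AlwaysPreTouch
>>   gc3: -Xms32g -Xmx32g -Xmn4g -XX:MaxTenuringThreshold=2
>>        plus -XX:+UseLargePages -XX:+AlwaysPreTouch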
>>
>> Due to the quirks of the infrastructure running these tests, data
>> volumes (and hence allocation rates) are NOT comparable between runs.
>> Because the tests take ~4 hours to run, we run several tests in
>> parallel against different sources of data.
>>
>> Matt
>>
>> PS - Due to attachment size restrictions I am going to send each file
>> in its own email.
>>
>> On Fri, May 14, 2010 at 1:23 PM, Y. Srinivas Ramakrishna
>> <y.s.ramakrishna at oracle.com> wrote:
>>> On 05/14/10 10:07, Matt Fowles wrote:
>>>>
>>>> Ramki~
>>>>
>>>> The machine has 4 CPUs, each of which has 4 cores.  I will adjust the
>>>
>>> Great, thanks. I'd suggest making ParallelGCThreads=8. Also compare with
>>> -XX:-UseParNewGC.  If it's the kind of fragmentation that we
>>> believe may be the cause here, you'd see larger gc times in the
>>> latter case, but they would not increase as they do now.  But that
>>> is conjecture at this point.
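>>>
>>> Concretely, something like the following two comparison runs (these are
>>> only the deltas relative to your current settings; the second one
>>> deliberately falls back to the serial young collector):
>>>
>>>   run A: -XX:+UseParNewGC -XX:ParallelGCThreads=8
>>>   run B: -XX:-UseParNewGC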
>>>
>>>> survivor spaces as you suggest.  Previously I had been running with
>>>> MTT=0, but changed it to 4 at the suggestion of others.
>>>
>>> MTT=0 can give very poor performance; as others have said, MTT=4
>>> would definitely be better here than MTT=0.
>>> You should use MTT=1 here, though.
>>>
>>>>
>>>> Running with the JDK7 version may take a bit of time, but I will
>>>> pursue that as well.
>>>
>>> All you should do is pull the libjvm.so that is in the JDK 7 installation
>>> (or bundle) and plonk it down into the appropriate directory of your
>>> existing JDK 6u20 installation. We just want to see the results with
>>> the latest JVM which includes a fix for 6631166.
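>>>
>>> On a Linux x86_64 installation the server VM normally lives under
>>> jre/lib/amd64/server, so the swap is roughly the following (paths are
>>> illustrative -- adjust to your layout, and keep a backup copy):
>>>
>>>   cd $JDK6_HOME/jre/lib/amd64/server
>>>   cp libjvm.so libjvm.so.6u20   # back up the 6u20 VM
>>>   cp $JDK7_HOME/jre/lib/amd64/server/libjvm.so .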
>>>
>>> I attached a very rough plot of some metrics extracted from your log
>>> and this behaviour is definitely deserving of a bug, especially
>>> if it can be shown that it happens in the latest JVM. In the plot:
>>>
>>>  red: scavenge durations
>>>  dark blue: promoted data per scavenge
>>>  pink: data in survivor space following scavenge
>>>  light blue: live data in old gen
>>>
>>> As you can see, the scavenge duration clearly correlates with the
>>> occupancy of the old gen (as Jon and others indicated).
>>> Did you try Jon's suggestion of doing a manual GC at that
>>> point via jconsole, and seeing if the upward trend of
>>> scavenge times continues beyond that?
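>>>
>>> (If attaching jconsole is awkward in that environment, running
>>> "jmap -histo:live <pid>" against the process should also force a full
>>> collection as a side effect of taking the live-object histogram.)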
>>>
>>> Did you use -XX:+UseLargePages and -XX:+AlwaysPreTouch?
>>>
>>> Do you have an easily used test case that you can share with us via
>>> your support channels? If/when you do so, please copy me and
>>> send them a reference to this thread on this mailing list.
>>>
>>> later, with your new data.
>>> -- ramki
>>>
>>>>
>>>> Matt
>>>>
>>>>
>>>>
>>>> On Fri, May 14, 2010 at 12:58 PM, Y. Srinivas Ramakrishna
>>>> <y.s.ramakrishna at oracle.com> wrote:
>>>>>
>>>>> Hi Matt -- I am computing some metrics from your log file
>>>>> and would like to know how many CPUs you have for the logs below.
>>>>>
>>>>> Also, as you noted, almost anything that survives a scavenge
>>>>> lives for a while. To reduce the overhead of unnecessary
>>>>> back-and-forth copying in the survivor spaces, just use
>>>>> MaxTenuringThreshold=1 (This suggestion was also made by
>>>>> several others in the thread, and is corroborated by your
>>>>> PrintTenuringDistribution data). Since you have fairly large survivor
>>>>> spaces configured now (at least large enough to fit 4 age cohorts,
>>>>> which will come down to 1 age cohort if you use MTT=1), I'd
>>>>> suggest making your survivor spaces smaller, maybe down to
>>>>> about 64 MB from the current 420 MB each, and giving the excess
>>>>> to your Eden space.
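>>>>>
>>>>> For concreteness (my arithmetic, assuming the young gen stays at 4g):
>>>>> each survivor space is roughly NewSize / (SurvivorRatio + 2), so the
>>>>> current ~420 MB per survivor corresponds to something near the default
>>>>> SurvivorRatio=8 (4096 MB / 10 ~= 410 MB), while -XX:SurvivorRatio=62
>>>>> would give 4096 MB / 64 = 64 MB survivors, with the rest going to Eden.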
>>>>>
>>>>> Then use 6u21 when it comes out (or ask your Java support to
>>>>> send you a 6u21 for a beta test), or drop in a JVM from JDK 7 into
>>>>> your 6u20 installation, and run with that. If you still see
>>>>> rising pause times let me know or file a bug, and send us the
>>>>> log file and JVM options along with full platform information.
>>>>>
>>>>> I'll run some metrics from your log file if you send me the platform
>>>>> info requested above, and that may perhaps reveal a few more secrets.
>>>>>
>>>>> later.
>>>>> -- ramki
>>>>>
>>>>> On 05/12/10 15:19, Matt Fowles wrote:
>>>>>>
>>>>>> All~
>>>>>>
>>>>>> I have a large app that produces ~4g of garbage every 30 seconds and
>>>>>> am trying to reduce the size of gc outliers.  About 99% of this data
>>>>>> is garbage, but almost anything that survives one collection survives
>>>>>> for an indeterminately long amount of time.  We are currently using
>>>>>> the following VM and options:
>>>>>>
>>>>>> java version "1.6.0_20"
>>>>>> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
>>>>>> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>>>>>>
>>>>>>              -verbose:gc
>>>>>>              -XX:+PrintGCTimeStamps
>>>>>>              -XX:+PrintGCDetails
>>>>>>              -XX:+PrintGCTaskTimeStamps
>>>>>>              -XX:+PrintTenuringDistribution
>>>>>>              -XX:+PrintCommandLineFlags
>>>>>>              -XX:+PrintReferenceGC
>>>>>>              -Xms32g -Xmx32g -Xmn4g
>>>>>>              -XX:+UseParNewGC
>>>>>>              -XX:ParallelGCThreads=4
>>>>>>              -XX:+UseConcMarkSweepGC
>>>>>>              -XX:ParallelCMSThreads=4
>>>>>>              -XX:CMSInitiatingOccupancyFraction=60
>>>>>>              -XX:+UseCMSInitiatingOccupancyOnly
>>>>>>              -XX:+CMSParallelRemarkEnabled
>>>>>>              -XX:MaxGCPauseMillis=50
>>>>>>              -Xloggc:gc.log
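>>>>>>
>>>>>> (For scale, ~4g of garbage every 30 seconds works out to roughly
>>>>>> 4096 MB / 30 s ~= 137 MB/s of sustained allocation, nearly all of it
>>>>>> dying within a single young collection.)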
>>>>>>
>>>>>>
>>>>>> As you can see from the GC log, we never actually reach the point
>>>>>> where the CMS kicks in (after app startup).  But our young gen seems
>>>>>> to take increasingly long to collect as time goes by.
>>>>>>
>>>>>> The steady state of the app is reached around 956.392 seconds into the
>>>>>> log with a collection that takes 0.106 seconds.  Thereafter the survivor
>>>>>> space remains roughly constantly filled and the amount promoted to the
>>>>>> old gen also remains constant, but the collection times increase to
>>>>>> 2.855 seconds by the end of the 3.5 hour run.
>>>>>>
>>>>>> Has anyone seen this sort of behavior before?  Are there more switches
>>>>>> that I should try running with?
>>>>>>
>>>>>> Obviously, I am working to profile the app and reduce the garbage load
>>>>>> in parallel.  But if I still see this sort of problem, it is only a
>>>>>> question of how long the app must run before I see unacceptable
>>>>>> latency spikes.
>>>>>>
>>>>>> Matt
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>>
>>>>>> _______________________________________________
>>>>>> hotspot-gc-use mailing list
>>>>>> hotspot-gc-use at openjdk.java.net
>>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>
>>>
>>>
>>
>
[Attachment: r229-vm-tweaks-gc3.log.bz2, application/x-bzip2, 9269 bytes
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20100514/47a15a46/attachment.bin]

