YGC time increasing suddenly
Jon Masamitsu
jon.masamitsu at oracle.com
Thu Dec 19 12:54:37 PST 2013
On 12/19/2013 09:00 AM, Luciano Molinari wrote:
> Bernd and Wolfgang, thanks for your quick answers. I took some time to
> answer them because I was running some tests based on your comments.
>
> *Bernd:* I would look at the finalizer queue first.
> *A:* From what I could find in the code, it doesn't seem to have
> explicit finalizers. Is there any way to check this queue? I found some
> articles about the problems the finalize() method may cause, but I
> wasn't able to find anything related to monitoring this queue.
Try -XX:+PrintReferenceGC
Output looks like
0.247: [GC (Allocation Failure) 0.247: [ParNew0.521: [SoftReference, 0
refs, 0.0000775 secs]0.521: [WeakReference, 6 refs, 0.0000239
secs]0.521: [FinalReference, 6 refs, 0.0000551 secs]0.521:
[PhantomReference, 0 refs, 0.0000745 secs]0.521: [JNI Weak Reference,
0.0000217 secs]: 34944K->4352K(39296K), 0.2746585 secs]
34944K->23967K(126720K), 0.2747496 secs] [Times: user=0.77 sys=0.10,
real=0.27 secs]
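
If you also want to watch the queue from inside the application, the
pending-finalization count is exposed through java.lang.management; here is
a minimal polling sketch (illustrative only - the class name and the polling
interval are arbitrary):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class FinalizerQueueWatcher {
    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        while (true) {
            // Approximate number of objects whose finalize() has not run yet;
            // a steadily growing number points at a finalizer backlog.
            System.out.println("objects pending finalization: "
                    + memory.getObjectPendingFinalizationCount());
            Thread.sleep(1000);
        }
    }
}
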
Jon
>
> *Bernd:* And if that does not cut it, take a heap dump and inspect it
> for unexpectedly large dominators (maybe cached SoftReferences - not
> sure about RMI DGC, haven't seen problems with it, but it sure can be a
> problem if it only cleans up once an hour).
> *A:* Regarding RMI, I ran some tests replacing it with JeroMQ, but
> unfortunately I got the same results. As for the heap dump, Eclipse MAT
> shows almost nothing (only ~50 MB) because the majority of objects are
> unreachable.
>
> *Bernd:* How often do you see YGC at the beginning and then over time?
> It looks like every 2s? You might want to resize YGC by larger
> factors (but with the yg already at 4g I guess something else is a
> problem here).
> *A:* After I start my tests, a YGC occurs once or twice every 3 seconds,
> as the following log shows:
> jstat -gcutil 29331 3000
> S0 S1 E O P YGC YGCT FGC FGCT GCT
> 1.40 0.00 89.74 2.13 11.86 602 12.126 1 0.086 12.212
> 1.64 0.00 66.92 2.13 11.86 604 12.166 1 0.086 12.252
> 1.38 0.00 41.10 2.13 11.86 606 12.204 1 0.086 12.290
> 1.47 0.00 10.86 2.13 11.86 608 12.244 1 0.086 12.330
> 0.00 1.47 89.35 2.13 11.86 609 12.265 1 0.086 12.351
> 0.00 1.51 62.11 2.13 11.86 611 12.305 1 0.086 12.391
> 0.00 1.38 32.83 2.14 11.86 613 12.346 1 0.086 12.432
> 0.00 0.96 11.06 2.21 11.86 615 12.386 1 0.086 12.472
> 0.97 0.00 72.35 2.22 11.86 616 12.406 1 0.086 12.492
> It keeps this rate the whole time; the only difference is that the
> collections start to last longer.
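> (For scale, from the samples above: YGC goes from 602 to 616 while YGCT
> goes from 12.126 s to 12.492 s, i.e. 0.366 s over 14 collections, or
> roughly 26 ms per young GC in this window.)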
>
> *Bernd:* You claim that most of the data only lives for 100 ms, which
> does not match the age-size distribution (neither at the beginning nor
> at the end).
> *A:* I said that for two reasons. Firstly, you can see from the log
> below that most transactions last < 25 ms:
>
> | interval              | number of transactions |      % |
> |-----------------------+------------------------+--------|
> | 0 ms <= n < 25 ms     |                7487644 | 97.704 |
> | 25 ms <= n < 50 ms    |                 137146 |  1.790 |
> | 50 ms <= n < 75 ms    |                  26422 |  0.345 |
> | 75 ms <= n < 100 ms   |                   8086 |  0.106 |
> | 100 ms <= n < 200 ms  |                   4081 |  0.053 |
> | 200 ms <= n < 500 ms  |                    216 |  0.003 |
> | 500 ms <= n < 1000 ms |                      0 |  0.000 |
>
> And secondly, very few objects are promoted to old gen.
>
> *Wolfgang*, what you said about the survivor spaces also seems to make
> sense, but I ran some tests with SurvivorRatio=8 and SurvivorRatio=16
> and the results were pretty much the same.
>
> I also collected some data using the "sar -B" and vmstat commands to
> try to find out something else.
>
> sar -B
>
> 12:58:33 PM pgpgin/s pgpgout/s fault/s majflt/s
> 12:58:43 PM 0.00 5.19 16.98 0.00
> 12:58:53 PM 0.00 6.80 20.70 0.00
> 12:59:03 PM 0.00 12.81 16.72 0.00
> 12:59:13 PM 0.00 3.60 17.98 0.00
> 12:59:23 PM 0.00 14.81 118.42 0.00
> 12:59:33 PM 0.00 11.20 90.70 0.00
> 12:59:43 PM 0.00 5.20 662.60 0.00 (here GC started to take longer)
> 12:59:53 PM 0.00 5.20 1313.10 0.00
> 01:00:03 PM 0.00 20.42 960.66 0.00
> 01:00:13 PM 0.00 17.18 620.78 0.00
> 01:00:23 PM 0.00 3.60 725.93 0.00
> 01:00:33 PM 0.00 15.18 465.13 0.00
> 01:00:33 PM pgpgin/s pgpgout/s fault/s majflt/s
> 01:00:43 PM 0.00 12.01 508.31 0.00
> 01:00:53 PM 0.00 6.00 588.50 0.00
> 01:01:03 PM 0.00 20.00 660.80 0.00
> 01:01:13 PM 0.00 6.79 553.05 0.00
>
> Page faults start to increase along with the degradation problem, but
> I'm not 100% sure about this relation, mainly because there's a lot of
> free memory, as vmstat shows below (majflt/s stays at 0.00, so these are
> minor faults). However, I saw some people saying that page faults may
> occur even when there is free memory.
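>
> To check whether the faults come from the JVM process itself (and whether
> they are minor or major), something along these lines can help, assuming
> your ps supports the min_flt/maj_flt format specifiers:
>
> ps -o pid,min_flt,maj_flt -p 29331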
>
> vmstat
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 34 0 0 10803608 196472 925120 0 0 0 4 64804 109417 49 7 45 0 0
> 17 0 0 10802604 196472 925120 0 0 0 14 66130 111493 52 7 41 0 0
> 22 0 0 10795060 196472 925120 0 0 0 12 65331 110577 49 7 45 0 0
> 20 0 0 10758080 196472 925120 0 0 0 4 65222 111041 48 7 45 0 0
> 23 0 0 10712208 196472 925120 0 0 0 7 64759 110016 49 7 45 0 0
> 8 0 0 10682828 196472 925140 0 0 0 33 64780 109899 49 7 44 0 0
> 17 0 0 10655280 196472 925140 0 0 0 5 64321 109619 50 7 44 0 0
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 17 0 0 10636300 196472 925140 0 0 0 12 64574 108885 50 7 44 0 0
> 4 0 0 10614888 196472 925140 0 0 0 5 63384 107379 49 7 44 0 0
> 18 0 0 10595172 196472 925140 0 0 0 14 65450 110004 50 7 43 0 0
> 28 0 0 10576420 196472 925140 0 0 0 4 64720 109119 48 7 45 0 0
> 29 0 0 10554908 196472 925140 0 0 0 25 64051 108606 51 7 42 0 0
> 33 0 0 10537584 196472 925140 0 0 0 11 64501 109765 50 7 43 0 0
> 24 0 0 10521128 196472 925140 0 0 0 5 64439 109538 51 7 42 0 0
>
> It seems that vmstat doesn't show anything problematic.
>
> Any other advice?
>
> Thanks again.
>
>
> On Wed, Dec 18, 2013 at 5:57 PM, Wolfgang Pedot
> <wolfgang.pedot at finkzeit.at> wrote:
>
> Hi,
>
> this is the first time I've written an answer on this mailing list, so
> this could be totally useless, but here goes:
>
> Your survivor space seems to be quite empty; is the usage that low
> on all collections during your test? If so, you could increase your
> survivor ratio to gain more eden space, and if not many objects die
> in the survivor spaces you could also reduce the tenuring threshold.
> Total survivor usage has grown 6-fold from the first to the last GC,
> and the survivor space needs to be copied on each young GC. I admit it
> should probably not take that long to copy 60 MB though...
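>
> Just as an illustration (the values are arbitrary and would need to be
> tuned against your own logs), that would mean changing the flags you
> posted to something like
>
> -XX:SurvivorRatio=16 -XX:MaxTenuringThreshold=4
>
> instead of -XX:SurvivorRatio=4 and -XX:MaxTenuringThreshold=15.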
>
> Here is a young-gc from one of my logs for comparison:
>
> 30230.123: [ParNew
> Desired survivor size 524261784 bytes, new threshold 12 (max 15)
> - age 1: 113917760 bytes, 113917760 total
> - age 2: 86192768 bytes, 200110528 total
> - age 3: 59060992 bytes, 259171520 total
> - age 4: 59319272 bytes, 318490792 total
> - age 5: 45307432 bytes, 363798224 total
> - age 6: 29478464 bytes, 393276688 total
> - age 7: 27440744 bytes, 420717432 total
> - age 8: 27947680 bytes, 448665112 total
> - age 9: 27294496 bytes, 475959608 total
> - age 10: 32830144 bytes, 508789752 total
> - age 11: 7490968 bytes, 516280720 total
> - age 12: 10723104 bytes, 527003824 total
> - age 13: 4549808 bytes, 531553632 total
> : 4306611K->731392K(4388608K), 0.1433810 secs]
> 10422356K->6878961K(14116608K)
>
> This is with MaxNewSize=5500m and a SurvivorRatio of 8. You can see
> that the GC time is higher than yours (6-core 3.33 GHz Xeon), but the
> survivor usage is also way higher.
>
> Hope I could help
> Wolfgang
>
>
> On 18.12.2013 19:58, Luciano Molinari wrote:
>
> Hi everybody,
>
> We have a standalone Java app that receives requests through RMI,
> and almost all the objects it creates are short-lived (< ~100 ms).
> The app is running on a 24-core server with 16 GB RAM (Red Hat
> Linux).
> During our performance tests (10k requests/second) we started to
> face a problem where the throughput drops suddenly just a few
> minutes after the app is started.
> So I started to investigate the GC behaviour and to make some
> adjustments (increase memory, use CMS...), and now we are able to
> run our app properly for about 35 minutes. At that point the time
> spent in young collections grows sharply, although no Full GC is
> executed (the old gen is only ~4% full).
>
> I've done tests with many different parameters, but currently
> I'm using
> the following ones:
> java -server -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintTenuringDistribution -XX:+PrintGCTimeStamps
> -XX:PrintFLSStatistics=1 -XX:SurvivorRatio=4
> -XX:ParallelGCThreads=8 -XX:PermSize=256m -XX:+UseParNewGC
> -XX:MaxPermSize=256m -Xms7g -Xmx7g -XX:NewSize=4608m
> -XX:MaxNewSize=4608m
> -XX:MaxTenuringThreshold=15
> -Dsun.rmi.dgc.client.gcInterval=3600000
> -Dsun.rmi.dgc.server.gcInterval=3600000
> -Djava.rmi.server.hostname=IP_ADDRESS
>
> If I use this same configuration (without CMS) the same problem
> occurs after 20 minutes, so it doesn't seem to be related to CMS.
> Actually, as I mentioned above, CMS (Full GC) isn't executed during
> the tests.
>
> Some logs I've collected:
>
> 1992.748: [ParNew
> Desired survivor size 402653184 bytes, new threshold 15 (max 15)
> - age 1: 9308728 bytes, 9308728 total
> - age 2: 3448 bytes, 9312176 total
> - age 3: 1080 bytes, 9313256 total
> - age 4: 32 bytes, 9313288 total
> - age 5: 34768 bytes, 9348056 total
> - age 6: 32 bytes, 9348088 total
> - age 15: 2712 bytes, 9350800 total
> : 3154710K->10313K(3932160K), 0.0273150 secs]
> 3215786K->71392K(6553600K)
>
> // 14 YGCs happened during this window
>
> 2021.165: [ParNew
> Desired survivor size 402653184 bytes, new threshold 15 (max 15)
> - age 1: 9459544 bytes, 9459544 total
> - age 2: 3648200 bytes, 13107744 total
> - age 3: 3837976 bytes, 16945720 total
> - age 4: 3472448 bytes, 20418168 total
> - age 5: 3586896 bytes, 24005064 total
> - age 6: 3475560 bytes, 27480624 total
> - age 7: 3520952 bytes, 31001576 total
> - age 8: 3612088 bytes, 34613664 total
> - age 9: 3355160 bytes, 37968824 total
> - age 10: 3823032 bytes, 41791856 total
> - age 11: 3304576 bytes, 45096432 total
> - age 12: 3671288 bytes, 48767720 total
> - age 13: 3558696 bytes, 52326416 total
> - age 14: 3805744 bytes, 56132160 total
> - age 15: 3429672 bytes, 59561832 total
> : 3230658K->77508K(3932160K), 0.1143860 secs]
> 3291757K->142447K(6553600K)
>
> Besides the collections taking longer, I also noticed that all 15
> ages started to show larger values.
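> (Comparing the two logs above: the total survivor occupancy grows from
> roughly 9.3 MB to roughly 59.6 MB, and the pause grows from 0.027 s to
> 0.114 s.)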
>
> I must say I'm a little confused by this scenario. Does anyone have
> any tips?
>
> Thanks in advance,
> --
> Luciano
>
>
>
> --
> Luciano Davoglio Molinari
>
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use