YGC time increasing suddenly

Jon Masamitsu jon.masamitsu at oracle.com
Thu Dec 19 12:54:37 PST 2013


On 12/19/2013 09:00 AM, Luciano Molinari wrote:
> Bernd and Wolfgang, thanks for your quick answers. I took some time to 
> answer them because I was running some tests based on your comments.
>
> *Bernd:* I would look at the finalizer queue first.
> *A:* From what I could find in the code, it doesn't seem to have 
> explicit finalizers. Is there any way to check this queue? I found some 
> articles about the problems the finalize() method may cause, but I wasn't 
> able to find anything related to monitoring this queue.

Try -XX:+PrintReferenceGC

Output looks like

0.247: [GC (Allocation Failure) 0.247: [ParNew
0.521: [SoftReference, 0 refs, 0.0000775 secs]
0.521: [WeakReference, 6 refs, 0.0000239 secs]
0.521: [FinalReference, 6 refs, 0.0000551 secs]
0.521: [PhantomReference, 0 refs, 0.0000745 secs]
0.521: [JNI Weak Reference, 0.0000217 secs]
: 34944K->4352K(39296K), 0.2746585 secs] 34944K->23967K(126720K), 0.2747496 secs]
[Times: user=0.77 sys=0.10, real=0.27 secs]

Jon
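
If you also want to watch that backlog from inside the application, below is a
minimal sketch using the standard java.lang.management API (the class name and
polling interval are arbitrary, and it would have to run in the same JVM, e.g.
as a small background thread in the app):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;

    /** Prints the number of objects waiting for finalization once per second. */
    public class FinalizerQueueWatcher {
        public static void main(String[] args) throws InterruptedException {
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            while (true) {
                // A count that keeps growing points at a finalizer backlog.
                System.out.println("objects pending finalization: "
                        + memory.getObjectPendingFinalizationCount());
                Thread.sleep(1000);
            }
        }
    }

A steadily growing count here, or growing FinalReference counts in the
-XX:+PrintReferenceGC output, would mean finalize() is in play after all.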

>
> *Bernd:* And if that does not cut it, take a heap dump and inspect it 
> for unexpectedly large dominators (maybe cached soft references - not 
> sure about RMI DGC, haven't seen problems with it, but it sure can be a 
> problem if it only cleans up once an hour).
> *A:* Regarding RMI, I ran some tests replacing it with JeroMQ, but 
> unfortunately I got the same results. As for the heap dump, Eclipse MAT 
> shows almost nothing (only ~50 MB) because the majority of objects are 
> unreachable.
>
> *Bernd:* How often do you see YGC at the beginning and then over time? 
> It looks like every 2s? You might want to resize the young gen by larger 
> factors (but with the yg already at 4g I guess something else is a 
> problem here).
> *A:* After I start my tests, YGC occurs once or twice every 3 seconds, 
> as the following log shows:
> jstat -gcutil 29331 3000
>   S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
>   1.40   0.00  89.74   2.13  11.86    602   12.126     1  0.086   12.212
>   1.64   0.00  66.92   2.13  11.86    604   12.166     1  0.086   12.252
>   1.38   0.00  41.10   2.13  11.86    606   12.204     1  0.086   12.290
>   1.47   0.00  10.86   2.13  11.86    608   12.244     1  0.086   12.330
>   0.00   1.47  89.35   2.13  11.86    609   12.265     1  0.086   12.351
>   0.00   1.51  62.11   2.13  11.86    611   12.305     1  0.086   12.391
>   0.00   1.38  32.83   2.14  11.86    613   12.346     1  0.086   12.432
>   0.00   0.96  11.06   2.21  11.86    615   12.386     1  0.086   12.472
>   0.97   0.00  72.35   2.22  11.86    616   12.406     1  0.086   12.492
> It keeps this rate during the whole time; the only difference is that 
> the collections start to last longer.
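
For reference, reading the deltas off these samples: YGC goes from 602 to 616
(14 collections) over the roughly 24 seconds shown, i.e. about one young
collection every 1.7 s, and YGCT grows from 12.126 s to 12.406 s, i.e. roughly
20 ms per young collection at this point in the run.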
>
> *Bernd:* You claim that most of the data only lives for 100 ms; that 
> does not match the age-size distribution (neither at the beginning 
> nor at the end).
> *A:* I said that for two reasons. First, you can see from the log below 
> that most transactions last < 25 ms:
>
> |        interval         | number of transactions |    %   |
> |-------------------------|------------------------|--------|
> |    0 ms <= n <   25 ms  |                7487644 | 97.704 |
> |   25 ms <= n <   50 ms  |                 137146 |  1.790 |
> |   50 ms <= n <   75 ms  |                  26422 |  0.345 |
> |   75 ms <= n <  100 ms  |                   8086 |  0.106 |
> |  100 ms <= n <  200 ms  |                   4081 |  0.053 |
> |  200 ms <= n <  500 ms  |                    216 |  0.003 |
> |  500 ms <= n < 1000 ms  |                      0 |  0.000 |
>
> And second, very few objects are promoted to the old gen.
>
> *Wolfgang*, what you said about the survivor spaces also makes sense, but 
> I ran some tests with SurvivorRatio=8 and SurvivorRatio=16 and the 
> results were pretty much the same.
>
> I also collected some data using the "sar -B" and vmstat commands to 
> try to find out something else.
>
> sar -B
>
> 12:58:33 PM pgpgin/s pgpgout/s   fault/s majflt/s
> 12:58:43 PM      0.00      5.19     16.98      0.00
> 12:58:53 PM      0.00      6.80     20.70      0.00
> 12:59:03 PM      0.00     12.81     16.72      0.00
> 12:59:13 PM      0.00      3.60     17.98      0.00
> 12:59:23 PM      0.00     14.81    118.42      0.00
> 12:59:33 PM      0.00     11.20     90.70      0.00
> 12:59:43 PM      0.00      5.20    662.60      0.00 (here GC started to take longer)
> 12:59:53 PM      0.00      5.20   1313.10      0.00
> 01:00:03 PM      0.00     20.42    960.66      0.00
> 01:00:13 PM      0.00     17.18    620.78      0.00
> 01:00:23 PM      0.00      3.60    725.93      0.00
> 01:00:33 PM      0.00     15.18    465.13      0.00
> 01:00:33 PM pgpgin/s pgpgout/s   fault/s majflt/s
> 01:00:43 PM      0.00     12.01    508.31      0.00
> 01:00:53 PM      0.00      6.00    588.50      0.00
> 01:01:03 PM      0.00     20.00    660.80      0.00
> 01:01:13 PM      0.00      6.79    553.05      0.00
>
> Page faults start to increase along with the degradation problem, but 
> I'm not 100% sure about this relation, mainly because there's a lot of 
> free memory, as vmstat shows below. However, I saw some people saying 
> that page faults may occur even when there is free memory. (Note that 
> majflt/s stays at 0 the whole time, so these are minor faults rather 
> than pages being read back in from disk.)
>
> vmstat
>
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b swpd   free   buff  cache si   so    bi bo   in   cs us sy id wa st
> 34  0      0 10803608 196472 925120    0    0     0     4 64804 109417 49  7 45  0  0
> 17  0      0 10802604 196472 925120    0    0     0    14 66130 111493 52  7 41  0  0
> 22  0      0 10795060 196472 925120    0    0     0    12 65331 110577 49  7 45  0  0
> 20  0      0 10758080 196472 925120    0    0     0     4 65222 111041 48  7 45  0  0
> 23  0      0 10712208 196472 925120    0    0     0     7 64759 110016 49  7 45  0  0
>  8  0      0 10682828 196472 925140    0    0     0    33 64780 109899 49  7 44  0  0
> 17  0      0 10655280 196472 925140    0    0     0     5 64321 109619 50  7 44  0  0
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
>  r  b swpd   free   buff  cache si   so    bi bo   in   cs us sy id wa st
> 17  0      0 10636300 196472 925140    0    0     0    12 64574 108885 50  7 44  0  0
>  4  0      0 10614888 196472 925140    0    0     0     5 63384 107379 49  7 44  0  0
> 18  0      0 10595172 196472 925140    0    0     0    14 65450 110004 50  7 43  0  0
> 28  0      0 10576420 196472 925140    0    0     0     4 64720 109119 48  7 45  0  0
> 29  0      0 10554908 196472 925140    0    0     0    25 64051 108606 51  7 42  0  0
> 33  0      0 10537584 196472 925140    0    0     0    11 64501 109765 50  7 43  0  0
> 24  0      0 10521128 196472 925140    0    0     0     5 64439 109538 51  7 42  0  0
>
> It seems that vmstat doesn't show anything problematic.
>
> Any other advice?
>
> Thanks again.
>
>
> On Wed, Dec 18, 2013 at 5:57 PM, Wolfgang Pedot 
> <wolfgang.pedot at finkzeit.at> wrote:
>
>     Hi,
>
>     this is the first time I've written an answer on this mailing list,
>     so this could be totally useless, but here goes:
>
>     Your survivor space seems to be quite empty; is the usage that low
>     on all collections during your test? If so, you could increase your
>     survivor ratio to gain more eden space, and if not many objects die
>     in the survivor space you could also reduce the tenuring threshold.
>     Total survivor usage has grown 6-fold from first to last GC, and
>     survivor space needs to be copied on each young GC. I admit it
>     should probably not take that long to copy 60 MB though...
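
For scale, assuming the young-gen flags quoted further down (-XX:NewSize=4608m
with -XX:SurvivorRatio=4): each survivor space is 4608/(4+2) = 768 MB and eden
is 3072 MB, so raising SurvivorRatio to 8 would shrink each survivor space to
roughly 460 MB and grow eden to roughly 3686 MB.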
>
>     Here is a young-gc from one of my logs for comparison:
>
>     30230.123: [ParNew
>     Desired survivor size 524261784 bytes, new threshold 12 (max 15)
>     - age   1:  113917760 bytes,  113917760 total
>     - age   2:   86192768 bytes,  200110528 total
>     - age   3:   59060992 bytes,  259171520 total
>     - age   4:   59319272 bytes,  318490792 total
>     - age   5:   45307432 bytes,  363798224 total
>     - age   6:   29478464 bytes,  393276688 total
>     - age   7:   27440744 bytes,  420717432 total
>     - age   8:   27947680 bytes,  448665112 total
>     - age   9:   27294496 bytes,  475959608 total
>     - age  10:   32830144 bytes,  508789752 total
>     - age  11:    7490968 bytes,  516280720 total
>     - age  12:   10723104 bytes,  527003824 total
>     - age  13:    4549808 bytes,  531553632 total
>     : 4306611K->731392K(4388608K), 0.1433810 secs]
>     10422356K->6878961K(14116608K)
>
>     This is with MaxNewSize=5500m and a SurvivorRatio of 8. You can
>     see that the GC time is higher than yours (6-core 3.33 GHz Xeon),
>     but the survivor usage is also way higher.
>
>     Hope I could help
>     Wolfgang
>
>
>     Am 18.12.2013 19:58, schrieb Luciano Molinari:
>
>         Hi everybody,
>
>         We have a standalone Java app that receives requests through
>         RMI, and almost all the objects it creates are short-lived
>         (< ~100 ms). The app is running on a 24-core server with 16 GB
>         RAM (Red Hat Linux). During our performance tests
>         (10k requests/second) we started to face a problem where the
>         throughput decreases suddenly just a few minutes after the app
>         is started.
>         So I started to investigate GC behaviour and make some
>         adjustments (increase memory, use CMS...), and now we are able
>         to run our app properly for about 35 minutes. At that point the
>         time spent in young collections grows sharply, although no Full
>         GC is executed (the old gen is only ~4% full).
>
>         I've done tests with many different parameters, but currently
>         I'm using the following ones:
>         java -server -verbose:gc -XX:+PrintGCDetails
>         -XX:+PrintTenuringDistribution -XX:+PrintGCTimeStamps
>         -XX:PrintFLSStatistics=1 -XX:SurvivorRatio=4
>         -XX:ParallelGCThreads=8 -XX:PermSize=256m -XX:+UseParNewGC
>         -XX:MaxPermSize=256m -Xms7g -Xmx7g -XX:NewSize=4608m
>         -XX:MaxNewSize=4608m
>         -XX:MaxTenuringThreshold=15
>         -Dsun.rmi.dgc.client.gcInterval=3600000
>         -Dsun.rmi.dgc.server.gcInterval=3600000
>           -Djava.rmi.server.hostname=IP_ADDRESS
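
A few numbers implied by these flags, for orientation: the old gen is
7168m - 4608m = 2560 MB; the "Desired survivor size 402653184 bytes" in the
logs below is 384 MB, i.e. 50% (the default TargetSurvivorRatio) of the 768 MB
survivor space that SurvivorRatio=4 gives; and sun.rmi.dgc.*.gcInterval=3600000
is exactly the once-an-hour RMI DGC cleanup Bernd referred to.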
>
>         If I use this same configuration (without CMS), the same problem
>         occurs after 20 minutes, so it doesn't seem to be related to CMS.
>         Actually, as I mentioned above, CMS (Full GC) isn't executed
>         during the tests.
>
>         Some logs I've collected:
>
>         1992.748: [ParNew
>         Desired survivor size 402653184 bytes, new threshold 15 (max 15)
>         - age   1:    9308728 bytes,    9308728 total
>         - age   2:       3448 bytes,    9312176 total
>         - age   3:       1080 bytes,    9313256 total
>         - age   4:         32 bytes,    9313288 total
>         - age   5:      34768 bytes,    9348056 total
>         - age   6:         32 bytes,    9348088 total
>         - age  15:       2712 bytes,    9350800 total
>         : 3154710K->10313K(3932160K), 0.0273150 secs]
>         3215786K->71392K(6553600K)
>
>         //14 YGC happened during this window
>
>         2021.165: [ParNew
>         Desired survivor size 402653184 bytes, new threshold 15 (max 15)
>         - age   1:    9459544 bytes,    9459544 total
>         - age   2:    3648200 bytes,   13107744 total
>         - age   3:    3837976 bytes,   16945720 total
>         - age   4:    3472448 bytes,   20418168 total
>         - age   5:    3586896 bytes,   24005064 total
>         - age   6:    3475560 bytes,   27480624 total
>         - age   7:    3520952 bytes,   31001576 total
>         - age   8:    3612088 bytes,   34613664 total
>         - age   9:    3355160 bytes,   37968824 total
>         - age  10:    3823032 bytes,   41791856 total
>         - age  11:    3304576 bytes,   45096432 total
>         - age  12:    3671288 bytes,   48767720 total
>         - age  13:    3558696 bytes,   52326416 total
>         - age  14:    3805744 bytes,   56132160 total
>         - age  15:    3429672 bytes,   59561832 total
>         : 3230658K->77508K(3932160K), 0.1143860 secs]
>         3291757K->142447K(6553600K)
>
>         Besides the longer collection times, I also noticed that all
>         15 ages started to show larger values.
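
For scale: between the two collections above, the surviving bytes grow from
about 9.35 MB to about 59.6 MB (roughly 6x) while the ParNew pause grows from
0.027 s to 0.114 s (roughly 4x), so the longer pauses line up with the larger
amount of surviving data being copied each time.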
>
>         I must say I'm a little confused by this scenario. Does anyone
>         have any tips?
>
>         Thanks in advance,
>         --
>         Luciano
>
>
>         _______________________________________________
>         hotspot-gc-use mailing list
>         hotspot-gc-use at openjdk.java.net
>         http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>
>
>
>
>
> -- 
> Luciano Davoglio Molinari
>
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
