YGC time increasing suddenly
Luciano Molinari
lucmolinari at gmail.com
Thu Dec 19 09:00:47 PST 2013
Bernd and Wolfgang, thanks for your quick answers. It took me some time to
answer because I was running some tests based on your comments.
*Bernd:* I would look at the finalizer queue first.
*A:* From what I could find in the code, it doesn't seem to have explicit
finalizers. Is there any way to check this queue? I found some articles
about the problems the finalize() method may cause, but I wasn't able to find
anything about monitoring this queue.
*Bernd:* And if that does not cut it, take a heap dump and inspect it for
unexpectedly large dominators (maybe cached soft references - not sure about
RMI DGC, haven't seen problems with it, but it sure can be a problem if it
only cleans up once an hour).
*A:* Regarding RMI, I ran some tests replacing it with JeroMQ, but
unfortunately I got the same results. As for the heap dump, Eclipse MAT shows
almost nothing (only ~50 MB) because the majority of objects are unreachable.
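If it helps, I can take another dump that keeps the unreachable objects, along
the lines of this sketch using the HotSpot diagnostic MXBean (the class and file
names are just examples; passing false should make the dump include objects that
are no longer reachable):

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // live = false: keep unreachable (not yet collected) objects in the dump
        diag.dumpHeap("all-objects.hprof", false);
    }
}

As far as I can tell MAT also has a "keep unreachable objects" preference, so
they would at least show up in the histogram.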
*Bernd:* How often do you see YGC at the beginning and then over time? It
looks like every 2s? You might want to resize YGC by larger factors (but
with the yg already at 4g I guess something else is a problem here).
*A:* After I start my tests, YGC occurs once or twice every 3 seconds, as
the following log shows:
jstat -gcutil 29331 3000
  S0     S1      E      O      P    YGC     YGCT  FGC    FGCT      GCT
 1.40   0.00  89.74   2.13  11.86   602   12.126    1   0.086   12.212
 1.64   0.00  66.92   2.13  11.86   604   12.166    1   0.086   12.252
 1.38   0.00  41.10   2.13  11.86   606   12.204    1   0.086   12.290
 1.47   0.00  10.86   2.13  11.86   608   12.244    1   0.086   12.330
 0.00   1.47  89.35   2.13  11.86   609   12.265    1   0.086   12.351
 0.00   1.51  62.11   2.13  11.86   611   12.305    1   0.086   12.391
 0.00   1.38  32.83   2.14  11.86   613   12.346    1   0.086   12.432
 0.00   0.96  11.06   2.21  11.86   615   12.386    1   0.086   12.472
 0.97   0.00  72.35   2.22  11.86   616   12.406    1   0.086   12.492
It keeps this rate for the whole run; the only difference is that the
collections start to last longer.
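As a rough sanity check on the numbers above: (12.406 - 12.126) s of accumulated
young GC time over (616 - 602) = 14 collections works out to roughly 20 ms per
collection during this window; once the degradation kicks in, the individual
pauses move towards the ~114 ms ParNew pause shown in my first e-mail (versus
~27 ms before).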
*Bernd:* You claim that most of the data only lives for 100 ms; that does
not match the age-size distribution (neither at the beginning nor at the
end).
*A:* I said that for two reasons. Firstly, as the table below shows, most
transactions last < 25 ms:
| interval              | number of transactions | %      |
|-----------------------+------------------------+--------|
| 0 ms   <= n < 25 ms   |                7487644 | 97.704 |
| 25 ms  <= n < 50 ms   |                 137146 |  1.790 |
| 50 ms  <= n < 75 ms   |                  26422 |  0.345 |
| 75 ms  <= n < 100 ms  |                   8086 |  0.106 |
| 100 ms <= n < 200 ms  |                   4081 |  0.053 |
| 200 ms <= n < 500 ms  |                    216 |  0.003 |
| 500 ms <= n < 1000 ms |                      0 |  0.000 |
And secondly, very few objects are promoted to old gen.
*Wolfgang*, what you said about the survivor spaces also seems to make sense,
but I ran some tests with -XX:SurvivorRatio=8 and -XX:SurvivorRatio=16 and
the results were pretty much the same.
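Just to put numbers on that (assuming I have the SurvivorRatio arithmetic right,
i.e. the young gen is split into eden plus 2 survivor spaces with eden:survivor
= ratio:1, and NewSize=4608m):

SurvivorRatio=4  -> ~768 MB per survivor space, ~3072 MB eden
SurvivorRatio=8  -> ~460 MB per survivor space, ~3686 MB eden
SurvivorRatio=16 -> ~256 MB per survivor space, ~4096 MB eden

So the extra eden gained with the larger ratios is fairly modest, which may be
why the results barely changed.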
I also collected some data using the "sar -B" and vmstat commands to try to
find out something else.
sar -B
12:58:33 PM  pgpgin/s  pgpgout/s   fault/s  majflt/s
12:58:43 PM      0.00       5.19     16.98      0.00
12:58:53 PM      0.00       6.80     20.70      0.00
12:59:03 PM      0.00      12.81     16.72      0.00
12:59:13 PM      0.00       3.60     17.98      0.00
12:59:23 PM      0.00      14.81    118.42      0.00
12:59:33 PM      0.00      11.20     90.70      0.00
12:59:43 PM      0.00       5.20    662.60      0.00   (here GC started to take longer)
12:59:53 PM      0.00       5.20   1313.10      0.00
01:00:03 PM      0.00      20.42    960.66      0.00
01:00:13 PM      0.00      17.18    620.78      0.00
01:00:23 PM      0.00       3.60    725.93      0.00
01:00:33 PM      0.00      15.18    465.13      0.00
01:00:33 PM  pgpgin/s  pgpgout/s   fault/s  majflt/s
01:00:43 PM      0.00      12.01    508.31      0.00
01:00:53 PM      0.00       6.00    588.50      0.00
01:01:03 PM      0.00      20.00    660.80      0.00
01:01:13 PM      0.00       6.79    553.05      0.00
Page faults start to increase along with the degradation, but I'm not 100%
sure about this relation, mainly because there's a lot of free memory, as
vmstat shows below. However, I've seen people say that page faults may occur
even when there is free memory.
vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd     free   buff  cache  si  so  bi  bo    in     cs us sy id wa st
34  0     0 10803608 196472 925120   0   0   0   4 64804 109417 49  7 45  0  0
17  0     0 10802604 196472 925120   0   0   0  14 66130 111493 52  7 41  0  0
22  0     0 10795060 196472 925120   0   0   0  12 65331 110577 49  7 45  0  0
20  0     0 10758080 196472 925120   0   0   0   4 65222 111041 48  7 45  0  0
23  0     0 10712208 196472 925120   0   0   0   7 64759 110016 49  7 45  0  0
 8  0     0 10682828 196472 925140   0   0   0  33 64780 109899 49  7 44  0  0
17  0     0 10655280 196472 925140   0   0   0   5 64321 109619 50  7 44  0  0
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b  swpd     free   buff  cache  si  so  bi  bo    in     cs us sy id wa st
17  0     0 10636300 196472 925140   0   0   0  12 64574 108885 50  7 44  0  0
 4  0     0 10614888 196472 925140   0   0   0   5 63384 107379 49  7 44  0  0
18  0     0 10595172 196472 925140   0   0   0  14 65450 110004 50  7 43  0  0
28  0     0 10576420 196472 925140   0   0   0   4 64720 109119 48  7 45  0  0
29  0     0 10554908 196472 925140   0   0   0  25 64051 108606 51  7 42  0  0
33  0     0 10537584 196472 925140   0   0   0  11 64501 109765 50  7 43  0  0
24  0     0 10521128 196472 925140   0   0   0   5 64439 109538 51  7 42  0  0
It seems that vmstat doesn't show anything problematic.
Any other advice?
Thanks again.
On Wed, Dec 18, 2013 at 5:57 PM, Wolfgang Pedot
<wolfgang.pedot at finkzeit.at> wrote:
> Hi,
>
> this is the first time I write an answer on this mailing-list so this
> could be totally useless but here goes:
>
> Your survivor-space seems to be quite empty, is the usage that low on all
> collects during your test? If so you could increase your survivor-ratio to
> gain more eden-space and if not many objects die in survivor you could also
> reduce the tenuring threshold. Total survivor usage has grown 6-fold from
> first to last GC and survivor space needs to be copied on each young gc. I
> admit it should probably not take that long to copy 60MB though...
>
> Here is a young-gc from one of my logs for comparison:
>
> 30230.123: [ParNew
> Desired survivor size 524261784 bytes, new threshold 12 (max 15)
> - age 1: 113917760 bytes, 113917760 total
> - age 2: 86192768 bytes, 200110528 total
> - age 3: 59060992 bytes, 259171520 total
> - age 4: 59319272 bytes, 318490792 total
> - age 5: 45307432 bytes, 363798224 total
> - age 6: 29478464 bytes, 393276688 total
> - age 7: 27440744 bytes, 420717432 total
> - age 8: 27947680 bytes, 448665112 total
> - age 9: 27294496 bytes, 475959608 total
> - age 10: 32830144 bytes, 508789752 total
> - age 11: 7490968 bytes, 516280720 total
> - age 12: 10723104 bytes, 527003824 total
> - age 13: 4549808 bytes, 531553632 total
> : 4306611K->731392K(4388608K), 0.1433810 secs]
> 10422356K->6878961K(14116608K)
>
> This is with MaxNewSize 5500m and a Survivor-Ratio of 8. You can see that
> GC-time is higher than yours (6core 3.33GHz Xeon), survivor-usage is way
> higher though.
>
> Hope I could help
> Wolfgang
>
>
> On 18.12.2013 19:58, Luciano Molinari wrote:
>
>> Hi everybody,
>>
>> We have a standalone Java app that receives requests through RMI and
>> almost all the objects created by it are short-lived objects (< ~100 ms).
>> This app is running on a 24 cores server with 16 GB RAM (Red Hat Linux).
>> During our performance tests (10k requests/second) we started to face a
>> problem where the throughput decreases suddenly just a few minutes
>> after the app was started.
>> So, I started to investigate GC behaviour and to make some adjustments
>> (increase memory, use CMS...) and now we are able to run our app
>> properly for about 35 minutes. At this point the time spent during young
>> collections grows sharply although no Full GC is executed (old gen is
>> only ~4% full).
>>
>> I've done tests with many different parameters, but currently I'm using
>> the following ones:
>> java -server -verbose:gc -XX:+PrintGCDetails
>> -XX:+PrintTenuringDistribution -XX:+PrintGCTimeStamps
>> -XX:PrintFLSStatistics=1 -XX:SurvivorRatio=4
>> -XX:ParallelGCThreads=8 -XX:PermSize=256m -XX:+UseParNewGC
>> -XX:MaxPermSize=256m -Xms7g -Xmx7g -XX:NewSize=4608m -XX:MaxNewSize=4608m
>> -XX:MaxTenuringThreshold=15 -Dsun.rmi.dgc.client.gcInterval=3600000
>> -Dsun.rmi.dgc.server.gcInterval=3600000
>> -Djava.rmi.server.hostname=IP_ADDRESS
>>
>> If I use this same configuration (without CMS) the same problem occurs
>> after 20 minutes, so it doesn't seem to be related to CMS. Actually, as I
>> mentioned above, CMS (Full GC) isn't executed during the tests.
>>
>> Some logs I've collected:
>>
>> 1992.748: [ParNew
>> Desired survivor size 402653184 bytes, new threshold 15 (max 15)
>> - age 1: 9308728 bytes, 9308728 total
>> - age 2: 3448 bytes, 9312176 total
>> - age 3: 1080 bytes, 9313256 total
>> - age 4: 32 bytes, 9313288 total
>> - age 5: 34768 bytes, 9348056 total
>> - age 6: 32 bytes, 9348088 total
>> - age 15: 2712 bytes, 9350800 total
>> : 3154710K->10313K(3932160K), 0.0273150 secs] 3215786K->71392K(6553600K)
>>
>> //14 YGC happened during this window
>>
>> 2021.165: [ParNew
>> Desired survivor size 402653184 bytes, new threshold 15 (max 15)
>> - age 1: 9459544 bytes, 9459544 total
>> - age 2: 3648200 bytes, 13107744 total
>> - age 3: 3837976 bytes, 16945720 total
>> - age 4: 3472448 bytes, 20418168 total
>> - age 5: 3586896 bytes, 24005064 total
>> - age 6: 3475560 bytes, 27480624 total
>> - age 7: 3520952 bytes, 31001576 total
>> - age 8: 3612088 bytes, 34613664 total
>> - age 9: 3355160 bytes, 37968824 total
>> - age 10: 3823032 bytes, 41791856 total
>> - age 11: 3304576 bytes, 45096432 total
>> - age 12: 3671288 bytes, 48767720 total
>> - age 13: 3558696 bytes, 52326416 total
>> - age 14: 3805744 bytes, 56132160 total
>> - age 15: 3429672 bytes, 59561832 total
>> : 3230658K->77508K(3932160K), 0.1143860 secs] 3291757K->142447K(6553600K)
>>
>> Besides the longer time to perform collection, I also realized that all
>> 15 ages started to have larger values.
>>
>> I must say I'm a little confused about this scenario. Does anyone have
>> some tip?
>>
>> Thanks in advance,
>> --
>> Luciano
>>
>>
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>
>>
>
--
Luciano Davoglio Molinari