GC_Locker turning CMS into a stop-the-world collector

Chi Ho Kwok chkwok at digibites.nl
Sat Jan 23 16:52:21 PST 2010


Just a little update from our side:

The situation has improved a lot since we switched to sockets on the HTTP
side, but file loading still blocks threads once in a while, on a
MappedByteBuffer.read() call. Tuning CMS to hurry up and finish ASAP by
disabling abortable-preclean and setting the concurrent threads to 6 helps
(hw: 2x4 core previous-gen Xeons), but the app still throws a "503 we're
too busy, try again later" a few times per hour during rush hour. That
happens when a thread gets stuck waiting on the semaphore that limits the
number of threads in the loader (to prevent stuck -> unstuck -> 60 threads
fighting for the CPU while nothing gets done) and the 6 second timeout
expires.
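
For context, the loader gate is roughly the following sketch (the class
name, permit count and helper method are made up for illustration, not our
actual code):

import java.io.File;
import java.nio.ByteBuffer;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

final class LoaderGate {
    // Limits how many threads may be inside the file loader at once, so a
    // brief stall doesn't turn into 60 threads fighting for the CPU.
    private static final Semaphore PERMITS = new Semaphore(8);

    ByteBuffer load(File file) throws Exception {
        // Give up after 6 seconds and answer 503 instead of queueing forever.
        if (!PERMITS.tryAcquire(6, TimeUnit.SECONDS)) {
            throw new Exception("503 we're too busy, try again later");
        }
        try {
            return readFile(file); // the mapped / stream-based read
        } finally {
            PERMITS.release();
        }
    }

    private ByteBuffer readFile(File file) {
        return ByteBuffer.allocate(0); // placeholder for the real loader
    }
}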

That's why we've decided to rewrite every single NIO file operation into a
stream-based one; that's now done and deployed on the server for testing,
which should solve our problem for the moment. It's kind of annoying
though, as most of the code expects a ByteBuffer to work on, so instead of
just mapping the file we have to allocate a buffer and fill it with
FileInputStream.read() in a loop, producing extra garbage.
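
The replacement looks roughly like the sketch below (simplified; the loop
is needed because read() may return fewer bytes than requested):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

final class StreamLoader {
    // Reads the whole file into a heap ByteBuffer instead of mapping it.
    static ByteBuffer readFully(File file) throws IOException {
        byte[] bytes = new byte[(int) file.length()];
        FileInputStream in = new FileInputStream(file);
        try {
            int off = 0;
            while (off < bytes.length) {
                int n = in.read(bytes, off, bytes.length - off);
                if (n < 0) {
                    throw new IOException("unexpected EOF in " + file);
                }
                off += n;
            }
        } finally {
            in.close();
        }
        return ByteBuffer.wrap(bytes); // extra garbage compared to mapping
    }
}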

Chi Ho Kwok

On Tue, Jan 19, 2010 at 6:07 AM, Y. Srinivas Ramakrishna <
Y.S.Ramakrishna at sun.com> wrote:

> Hi Chi Ho, Doug --
>
> Thanks for the data, the test case and the summary of your
> observations. I have not tried the test case yet (US holiday today)
> but will be looking at this carefully this week and update you.
> From the behaviour you have described in email and knowing the
> implementation of the GC locker and the dependencies that arise here,
> I have a good hunch as to what is going on, but will update
> only after I have actually confirmed the hunch (or otherwise
> found the root cause).
>
> Thanks again for calling this in, and I look forward to
> posting an update soon (including the CR opened to track
> this issue).
>
> thanks!
>
> - ramki
>
> Chi Ho Kwok wrote:
>
>> Hi Doug,
>>
>> On Mon, Jan 18, 2010 at 9:32 PM, Jones, Doug <doug.jones2 at hp.com> wrote:
>>
>>> This is a long shot: in the logs below the problem behaviour appears to
>>> start in the abortable-preclean phase. That part of the CMS collection
>>> does some interesting things, but can, I believe, be disabled by setting
>>> CMSMaxAbortablePrecleanTime to 0.
>>>
>>> You might like to try running your test program with the
>>> abortable-preclean phase turned off ...
>>>
>>
>> Thanks, setting it to a very low value does help a lot. The test case
>> has been changed to increase reporting accuracy: DirectBufferStresser.run()
>> now records the time before and after every buffer operation and logs any
>> call that takes longer than 100ms; the other stderr warnings are disabled.
>> The log below was taken with the abortable preclean time set to zero.
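>>
>> The timing itself is nothing fancy, roughly the following (where
>> doOneBufferOperation() is just a stand-in for the actual direct buffer
>> call in the test):
>>
>> long start = System.currentTimeMillis();
>> doOneBufferOperation();
>> long elapsed = System.currentTimeMillis() - start;
>> if (elapsed > 100) {
>>     System.out.println("Last buffer operation took " + elapsed + " ms");
>> }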
>>
>> All OK:
>>
>> Mem: 842, executed 450 direct buffer operations, 1 load operations.
>>
>> CMS Start:
>>
>> 48.033: [GC [1 CMS-initial-mark: 1545976K(1835008K)]
>> 1614526K(2031616K), 0.0135188 secs] [Times: user=0.00 sys=0.00,
>> real=0.01 secs]
>> 48.047: [CMS-concurrent-mark-start]
>>
>> Workers got stuck:
>>
>> Mem: 842, executed 0 direct buffer operations, 3 load operations.
>> 48.505: [CMS-concurrent-mark: 0.458/0.458 secs] [Times: user=0.45
>> sys=0.00, real=0.46 secs]
>> 48.505: [CMS-concurrent-preclean-start]
>> 48.510: [CMS-concurrent-preclean: 0.004/0.004 secs] [Times: user=0.00
>> sys=0.00, real=0.00 secs]
>>
>> Preclean aborted in 0.02s real time:
>>
>> 48.510: [CMS-concurrent-abortable-preclean-start]
>>  CMS: abort preclean due to time 48.530:
>> [CMS-concurrent-abortable-preclean: 0.020/0.020 secs] [Times:
>> user=0.03 sys=0.00, real=0.02 secs]
>> 48.530: [GC[YG occupancy: 83340 K (196608 K)]48.530: [Rescan
>> (parallel) , 0.0095904 secs]48.540: [weak refs processing, 0.0000068
>> secs] [1 CMS-remark: 1545976K(1835008K)] 1629316K(2031616K), 0.0096730
>> secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
>>
>> Sweeping, threads still stuck:
>>
>> 48.540: [CMS-concurrent-sweep-start]
>> Mem: 842, executed 0 direct buffer operations, 2 load operations.
>> 48.782: [CMS-concurrent-sweep: 0.242/0.242 secs] [Times: user=0.27
>> sys=0.00, real=0.24 secs]
>> 48.782: [CMS-concurrent-reset-start]
>>
>> The GC is over; CMS-concurrent-reset doesn't seem to block anything, and
>> everything is back to normal, but the workers complain about the time
>> spent inside one buffer operation.
>>
>> Last buffer operation took 1001 ms
>> Mem: 684, executed 11 direct buffer operations, 3 load operations.
>> 48.791: [CMS-concurrent-reset: 0.009/0.009 secs] [Times: user=0.00
>> sys=0.00, real=0.01 secs]
>> Last buffer operation took 1010 ms
>> Mem: 684, executed 654 direct buffer operations, 2 load operations.
>>
>>
>> This is quite a bit better than the "Last buffer operation took 5971 ms"
>> we saw with the old VM arguments, but still, extrapolating to a 16G
>> heap, CMS will hold up work for about 8 seconds. Less bad than 20+, so
>> I'll be applying this workaround to prevent the last few NIO calls from
>> being stuck for too long. From what I've read, the abortable preclean
>> just does some work in advance of the remark so that the remark doesn't
>> take too long, and waits for a desired occupancy in Eden. I'll just cap
>> it at one second unless someone yells *don't*.
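>>
>> Concretely, capping it at one second would mean something like the flag
>> below (the value is in milliseconds; the default is 5000), alongside the
>> existing CMS flags:
>>
>> -XX:+UseConcMarkSweepGC -XX:CMSMaxAbortablePrecleanTime=1000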
>>
>>
>> Chi Ho Kwok
>>
>
>