Extremely long ParNew/CMS promotion failure scenario?

Srinivas Ramakrishna ysr1729 at gmail.com
Thu Oct 18 15:47:30 PDT 2012


Thanks Peter... the possibility of paging, or a related virtual-memory
issue, did occur to me, especially because system time shows up as
somewhat high here. The problem is that this server runs without swap :-)
so the time must be going elsewhere.

The cache miss theory is interesting (although that would not show up as
system time), and your back-of-the-envelope calculation gives about
0.8 us for fetching a cache line. I am pretty sure, though, that the
hardware prefetcher would figure out the misses and stream in the cache
lines, since as you say we are walking in address order. I'd expect it to
be no worse than an "initial mark pause on a full Eden", give or take a
little, and this is some 30x worse.
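
For concreteness, the back-of-the-envelope numbers work out roughly as
follows (the ~40-byte average object size is Peter's assumed round number,
not a measurement):

  40 GB / ~40 B per object   ~= 1e9 objects in Eden
  10 min                      = 600 s
  600 s / 1e9 objects        ~= 0.6 us per object

which is the same order of magnitude as the ~0.8 us per cache line figure
above.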

One possibility I am looking at is the part where we self-loop. I suspect
the ParNew/CMS combination running with multiple worker threads is hit
hard here, especially if the failure happens very early -- from what I saw
of that code recently, we don't consult the flag that says we have already
failed and should just return and self-loop. Rather, we retry the
allocation for each subsequent object, fail it, and only then self-loop.
The repeated failed attempts might be adding up, especially since each
attempt involves going to the shared pool. I'll look at how that is done,
and see whether we can fast-fail after the first failure happens, rather
than try to do the rest of the scavenge, since we'll need to do a fixup
pass anyway.
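
To make the suspected slow path concrete, here is a minimal, purely
illustrative C++ sketch of the control flow I mean (all names and types
here are made up for illustration -- this is not the actual ParNew/DefNew
code):

  // Hypothetical sketch only; not HotSpot code.
  #include <atomic>
  #include <cstddef>

  struct Obj { std::size_t size; Obj* forwardee; };

  static std::atomic<bool> promotion_failed{false};

  // Stand-in for a contended allocation out of the shared (old-gen) pool;
  // returns nullptr to model the "old gen is full" case.
  static Obj* allocate_from_shared_pool(std::size_t /*size*/) { return nullptr; }

  static void self_loop(Obj* o) { o->forwardee = o; }  // mark as failed-to-promote

  // Current behavior as I read the code: every object that fails to copy
  // still pays for a (failing) shared-pool allocation attempt first.
  void promote_slow_path(Obj* o) {
    if (Obj* dest = allocate_from_shared_pool(o->size)) {
      o->forwardee = dest;
    } else {
      promotion_failed.store(true, std::memory_order_relaxed);
      self_loop(o);
    }
  }

  // Possible fast-fail: once any worker has recorded a failure, skip the
  // shared-pool attempt entirely and self-loop right away.
  void promote_slow_path_fast_fail(Obj* o) {
    if (promotion_failed.load(std::memory_order_relaxed)) {
      self_loop(o);
      return;
    }
    promote_slow_path(o);
  }

The point is only that the cost of the failing shared-pool attempt, paid
once per surviving object across all the worker threads, could plausibly
add up over hundreds of millions of objects.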

Thanks for the discussion, and I'll post an update as and when I do some
more investigation. Keep those ideas coming, and I'll submit a bug report
once I have spent a few more cycles looking at the available data and
ruminating.

- ramki

On Thu, Oct 18, 2012 at 1:20 PM, Peter B. Kessler <Peter.B.Kessler at oracle.com> wrote:

> IIRC, promotion failure still has to finish the evacuation attempt (and
> some objects may get promoted while the ones that fail get self-looped).
>  That part is the usual multi-threaded object graph walk, with failed PLAB
> allocations thrown in to slow you down.  Then you get to start the pass
> that deals with the self-loops, which you say is single-threaded.  Undoing
> the self-loops is in address order, but it walks by the object sizes, so
> probably it mostly misses in the cache.  40GB at the average object size
> (call them 40 bytes to make the math easy) is a lot of cache misses.  How
> fast is your memory system?  Probably faster than (10 minutes / (40GB /
> 40 bytes)) per cache miss.
>
> Is it possible you are paging?  Maybe not when things are running
> smoothly, but maybe a 10-minute stall on one service causes things to back
> up (and grow the heaps of) other services on the same machine?  I'm guessing.
>
>                         ... peter
>
> Srinivas Ramakrishna wrote:
>
>>
>> Has anyone come across extremely long (upwards of 10 minutes) promotion
>> failure unwinding scenarios when using any of the collectors, but
>> especially with ParNew/CMS?
>> I recently came across one such occurrence with ParNew/CMS that, with a
>> 40 GB young gen, took upwards of 10 minutes to "unwind". I looked through
>> the code and I can see
>> that the unwinding steps can be a source of slowdown as we iterate
>> single-threaded (DefNew) through the large Eden to fix up self-forwarded
>> objects, but that still wouldn't
>> seem to explain such a large pause, even with a 40 GB young gen. I am
>> looking through the promotion failure paths to see what might be the cause
>> of such a large pause,
>> but if anyone has experienced this kind of scenario before or has any
>> conjectures or insights, I'd appreciate it.
>>
>> thanks!
>> -- ramki
>>
>>
>
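
For anyone following along who hasn't read that code: the fixup pass being
discussed is, conceptually, a single-threaded walk over Eden in address
order, advancing by each object's size and undoing the self-forwarding
installed at promotion failure. A rough, purely illustrative sketch
(hypothetical names and types; the real logic lives in the DefNew/ParNew
promotion-failure handling):

  #include <cstddef>

  struct Obj {
    std::size_t size_in_bytes;  // in the real VM this comes from the object header
    Obj*        forwardee;      // == this when the object failed to promote (self-loop)
  };

  // Walk [bottom, top) in address order, one object at a time. For a 40 GB
  // Eden each header read is likely a cache miss, which is where the
  // per-object cost in the arithmetic above comes from.
  void remove_self_forwarding(char* bottom, char* top) {
    char* cur = bottom;
    while (cur < top) {
      Obj* obj = reinterpret_cast<Obj*>(cur);
      if (obj->forwardee == obj) {
        obj->forwardee = nullptr;  // undo the self-loop; the real code also restores the mark word
      }
      cur += obj->size_in_bytes;   // advance by the object's size to the next object
    }
  }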