Extremely long ParNew/CMS promotion failure scenario?
John O'Brien
jobrien at ieee.org
Fri Oct 19 11:13:51 PDT 2012
Srinivas,
I am interested in how this gets resolved. Can I confirm that you are
referring to the "GC--" events in the GC log, where the minor GC
unwinds and turns into a major GC?
I'd like to figure out whether this is something different from what
I've seen before. One case where I have seen GC times blow out is when
Transparent Huge Pages interfered, doing its coalescing at the same
time as a GC. Can you clarify what kernel version you are on and
whether huge/large pages are enabled?
That may or may not help you, but it will help me follow the
discussion better.
Thanks,
John
On Fri, Oct 19, 2012 at 6:36 AM, Charlie Hunt <chunt at salesforce.com> wrote:
> Interesting discussion. :-)
>
> Ramki's observation of high context switches to me suggests active locking
> as a possible culprit. Fwiw, based on your discussion it looks like you're
> headed down a path that makes sense.
>
> charlie...
>
> On Oct 19, 2012, at 3:40 AM, Srinivas Ramakrishna wrote:
>
>
>
> On Thu, Oct 18, 2012 at 5:27 PM, Peter B. Kessler
> <Peter.B.Kessler at oracle.com> wrote:
>>
>> When there's no room in the old generation and a worker has filled its
>> PLAB to capacity, but it still has instances to try to promote, does it try
>> to allocate a new PLAB, and fail? That would lead to each of the workers
>> eventually failing to allocate a new PLAB for each promotion attempt. IIRC,
>> PLAB allocation grabs a real lock (since it happens so rarely :-). In the
>> promotion failure case, that lock could get incandescent. Maybe it's gone
>> unnoticed because for modest young generations it doesn't stay hot enough
>> for long enough for people to witness the supernova? Having a young
>> generation the size you do would exacerbate the problem. If you have lots
>> of workers, that would increase the amount of contention, too.
>
>
> Yes, that's exactly my thinking too. In the case of CMS, the PLABs are
> "local free block lists", and allocation from the shared global pool is
> even worse, far more heavyweight than an atomic pointer bump, with a
> lock protecting several layers of checks.
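>
> Schematically, the per-object slow path is something like this (an
> illustrative sketch, not the actual HotSpot code; plab_allocate,
> refill_plab_from_pool and PromotionPool_lock are invented names):
>
>   // Called by each scavenge worker for every object it tries to promote.
>   HeapWord* allocate_in_old(size_t word_sz) {
>     HeapWord* dest = plab_allocate(word_sz);   // lock-free bump in the local PLAB
>     if (dest == NULL) {                        // PLAB exhausted: go to the shared pool
>       MutexLocker ml(PromotionPool_lock);      // global lock, several layers of checks
>       dest = refill_plab_from_pool(word_sz);   // NULL when the old gen is full
>     }
>     return dest;                               // NULL means promotion failure
>   }
>
> Once the old gen is exhausted, every worker takes that lock on every
> promotion attempt just to rediscover the failure, which also fits the
> high context switching I mentioned.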
>
>>
>>
>> PLAB allocation might be a place where you could put a test for having
>> failed promotion, so just return null and let the worker self-loop this
>> instance. That would keep the test off the fast-path (when things are going
>> well).
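>>
>> Something like this, say (sketch only; _promotion_failed and
>> refill_plab_from_pool are invented names):
>>
>>   HeapWord* allocate_plab(size_t word_sz) {
>>     if (_promotion_failed) {     // set by the first worker that fails
>>       return NULL;               // fail fast; never touch the lock again
>>     }
>>     return refill_plab_from_pool(word_sz);  // existing locked slow path
>>   }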
>
>
> Yes, that's a good idea, might well be sufficient, and was also my first
> thought. However, I also wonder whether just moving the promotion
> failure test (a volatile read) into the fast path of the copy routine,
> and immediately failing all subsequent copies after the first failure
> (with the global flag propagating that failure across all the workers
> immediately), wouldn't be quicker, without having added much to the
> fast path. In that case it seems we may even be able to avoid the
> self-looping and the subsequent single-threaded fixup. The first thread
> that fails sets the volatile global, so every subsequent copy of an
> uncopied object, in any thread, fails artificially. Any object
> reference found pointing to an object in Eden or From space that hasn't
> yet been copied will call the copy routine, which will (artificially)
> fail and return the original address.
>
> I'll do some experiments; devils may lurk in the details, but it seems
> to me that this will work and be much more efficient in the slow case,
> without making the fast path much slower.
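>
> In sketch form the copy routine would become something like this
> (illustrative only; copy_and_forward stands in for the real
> copy/promote logic, which also resolves copy races over the
> forwarding pointer):
>
>   static volatile bool _promotion_failed = false;  // one global, read by every worker
>
>   oop copy_to_survivor_space(oop old) {
>     if (old->is_forwarded()) {
>       return old->forwardee();      // already copied somewhere; keep that copy
>     }
>     if (_promotion_failed) {
>       return old;                   // artificial failure: object stays in place
>     }
>     oop new_obj = copy_and_forward(old);  // normal copy into To space or the old gen
>     if (new_obj == NULL) {                // first genuine failure anywhere
>       _promotion_failed = true;           // propagate to all workers at once
>       return old;
>     }
>     return new_obj;
>   }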
>
>>
>>
>> I'm still guessing.
>
>
> Your guesses are good, and very helpful, and I think we are on the right
> track with this one as regards the cause of the slowdown.
>
> I'll update.
>
> -- ramki
>
>>
>>
>>
>> ... peter
>>
>> Srinivas Ramakrishna wrote:
>>>
>>> System data show high context switching in the vicinity of the event,
>>> and point at the futile-allocation bottleneck as a theory with some
>>> legs...
>>>
>>> more later.
>>> -- ramki
>>>
>>> On Thu, Oct 18, 2012 at 3:47 PM, Srinivas Ramakrishna
>>> <ysr1729 at gmail.com> wrote:
>>>
>>> Thanks Peter... the possibility of paging, or of a related VM-system
>>> issue, did occur to me, especially because system time shows up as
>>> somewhat high here. The problem is that this server runs without
>>> swap :-) so the time is going elsewhere.
>>>
>>> The cache miss theory is interesting (but that time would not show up
>>> as system time), and your back-of-the-envelope calculation gives
>>> about 0.8 us for fetching a cache line, although I am pretty sure the
>>> hardware prefetcher would figure out the misses and stream in the
>>> cache lines, since, as you say, we are going in address order. I'd
>>> expect it to be no worse than when we do an initial-mark pause on a
>>> full Eden, give or take a little, and this is some 30x worse.
>>>
>>> One possibility I am looking at is the part where we self-loop. I
>>> suspect the ParNew/CMS combination running with multiple worker
>>> threads is hit hard here if the failure happens very early. From what
>>> I saw of that code recently, we don't consult the flag that says we
>>> have failed (and hence should just return and self-loop); rather, we
>>> retry allocation for each subsequent object, fail that, and then do
>>> the self-loop. The repeated failed attempts might be adding up,
>>> especially since each one involves looking at the shared pool. I'll
>>> look at how that is done and see if we can fail fast after the first
>>> failure, rather than try to do the rest of the scavenge, since we'll
>>> need to do a fixup anyway.
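>>>
>>> (By "self-loop" I mean the usual promotion-failure trick of pointing
>>> the object's forwarding pointer at itself, roughly
>>>
>>>   obj->forward_to(obj);  // forwardee == obj marks "failed, stayed in place"
>>>
>>> so the later single-threaded fixup pass can tell which objects never
>>> moved.)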
>>>
>>> Thanks for the discussion, and I'll update as and when I do some more
>>> investigation. Keep those ideas coming, and I'll submit a bug report
>>> once I have spent a few more cycles looking at the available data and
>>> ruminating.
>>>
>>> - ramki
>>>
>>>
>>> On Thu, Oct 18, 2012 at 1:20 PM, Peter B. Kessler
>>> <Peter.B.Kessler at oracle.com> wrote:
>>>
>>> IIRC, promotion failure still has to finish the evacuation
>>> attempt (and some objects may get promoted while the ones that
>>> fail get self-looped). That part is the usual multi-threaded
>>> object graph walk, with failed PLAB allocations thrown in to
>>> slow you down. Then you get to start the pass that deals with
>>> the self-loops, which you say is single-threaded. Undoing the
>>> self-loops is in address order, but it walks by the object
>>> sizes, so probably it mostly misses in the cache. 40GB at the
>>> average object size (call them 40 bytes to make the math easy)
>>> is a lot of cache misses. How fast is your memory system?
>>> Probably faster than (10 minutes / (40 GB / 40 bytes)) per cache
>>> miss.
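>>>
>>> (Worked out: 10 minutes = 600 s, and 40 GB / 40 bytes = 10^9
>>> objects, so that expression comes to 600 s / 10^9 = 600 ns per
>>> object, a bound that a miss all the way to main memory beats
>>> comfortably.)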
>>>
>>> Is it possible you are paging? Maybe not when things are running
>>> smoothly, but maybe a 10-minute stall on one service causes things
>>> to back up on (and grow the heaps of) other services on the same
>>> machine? I'm guessing.
>>>
>>> ... peter
>>>
>>> Srinivas Ramakrishna wrote:
>>>
>>>
>>> Has anyone come across extremely long (upwards of 10
>>> minutes) promotion failure unwinding scenarios when using
>>> any of the collectors, but especially with ParNew/CMS?
>>> I recently came across one such occurrence with ParNew/CMS
>>> that, with a 40 GB young gen, took upwards of 10 minutes to
>>> "unwind". I looked through the code and I can see
>>> that the unwinding steps can be a source of slowdown as we
>>> iterate single-threaded (DefNew) through the large Eden to
>>> fix up self-forwarded objects, but that still wouldn't
>>> seem to explain such a large pause, even with a 40 GB young
>>> gen. I am looking through the promotion failure paths to see
>>> what might be the cause of such a large pause,
>>> but if anyone has experienced this kind of scenario before
>>> or has any conjectures or insights, I'd appreciate it.
>>>
>>> thanks!
>>> -- ramki
>>>