Extremely long ParNew/CMS promotion failure scenario?

Srinivas Ramakrishna ysr1729 at gmail.com
Fri Oct 19 14:14:51 PDT 2012


Hi John -- Interesting... I was not aware of this issue.
The kernel version I was running is 2.6.18 and, AFAICT, THP was introduced
in 2.6.38, yes? We also do not explicitly enable huge pages, although we
probably should. The interesting part is that even though there is fairly
high system time, it accounts for only a quarter of the elapsed time, so it
can't just be huge page coalescing getting in the way; and even if it is,
there must be something bigger afoot here that accounts for the rest of the
time. I checked meminfo and it showed no huge pages (free, used, or
reserved) on the system.

thanks!
-- ramki

On Fri, Oct 19, 2012 at 11:13 AM, John O'Brien <jobrien at ieee.org> wrote:

> Srinivas,
>
> I am interested in how this is resolved. Can I clarify that you are
> referring to the GC-- events in the GC log, where the minor GC unwinds
> and turns into a major GC?
>
> I'd like to figure out whether this is something different from what I've
> seen before. I have seen GC times blow out when Transparent Huge Pages
> interfered by doing its coalescing at the same time as a GC. Can you
> clarify what kernel version you are on and whether huge/large pages are
> enabled?
>
> This may not help you directly, but it will help me follow the discussion
> better.
>
> Thanks,
> John
>
> On Fri, Oct 19, 2012 at 6:36 AM, Charlie Hunt <chunt at salesforce.com> wrote:
> > Interesting discussion. :-)
> >
> > Ramki's observation of high context switches suggests to me active
> > locking as a possible culprit.  FWIW, based on your discussion, it looks
> > like you're headed down a path that makes sense.
> >
> > charlie...
> >
> > On Oct 19, 2012, at 3:40 AM, Srinivas Ramakrishna wrote:
> >
> >
> >
> > On Thu, Oct 18, 2012 at 5:27 PM, Peter B. Kessler
> > <Peter.B.Kessler at oracle.com> wrote:
> >>
> >> When there's no room in the old generation and a worker has filled its
> >> PLAB to capacity, but it still has instances to try to promote, does it
> >> try to allocate a new PLAB, and fail?  That would lead to each of the
> >> workers eventually failing to allocate a new PLAB for each promotion
> >> attempt.  IIRC, PLAB allocation grabs a real lock (since it happens so
> >> rarely :-).  In the promotion failure case, that lock could get
> >> incandescent.  Maybe it's gone unnoticed because for modest young
> >> generations it doesn't stay hot enough for long enough for people to
> >> witness the supernova?  Having a young generation the size you do would
> >> exacerbate the problem.  If you have lots of workers, that would
> >> increase the amount of contention, too.
> >
> >
> > Yes, that's exactly my thinking too. For the case of CMS, the PLABs are
> > "local free block lists", and allocation from the shared global pool is
> > even worse and more heavyweight than an atomic pointer bump, with a lock
> > protecting several layers of checks.
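> >
> > (To illustrate the shape of that contention: a standalone C++ sketch,
> > not the actual HotSpot code -- FreeBlockPool, refill and the field
> > names are all invented. The point is that the happy path is a
> > thread-local pointer bump, while every refill, and hence every futile
> > promotion attempt once the old gen is full, goes through a shared
> > lock.)
> >
> >   #include <cstddef>
> >   #include <mutex>
> >
> >   // Thread-local promotion buffer: the happy path is a lock-free bump.
> >   struct PLAB {
> >     char* top = nullptr;
> >     char* end = nullptr;
> >     void* allocate(std::size_t n) {
> >       if (top != nullptr && top + n <= end) {
> >         void* p = top;
> >         top += n;
> >         return p;
> >       }
> >       return nullptr;           // buffer exhausted: caller must refill
> >     }
> >   };
> >
> >   // Shared old-gen free space (invented): every refill serializes on
> >   // one mutex and inspects shared state while holding it.
> >   struct FreeBlockPool {
> >     std::mutex lock;
> >     char* cursor = nullptr;
> >     char* limit  = nullptr;
> >     bool refill(PLAB& plab, std::size_t plab_size) {
> >       std::lock_guard<std::mutex> g(lock);
> >       if (cursor == nullptr || cursor + plab_size > limit) {
> >         return false;           // old gen effectively full
> >       }
> >       plab.top = cursor;
> >       plab.end = cursor + plab_size;
> >       cursor  += plab_size;
> >       return true;
> >     }
> >   };
> >
> >   // Once refill() starts returning false, every further promotion
> >   // attempt by every worker still funnels through that mutex just to
> >   // rediscover that there is no space.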
> >
> >>
> >>
> >> PLAB allocation might be a place where you could put a test for having
> >> failed promotion, so just return null and let the worker self-loop this
> >> instance.  That would keep the test off the fast path (when things are
> >> going well).
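> >>
> >> (A rough sketch of where such a test could sit -- invented names
> >> again, not the real code: the flag is consulted only on the
> >> already-slow refill path, so the copy fast path is untouched.)
> >>
> >>   #include <atomic>
> >>   #include <cstddef>
> >>
> >>   std::atomic<bool> promotion_failed{false};
> >>
> >>   // Stub standing in for a lock-protected shared-pool allocation.
> >>   void* shared_pool_try_allocate(std::size_t) { return nullptr; }
> >>
> >>   // Called only when a worker's PLAB is exhausted (already off the
> >>   // fast path).  After the first worker sees the old gen full, later
> >>   // refill requests fail immediately, without touching the shared
> >>   // pool, and the caller self-loops the instance.
> >>   void* plab_refill_or_null(std::size_t plab_size) {
> >>     if (promotion_failed.load(std::memory_order_relaxed)) {
> >>       return nullptr;                      // fail fast, no lock
> >>     }
> >>     void* block = shared_pool_try_allocate(plab_size);
> >>     if (block == nullptr) {
> >>       promotion_failed.store(true, std::memory_order_relaxed);
> >>     }
> >>     return block;
> >>   }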
> >
> >
> > Yes, that's a good idea and might well be sufficient; it was also my
> > first thought. However, I also wonder whether just moving the promotion
> > failure test (a volatile read) into the fast path of the copy routine,
> > and immediately failing all subsequent copies after the first failure
> > (and indeed, via the global flag, propagating that failure across all
> > the workers immediately), wouldn't just be quicker without adding that
> > much to the fast path. It seems that in that case we may even be able
> > to avoid the self-looping and the subsequent single-threaded fixup. The
> > first thread that fails sets the volatile global, so any subsequent
> > thread artificially fails all subsequent copies of uncopied objects.
> > Any object reference found pointing to an object in Eden or From space
> > that hasn't yet been copied will call the copy routine, which will
> > (artificially) fail and return the original address.
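> >
> > (To make that concrete: a standalone sketch under invented names
> > (copy_or_fail, try_copy, oop_t), not the actual ParNew code, with the
> > "volatile global" modeled as a std::atomic<bool>.  The only cost added
> > to the fast path is one load of the flag; after the first failure
> > every worker hands back the original address immediately.)
> >
> >   #include <atomic>
> >
> >   struct oop_t { /* object header etc. elided */ };
> >
> >   std::atomic<bool> promotion_failed{false};   // the global flag
> >
> >   // Stub standing in for the real copy/promote logic; returns null
> >   // when neither the survivor space nor the old gen has room.
> >   oop_t* try_copy(oop_t* obj) { (void)obj; return nullptr; }
> >
> >   // Invoked for every reference to a not-yet-copied object in Eden
> >   // or From space.
> >   oop_t* copy_or_fail(oop_t* obj) {
> >     // Fast path: a single flag load before attempting the copy.
> >     if (promotion_failed.load(std::memory_order_relaxed)) {
> >       return obj;                    // artificial failure: keep the
> >     }                                // original address
> >     oop_t* copied = try_copy(obj);   // may fail once old gen is full
> >     if (copied == nullptr) {
> >       promotion_failed.store(true, std::memory_order_relaxed);
> >       return obj;                    // first real failure sets flag
> >     }
> >     return copied;
> >   }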
> >
> > I'll do some experiments and there may lurk devils in the details, but it
> > seems to me that this will work and be much more efficient in the
> > slow case, without making the fast path that much slower.
> >
> >>
> >>
> >> I'm still guessing.
> >
> >
> > Your guesses are good, and very helpful, and I think we are on the right
> > track with this one as regards the cause of the slowdown.
> >
> > I'll update.
> >
> > -- ramki
> >
> >>
> >>
> >>
> >>                         ... peter
> >>
> >> Srinivas Ramakrishna wrote:
> >>>
> >>> System data show high context switching in the vicinity of the event
> >>> and point at the futile-allocation bottleneck as a possible theory
> >>> with some legs....
> >>>
> >>> more later.
> >>> -- ramki
> >>>
> >>> On Thu, Oct 18, 2012 at 3:47 PM, Srinivas Ramakrishna
> >>> <ysr1729 at gmail.com> wrote:
> >>>
> >>>     Thanks Peter... the possibility of paging or a related VM-system
> >>>     issue did occur to me, especially because system time shows up as
> >>>     somewhat high here. The problem is that this server runs without
> >>>     swap :-) so the time is going elsewhere.
> >>>
> >>>     The cache miss theory is interesting (but would not show up as
> >>>     system time), and your back-of-the-envelope calculation gives
> >>>     about 0.8 us for fetching a cache line, although I am pretty sure
> >>>     the cache miss predictor would figure out the misses and stream in
> >>>     the cache lines (since, as you say, we are going in address
> >>>     order). I'd expect it to be no worse than when we do an "initial
> >>>     mark pause on a full Eden", give or take a little, and this is
> >>>     some 30x worse.
> >>>
> >>>     One possibility I am looking at is the part where we self-loop. I
> >>>     suspect the ParNew/CMS combination running with multiple worker
> >>>     threads is hit hard here if the failure happens very early, say --
> >>>     from what I saw of that code recently, we don't consult the flag
> >>>     that says we failed and should just return and self-loop. Rather,
> >>>     we retry allocation for each subsequent object, fail that, and
> >>>     then do the self-loop. The repeated failed attempts might be
> >>>     adding up, especially since the access involves looking at the
> >>>     shared pool. I'll look at how that is done, and see if we can do a
> >>>     fast fail after the first failure happens, rather than try to do
> >>>     the rest of the scavenge, since we'll need to do a fixup anyway.
> >>>
> >>>     Thanks for the discussion, and I'll update as and when I do some
> >>>     more investigation. Keep those ideas coming, and I'll submit a bug
> >>>     report once I have spent a few more cycles looking at the
> >>>     available data and ruminating.
> >>>
> >>>     - ramki
> >>>
> >>>
> >>>     On Thu, Oct 18, 2012 at 1:20 PM, Peter B. Kessler
> >>>     <Peter.B.Kessler at oracle.com> wrote:
> >>>
> >>>         IIRC, promotion failure still has to finish the evacuation
> >>>         attempt (and some objects may get promoted while the ones that
> >>>         fail get self-looped).  That part is the usual multi-threaded
> >>>         object graph walk, with failed PLAB allocations thrown in to
> >>>         slow you down.  Then you get to start the pass that deals with
> >>>         the self-loops, which you say is single-threaded.  Undoing the
> >>>         self-loops is in address order, but it walks by the object
> >>>         sizes, so probably it mostly misses in the cache.  40 GB at
> >>>         the average object size (call them 40 bytes to make the math
> >>>         easy) is a lot of cache misses.  How fast is your memory
> >>>         system?  Probably faster than (10 minutes / (40 GB / 40
> >>>         bytes)) per cache miss.
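> >>>
> >>>         (Spelling that formula out with round numbers: 40 GB at 40
> >>>         bytes per object is on the order of 1e9 objects, and 10
> >>>         minutes is 600 s, so the pause works out to roughly
> >>>         600 s / 1e9 ~ 0.6 us per object -- the per-miss figure being
> >>>         compared against the memory system's actual latency.)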
> >>>
> >>>         Is it possible you are paging?  Maybe not when things are
> >>>         running smoothly, but maybe a 10-minute stall on one service
> >>>         causes things to back up (and grow the heaps of) other
> >>>         services on the same machine?  I'm guessing.
> >>>
> >>>                                 ... peter
> >>>
> >>>         Srinivas Ramakrishna wrote:
> >>>
> >>>
> >>>             Has anyone come across extremely long (upwards of 10
> >>>             minutes) promotion failure unwinding scenarios when using
> >>>             any of the collectors, but especially with ParNew/CMS?
> >>>             I recently came across one such occurrence with ParNew/CMS
> >>>             that, with a 40 GB young gen, took upwards of 10 minutes
> >>>             to "unwind". I looked through the code and I can see that
> >>>             the unwinding steps can be a source of slowdown as we
> >>>             iterate single-threaded (DefNew) through the large Eden to
> >>>             fix up self-forwarded objects, but that still wouldn't
> >>>             seem to explain such a large pause, even with a 40 GB
> >>>             young gen. I am looking through the promotion failure
> >>>             paths to see what might be the cause of such a large
> >>>             pause, but if anyone has experienced this kind of scenario
> >>>             before or has any conjectures or insights, I'd appreciate
> >>>             it.
> >>>
> >>>             thanks!
> >>>             -- ramki
> >>>
> >>>
> >>>
> >
> > _______________________________________________
> > hotspot-gc-use mailing list
> > hotspot-gc-use at openjdk.java.net
> > http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
> >
>

