Extremely long parnew/cms promotion failure scenario?

Peter B. Kessler Peter.B.Kessler at Oracle.COM
Tue Oct 30 13:04:55 PDT 2012


Some comments inline.

          ... peter

Srinivas Ramakrishna wrote:
> Hi Peter, all --
>
> On Fri, Oct 19, 2012 at 1:40 AM, Srinivas Ramakrishna 
> <ysr1729 at gmail.com> wrote:
>
>
>
>     On Thu, Oct 18, 2012 at 5:27 PM, Peter B. Kessler
>     <Peter.B.Kessler at oracle.com> wrote:
>
>         When there's no room in the old generation and a worker has
>         filled its PLAB to capacity, but it still has instances to try
>         to promote, does it try to allocate a new PLAB, and fail?
>          That would lead to each of the workers eventually failing to
>         allocate a new PLAB for each promotion attempt.  IIRC, PLAB
>         allocation grabs a real lock (since it happens so rarely :-).
>          In the promotion failure case, that lock could get
>         incandescent.  Maybe it's gone unnoticed because for modest
>         young generations it doesn't stay hot enough for long enough
>         for people to witness the supernova?  Having a young
>         generation the size you do would exacerbate the problem.  If
>         you have lots of workers, that would increase the amount of
>         contention, too.
>
>
>     Yes, that's exactly my thinking too. For the case of CMS, the
>     PLAB's are "local free block lists" and the allocation from the
>     shared global pool is
>     even worse and more heavyweight than an atomic pointer bump, with
>     a lock protecting several layers of checks.
>      
>
>
>         PLAB allocation might be a place where you could put a test
>         for having failed promotion, so just return null and let the
>         worker self-loop this instance.  That would keep the test off
>         the fast-path (when things are going well).
>
>
>     Yes, that's a good idea and might well be sufficient, and was also
>     my first thought. However, I also wonder whether just moving the
>     promotion failure test (a volatile read) into the fast path of the
>     copy routine, and immediately failing all subsequent copies after the
>     first failure (and, via the global flag, propagating that failure
>     across all the workers immediately), won't be quicker anyway, without
>     adding that much to the fast path. It seems
>     that in that case we may be able to even avoid the self-looping
>     and the subsequent single-threaded fixup. The first thread that
>     fails sets the volatile
>     global, so any subsequent thread artificially fails all subsequent
>     copies of uncopied objects. Any object reference found pointing to
>     an object in Eden
>     or From space that hasn't yet been copied will call the copy
>     routine which will (artificially) fail and return the original
>     address.
>
>     I'll do some experiments and there may lurk devils in the details,
>     but it seems to me that this will work and be much more efficient
>     in the
>     slow case, without making the fast path that much slower.
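For concreteness, here is a minimal standalone sketch of that idea, using
simplified stand-in types rather than the real ParNew/CMS code (the names
promotion_failed, copy_to_survivor_or_promote, and the stubs below are all
hypothetical). The only point is that a single acquire-load of a global flag
on the copy fast path lets every worker fail all subsequent copies and hand
back the original address:

  #include <atomic>
  #include <cstddef>
  #include <cstring>

  // Hypothetical, simplified stand-ins for HotSpot types and helpers.
  struct oop_desc { size_t header; };
  typedef oop_desc* oop;

  // Set by the first worker whose promotion allocation fails; read with
  // acquire semantics on every subsequent copy attempt.
  static std::atomic<bool> promotion_failed{false};

  // Stubs standing in for PLAB/old-gen allocation and the per-worker
  // work stack.
  static oop  allocate_for_copy(size_t /*word_size*/) { return nullptr; }
  static void push_on_work_stack(oop /*obj*/)         {}

  // Sketch: the promotion-failure test as one load on the copy fast path.
  oop copy_to_survivor_or_promote(oop from, size_t word_size) {
    if (promotion_failed.load(std::memory_order_acquire)) {
      // Some worker already failed: artificially fail this copy too and
      // return the original address, so the caller keeps the old reference.
      return from;
    }
    oop to = allocate_for_copy(word_size);
    if (to == nullptr) {
      // First failure: publish it so all workers start failing fast.
      promotion_failed.store(true, std::memory_order_release);
      return from;
    }
    std::memcpy(to, from, word_size * sizeof(size_t));
    push_on_work_stack(to);
    return to;
  }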
>
>
> I got back to thinking about this issue with a view to implementing
> the second approach above, but ran into an obvious problem that should
> have occurred to me earlier. We need a way of terminating the failure scan
> while at the same time making sure to update all references that point 
> at old copies
> of objects that have already been copied.
>
> We can do one or the other easily, but not both at the same time
> without the linear walk of Eden and From spaces, which I had wanted to
> avoid on account of its single-threadedness.
>
> (1) The simplest termination protocol is that once each thread sees that
>      promotion has failed, it does not attempt to copy the target object
>      (if it has not yet been copied), but simply returns the object from
>      the copy routine. It also does not push the object on its work stack.
>      Thus the workers will eventually terminate their scan of the
>      card-table and of the objects on their work stacks, after having
>      updated any references to moved objects encountered during the
>      remainder of the scan. This can, however, still leave a subset of the
>      references in young-generation objects that were not copied pointing
>      to the old copies of objects that were copied. Fixing those would
>      require a (linear) scan of Eden and the From survivor space (which
>      today is single-threaded, although it could conceivably be made
>      multi-threaded by remembering TLAB allocation boundaries and using
>      those as the parallel chunks of work).
Not having block-offset tables for the eden and survivor spaces makes it 
hard to parallelize that walk.  Remembering TLAB boundaries (as object 
starts, as you suggest) is a good idea, except if humongous objects are 
allocated directly in the eden (or from) without the benefit of a TLAB 
allocation.  But that doesn't seem that hard to work into a scheme that 
remembers TLAB boundaries, since it happens infrequently.  (Don't forget 
the To-space: it may also have objects that were copied to it, but which 
won't, in one of your schemes, necessarily have their contents updated.)
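A rough sketch of what "remember TLAB boundaries and use them as the parallel
chunks" could look like, with entirely made-up names (ChunkTable,
record_chunk_start, parallel_fixup_worker); the real code would have to hook
TLAB retirement and direct (non-TLAB) eden/from allocation, and would need the
same treatment for To-space as noted above:

  #include <algorithm>
  #include <atomic>
  #include <cstdint>
  #include <mutex>
  #include <vector>

  // Hypothetical chunk table: every recorded address is a known object
  // start (a TLAB start, or a humongous object allocated directly in
  // eden/from without a TLAB), so [starts[i], starts[i+1]) can be walked
  // by different workers.
  class ChunkTable {
   public:
    // Called at TLAB refill time and at direct allocation time; both are
    // rare enough that a plain lock is tolerable here.
    void record_chunk_start(uintptr_t addr) {
      std::lock_guard<std::mutex> g(_lock);
      _starts.push_back(addr);
    }
    // Called single-threaded just before the parallel fixup walk.
    std::vector<uintptr_t> sorted_starts(uintptr_t space_end) const {
      std::lock_guard<std::mutex> g(_lock);
      std::vector<uintptr_t> s(_starts);
      s.push_back(space_end);              // sentinel closing the last chunk
      std::sort(s.begin(), s.end());
      return s;
    }
   private:
    mutable std::mutex _lock;
    std::vector<uintptr_t> _starts;
  };

  // Each GC worker claims the next chunk index and walks the objects in
  // [starts[i], starts[i+1]), starting from a known object boundary.
  void parallel_fixup_worker(const std::vector<uintptr_t>& starts,
                             std::atomic<size_t>& next_chunk,
                             void (*walk_chunk)(uintptr_t from, uintptr_t to)) {
    for (size_t i = next_chunk.fetch_add(1);
         i + 1 < starts.size();
         i = next_chunk.fetch_add(1)) {
      walk_chunk(starts[i], starts[i + 1]);
    }
  }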

>
> (2) If we didn't want to do the linear scan mentioned above, then at
>      termination we should have scanned all reachable objects in the young
>      gen. In that case, to terminate the scan we would have to remember
>      when the tracing had already scanned an object, so the trace would not
>      need to process that object again. Previously, this was done by
>      looking at the location of the object (not in Eden or From space) or
>      at its header (a forwarding pointer). The use of a self-loop
>      forwarding pointer for failed copies requires another pass to clear
>      it, which is exactly the linear scan of Eden and From spaces that we
>      wanted to avoid above. The self-looping also requires evacuation and
>      the later rematerialization of the header contents during the scan,
>      both of which carry additional expense. Even if we had a spare header
>      bit -- which we well might in the 64-bit case, I think, although I
>      haven't looked recently -- we would still need to clear that bit
>      without doing a linear pass over Eden and From spaces. We could
>      conceivably do so as part of the later full gc that is to follow
>      shortly, but that would add at least some bulk, via an extra test, to
>      the full gc closures.
Non-prototypical headers have to be evacuated and rematerialized any 
time you want to install forwarding pointers.  Worrying about them might 
be a red herring.  It would be interesting to have some numbers on how 
frequent they are ("It depends.") and how much time and space it takes 
to displace them.
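To make the self-loop and displaced-header bookkeeping being discussed more
concrete, here is a toy sketch with a deliberately simplified mark word and a
side map standing in for the spooling areas; the real HotSpot encoding (the
forwarding pointer lives in the mark word itself, and spooling uses
PromotionInfo blocks) is more involved, so treat every name here as made up:

  #include <cstdint>
  #include <unordered_map>

  // Toy mark word: "prototypical" here means no identity hash and not
  // locked, so the header can be overwritten and recreated from the
  // prototype later.
  struct MarkWord {
    uintptr_t bits;
    bool is_prototypical() const { return bits == 0; }
  };

  struct Object {
    MarkWord mark;
    Object*  forwardee;   // stands in for the forwarding pointer in the header
  };

  // Side table standing in for the spooled (displaced) headers.
  static std::unordered_map<Object*, MarkWord> displaced_headers;

  // On a failed copy: evacuate a non-prototypical header, then self-loop.
  void mark_promotion_failed(Object* obj) {
    if (!obj->mark.is_prototypical()) {
      displaced_headers[obj] = obj->mark;   // must be rematerialized later
    }
    obj->forwardee = obj;                   // self-loop == "copy failed"
  }

  // The later pass over eden/from undoes the self-loop and restores any
  // displaced header (this is the linear pass scheme (2) wants to avoid).
  void unwind_promotion_failure(Object* obj) {
    if (obj->forwardee == obj) {
      obj->forwardee = nullptr;
      auto it = displaced_headers.find(obj);
      if (it != displaced_headers.end()) {
        obj->mark = it->second;
        displaced_headers.erase(it);
      }
    }
  }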

>
> So it almost seems as though the later scan of Eden and From spaces is 
> unavoidable, no matter what. Unless someone is able to think of a clever
> way of avoiding it while still quickly terminating the tracing after 
> signalling global promotion failure.
If you can find two bits instead of just one, you could have an epoch 
counter (modulo 3) that would tell you whether you need to look at any 
particular object during this collection, or reset the bits.  Does that 
help?  I'm not sure what you are using these bits for.  Are you 
expecting back-to-back promotion failures?  Even so, there would be a 
full re-allocation of the eden in between, which would reset the bits in 
the newly-allocated objects.  Keeping the bits outside of the heap would 
let you clear them more efficiently, at the cost of an additional memory 
stall whenever you had to read both the bit for an object and the object 
itself.  If you just read in address-order through the bits and through 
memory, maybe the prefetchers (software and hardware) could be made to work.
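A small sketch of the "bits outside the heap, with an epoch" idea. The
encoding below (a byte per heap word holding the epoch of the collection that
last marked it) is purely illustrative, not what any HotSpot collector does;
a real version would also have to handle epoch wrap-around, e.g. by clearing
the table when the counter wraps:

  #include <cstdint>
  #include <vector>

  // Side table kept outside the heap: one byte per heap word, holding the
  // epoch (collection number) in which an object starting at that word was
  // last marked. Because marks are tested against the current epoch, the
  // table does not need to be cleared between collections.
  class EpochMarkTable {
   public:
    EpochMarkTable(uintptr_t heap_base, size_t heap_words)
        : _base(heap_base), _epochs(heap_words, 0), _current_epoch(0) {}

    // Bump the epoch at the start of each young collection.
    // (Wrap-around would need handling in a real implementation.)
    void begin_collection() { ++_current_epoch; }

    void mark(uintptr_t obj_addr) {
      _epochs[index_for(obj_addr)] = _current_epoch;
    }
    bool is_marked_this_collection(uintptr_t obj_addr) const {
      return _epochs[index_for(obj_addr)] == _current_epoch;
    }

   private:
    size_t index_for(uintptr_t addr) const {
      return (addr - _base) / sizeof(uintptr_t);   // one slot per heap word
    }
    uintptr_t _base;
    std::vector<uint8_t> _epochs;
    uint8_t _current_epoch;
  };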

I don't like having to look at dead objects in the young generation, 
because there are so many of them.  (Maybe especially in a 40GB young 
generation!)  But I also can't see a way around it (yet :-).  At least 
in the parallel collector, one point of the full young generation scan 
is to change the klass for (runs of) dead objects to be, if I recall 
correctly, int[], so the full collection that's coming doesn't have to 
(redundantly) look at the individual dead objects, and doesn't have to 
think about whether there are any interior pointers in that region.  
Does it help to know that the immediately following full collection is 
going to iterate over all the objects in the heap, dead or alive?  
Tangling the code for the old generation collector with the code for the 
young generation collector doesn't sound like a good idea.
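A hedged sketch of that "turn runs of dead objects into one int[]-like
filler" step, with made-up helper names; the real parallel-scavenge code uses
HotSpot's filler-object machinery and exact array-length encoding, which this
glosses over:

  #include <cstddef>
  #include <cstdint>

  typedef uintptr_t HeapWord;     // one heap word, as in HotSpot's convention

  // Simplified stand-ins; in HotSpot the filler is an int[] whose length is
  // chosen so that the array exactly covers the range being filled.
  struct Klass { const char* name; };
  static Klass int_array_klass = { "[I" };

  struct ObjectHeader {
    Klass* klass;
    size_t word_size;             // stands in for the encoded array length
  };

  // Overwrite [start, start + words) with a single filler object, so the
  // following full collection can step over the whole range in one hop and
  // never has to consider interior pointers inside it.
  static void fill_dead_range(HeapWord* start, size_t words) {
    ObjectHeader* filler = reinterpret_cast<ObjectHeader*>(start);
    filler->klass = &int_array_klass;
    filler->word_size = words;
  }

  // Linear walk of a space: live objects are skipped, and each maximal run
  // of dead objects is collapsed into one filler.
  void coalesce_dead_runs(HeapWord* bottom, HeapWord* top,
                          bool (*is_live)(HeapWord*),
                          size_t (*size_in_words)(HeapWord*)) {
    HeapWord* cur = bottom;
    while (cur < top) {
      if (is_live(cur)) {
        cur += size_in_words(cur);
        continue;
      }
      HeapWord* dead_start = cur;
      while (cur < top && !is_live(cur)) {
        cur += size_in_words(cur);    // still walks by individual dead sizes
      }
      fill_dead_range(dead_start, static_cast<size_t>(cur - dead_start));
    }
  }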

          ... peter
>
> Given that, my first thought was to use the approach outlined in (1),
> where we terminate quickly, then do a linear walk of Eden and From and
> fix up any stale references that were left pointing at old copies of
> successfully copied objects. The advantage of that appears to be
> two-fold: first, the algorithm terminates the trace very quickly;
> second, it does not pay any time or space penalty for self-looping and
> header evacuation. The disadvantage, however, is
> that the linear walk might need to do much more work in scanning all 
> objects and fixing their "stale" references, even when those objects may
> themselves be dead.
>
> To evaluate the rough relative tradeoff between the two schemes, one 
> could do the following back of the envelope calculation:
>
> . Suppose that the object graph is random, that y is the total number of
>   objects, dead or alive, in Eden and From space, and that some small
>   fraction s of those objects survive a scavenge on average.
> . Then the work done in the linear walk of scheme (1) is O(y + e), where
>   e is the total number of edges (references) in those objects.
> . In the case of scheme (2), the work done in the linear walk is O(y).
>   The space used and the work done for evacuating the non-prototypical
>   headers would be O(f*s*y), where s and y are as defined above and f is
>   the fraction of survivors that have non-prototypical headers.
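For rough scale (these numbers are illustrative assumptions, not
measurements): a 40 GB young gen at roughly 40 bytes per object holds on the
order of 10^9 objects, so with, say, an average of 5 references per object,
scheme (1)'s fix-up touches about y + e = 6 x 10^9 headers and reference
slots; scheme (2)'s pass reads about 10^9 headers and, with s = 5% survivors
of which f = 10% carry non-prototypical headers, spools only about
f*s*y = 5 x 10^6 displaced headers.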
>
> Also, the multiplicative constant factor in the big-O for scheme (1) 
> seems to be much higher than for scheme (2): while scheme (2) merely 
> examines
> the header of each object and fixes up the self-loop (or clears the 
> header bit, if using a spare header bit), scheme (1) needs to iterate 
> over the references
> in each object, examine the header of the target object of that 
> reference and determine if the reference must be updated, and do so if 
> needed.
>
> Thus although (1) may terminate the trace very quickly, the linear 
> scan of Eden and From spaces for the "fix-up" phase is likely to be 
> much more
> expensive.
>
> It could well be that (2) is the superior scheme under those 
> circumstances, which would bring us back to self-looping (or its moral 
> equivalent, a header bit).
> The header bit avoids the time and space expense of the spooling of 
> non-prototypical headers. If we optimized the acquisition of spooling 
> space for
> headers, we might be almost on par with the header bit use and it 
> would leave the fast path of normal scavenge unaffected as is the case 
> today.
> (Use of the header bit would need an extra test.)
>
> After thinking this over, it appears as though the simplest route is
> to do what Peter mentioned earlier, which is just to give the allocation
> of the promotion buffers fast-fail paths (after promotion failure) that
> do not take locks, and so cut down on that expense.
>
> Let me know if anyone has any comments/feedback/ideas: otherwise I'll 
> go through the CMS (and ParallelOld) allocation paths and place
> "fast-fail gates" along the slow allocation path where we go to refill 
> the local buffers. (Later, we can also see about the possibility of 
> using a header bit
> to signal a failed copy, rather than a self-loop, and see if it 
> fetches any gains by saving on header spooling while not affecting the 
> fast path too much,
> and of multi-threading the "linear-scan" by using TLAB boundaries of 
> the previous epoch. But those would be later optimizations.)
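As a concrete (and hypothetical) sketch of such a fast-fail gate on the
buffer-refill slow path, with stand-in names rather than the actual
CMS/ParallelOld allocation code:

  #include <atomic>
  #include <cstddef>
  #include <mutex>

  static std::atomic<bool> promotion_failed{false};
  static std::mutex        shared_pool_lock;   // the contended lock in question

  // Stub standing in for allocation from the shared old-gen free lists.
  static void* allocate_from_shared_pool(size_t /*word_size*/) { return nullptr; }

  // Slow path: refill a worker's local promotion buffer from the shared pool.
  void* refill_promotion_buffer(size_t word_size) {
    // Fast-fail gate: once any worker has observed promotion failure, do not
    // take the shared lock again; just report failure so the caller treats
    // this (and subsequent) copies as failed.
    if (promotion_failed.load(std::memory_order_relaxed)) {
      return nullptr;
    }
    std::lock_guard<std::mutex> g(shared_pool_lock);
    void* buf = allocate_from_shared_pool(word_size);
    if (buf == nullptr) {
      // First failure: publish it so every other worker fails fast from now on.
      promotion_failed.store(true, std::memory_order_relaxed);
    }
    return buf;
  }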
>
> -- ramki
>
>      
>
>
>         I'm still guessing.
>
>
>     Your guesses are good, and very helpful, and I think we are on the
>     right track with this one as regards the cause of the slowdown.
>
>     I'll update.
>
>     -- ramki
>      
>
>
>
>                                 ... peter
>
>         Srinivas Ramakrishna wrote:
>
>             System data show high context switching in vicinity of
>             event and points at the futile allocation bottleneck as a
>             possible theory with some legs....
>
>             more later.
>             -- ramki
>
>             On Thu, Oct 18, 2012 at 3:47 PM, Srinivas Ramakrishna
>             <ysr1729 at gmail.com> wrote:
>
>                 Thanks Peter... the possibility of paging or related
>             issue of VM
>                 system did occur to me, especially because system time
>             shows up as
>                 somewhat high here. The problem is that this server
>             runs without
>                 swap :-) so the time is going elsewhere.
>
>                 The cache miss theory is interesting (but would not
>             show up as
>                 system time), and your back of the envelope
>             calculation gives about
>                 0.8 us for fetching a cache line (although i am pretty
>             sure the
>                 cache miss predictor would probably figure out the
>             misses and stream
>                 in the
>                 cache lines since as you say we are going in address
>             order). I'd
>                 expect it to be no worse than when we do an "initial
>             mark pause on a
>                 full Eden", give or
>                 take a little, and this is some 30 x worse.
>
>                 One possibility I am looking at is the part where we
>             self-loop. I
>                 suspect the ParNew/CMS combination running with
>             multiple worker threads
>                 is hit hard here, if the failure happens very early
>             say -- from what
>                 i saw of that code recently, we don't consult the flag
>             that says we
>                 failed
>                 so we should just return and self-loop. Rather we
>             retry allocation
>                 for each subsequent object, fail that and then do the
>             self-loop. The
>                 repeated
>                 failed attempts might be adding up, especially since
>             the access
>                 involves looking at the shared pool. I'll look at how
>             that is done,
>                 and see if we can
>                 do a fast fail after the first failure happens, rather
>             than try and
>                 do the rest of the scavenge, since we'll need to do a
>             fixup anyway.
>
>                 thanks for the discussion and i'll update as and when
>             i do some more
>                 investigations. Keep those ideas coming, and I'll
>             submit a bug
>                 report once
>                 i have spent a few more cycles looking at the
>             available data and
>                 ruminating.
>
>                 - ramki
>
>
>                 On Thu, Oct 18, 2012 at 1:20 PM, Peter B. Kessler
>                 <Peter.B.Kessler at oracle.com> wrote:
>
>                     IIRC, promotion failure still has to finish the
>             evacuation
>                     attempt (and some objects may get promoted while
>             the ones that
>                     fail get self-looped).  That part is the usual
>             multi-threaded
>                     object graph walk, with failed PLAB allocations
>             thrown in to
>                     slow you down.  Then you get to start the pass
>             that deals with
>                     the self-loops, which you say is single-threaded.
>              Undoing the
>                     self-loops is in address order, but it walks by
>             the object
>                     sizes, so probably it mostly misses in the cache.
>              40GB at the
>                     average object size (call them 40 bytes to make
>             the math easy)
>                     is a lot of cache misses.  How fast is your memory
>             system?
>                      Probably faster than (10 minutes / (40GB /
>             40bytes)) per cache miss.
>
>                     Is it possible you are paging?  Maybe not when
>             things are
>                     running smoothly, but maybe a 10 minute stall on
>             one service
>                     causes things to back up (and grow the heap of)
>             other services
>                     on the same machine?  I'm guessing.
>
>                                             ... peter
>
>                     Srinivas Ramakrishna wrote:
>
>
>                         Has anyone come across extremely long (upwards
>             of 10
>                         minutes) promotion failure unwinding scenarios
>             when using
>                         any of the collectors, but especially with
>             ParNew/CMS?
>                         I recently came across one such occurrence
>             with ParNew/CMS
>                         that, with a 40 GB young gen took upwards of
>             10 minutes to
>                         "unwind". I looked through the code and I can see
>                         that the unwinding steps can be a source of
>             slowdown as we
>                         iterate single-threaded (DefNew) through the
>             large Eden to
>                         fix up self-forwarded objects, but that still
>             wouldn't
>                         seem to explain such a large pause, even with
>             a 40 GB young
>                         gen. I am looking through the promotion
>             failure paths to see
>                         what might be the cause of such a large pause,
>                         but if anyone has experienced this kind of
>             scenario before
>                         or has any conjectures or insights, I'd
>             appreciate it.
>
>                         thanks!
>                         -- ramki
>
>
>
>                         _______________________________________________
>                         hotspot-gc-use mailing list
>                         hotspot-gc-use at openjdk.java.net
>                         http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>
>
>
>
>
