Strange Long ParNew GC Pauses (Sample Code Included)

Tue Jun 2 17:36:58 UTC 2015

Ramki,

The fix is here

http://cr.openjdk.java.net/~jmasa/8081629/webrev.00/

but is not out for review yet because I'm trying to see
if I can simplify the block of code that does the power
block fixup.   If you have any suggestions, send my way.
Mostly I'd like to get rid of the "more" variable.

Jon

On 06/01/2015 05:45 PM, Srinivas Ramakrishna wrote:
>
> Hi Jon --
>
> I missed this earlier, but just saw the bug you opened for this issue.
> Out of curiosity, is the issue that the fix-up of the "suffix block" 
> doesn't go sufficiently far back in that block, or is it something else?
>
> thanks!
> -- ramki
>
>
> On Thu, May 14, 2015 at 6:35 PM, Jon Masamitsu 
> <jon.masamitsu at oracle.com <mailto:jon.masamitsu at oracle.com>> wrote:
>
>     Jungwoo,
>
>     Thanks for the information.
>
>     There's a bug in the current code which tries to patch up the BOT
>     after
>     a block is split.  I don't understand yet what is wrong but a simple
>     workaround (which is known to be slower than necessary) avoids the
>     much of the long pauses.  After this has been fixed, we should compare
>     the performances of our two implementations.
>
>     Jon
>
>
>
>     On 5/14/2015 2:52 PM, Jungwoo Ha wrote:
>>     We have a faster implementation for card table scan, and on our
>>     implementation ~20s pause goes down to 1~2s with above-mentioned
>>     experiments. It doesn't solve the problem but it alleviates it
>>     (and actually shows that card table scan is faster).
>>     If someone can sponsor this, I will create a patch for the review.
>>
>>
>>     On Wed, May 13, 2015 at 6:36 PM, Srinivas Ramakrishna
>>     <ysr1729 at gmail.com <mailto:ysr1729 at gmail.com>> wrote:
>>
>>         Hi Jon --
>>
>>         thanks, just a comment inline below....
>>
>>         On Wed, May 13, 2015 at 4:06 PM, Jon Masamitsu
>>         <jon.masamitsu at oracle.com <mailto:jon.masamitsu at oracle.com>>
>>         wrote:
>>
>>
>>
>>             On 5/13/2015 11:40 AM, Srinivas Ramakrishna wrote:
>>>             Hi Jon --
>>>
>>>             My expectation is that for any large block spanning
>>>             multiple cards, if you walk the BOT entries for that
>>>             block from lower addresses to higher addresses, the BOT
>>>             entries should be monotone increasing (albeit not
>>>             strictly monotone) until the last card for that block
>>>             (but one). In the example you gave below, is D > C >
>>>             N_words, with the BOT entries in some interior window of
>>>             the same block/object going (low to high addresses in
>>>             the window) like " .... DDDDCCCCDDDD...." ? In other
>>>             words, there are lower-valued entries C within a range
>>>             of higher-valued entries D, with both C and D denoting
>>>             logarithmic entries for the block in question?
>>
>>             Yes, that's what I'm seeing.
>>
>>>
>>>             If you look in the BOT methods, there was a parameter
>>>             "reducing" passed to some of the mark_block() methods
>>>             which was supposed to do some kind of assertion checking
>>>             towards maintenance of this monotonicity, but looking at
>>>             it now, that may not be sufficient and the assert could
>>>             be strengthened at the "end" of the range being marked.
>>>             More worryingly, I believe I had turned off even that
>>>             assert before I left (ca 6uXX?) because it was failing
>>>             at that time, but the corresponding bug hadn't been
>>>             fixed, and seems as though it still isn't. This might
>>>             possibly  have some bearing on the non-monotonicity in
>>>             the entries you see.
>>
>>             Thanks for that bit of information.   So what I'm seeing
>>             is not
>>             necessarily the cause for the long pauses.  But see my
>>             next comment.
>>
>>>
>>>             I'll hopefully get some time to go over this code again.
>>>             I was looking at the code on the train ride into work
>>>             this morning, and trying to remember the details, but
>>>             have not quite delved again into all of its details yet
>>>             or why the assert I mention above should be failing.
>>
>>             After the logarithmic stride back to find the block
>>             containing the
>>             start of the object, there is a object by object walk
>>             forward to find the
>>             to find the start of the object
>>
>>>             558    while (n <= addr) {
>>>                559 debug_only(HeapWord* last = q);   // for debugging
>>>                560      q = n;
>>>                561      n += _sp->block_size(n);
>>             [...]
>>
>>             From instrumentation I think the long pauses originate in
>>             this forward walk
>>             (mostly in the block_size() method).   So maybe the BOT
>>             "DDDDDCCCCC...CDDDD"
>>             indicates that BOT at the start of an objects is being
>>             trashed and we're
>>             going way too far back.  I'm trying to use
>>             VerifyBlockOffsetArray but it seems to
>>             have bit rotted.    I'll keep trying.
>>
>>
>>         Ah, that's what I was afraid of. This must likely mean that
>>         when a large block is split, the suffix from the split
>>         (which is typically returned to the free list) isn't getting
>>         fixed  up as was the intention, or when the sweeper
>>         coalesces smaller blocks into a larger one, the new larger
>>         block isn't getting fixed correctly (although that seems
>>         unlikely). Using VerifyBlockOffsetArray and asserting if the
>>         forward walk is more than a card's worth in length
>>         would seem to be one way to catch that badness closer to when
>>         it happens...
>>
>>         Thanks for the update, Jon!
>>         -- ramki
>>
>>
>>
>>             Jon
>>
>>
>>>
>>>             It might be worthwhile (if it is possible) to do a
>>>             bisect on whatever the first known JDK (7uXX?) where the
>>>             performance regression appears and try to narrow down
>>>             the change that may have caused the regression, using
>>>             James' test program. I have a sneaking suspicion
>>>             (because of the elided assert) that it might well be a
>>>             change I had made towards closing some
>>>             BOT-walking/card-scanning cracks when initializing CMS
>>>             objects ca 6uXX.
>>>
>>>             James, in case you have read thus far, can you provide
>>>             the exact version number of the 6uXX Oracle jdk you used
>>>             that did _not_ show the performance regression for your
>>>             test program?
>>>
>>>             Thanks!
>>>             -- ramki
>>>
>>>             On Tue, May 12, 2015 at 1:50 PM, Jon Masamitsu
>>>             <jon.masamitsu at oracle.com
>>>             <mailto:jon.masamitsu at oracle.com>> wrote:
>>>
>>>                 Ramki,
>>>
>>>                 The logarithmic BOT is taking big steps back to
>>>                 find the start of a large object.  But something is odd.
>>>
>>>                 Is there a reason why the BOT entries would decrease
>>>                 for a while
>>>                 and then increase without a value that gives the
>>>                 start of an object?
>>>
>>>                 &_array->_offset_array[index] = 0xffffffff6550f600
>>>                 "DDDDDDDDDDDDDD ..."
>>>
>>>                 and then down the road.
>>>
>>>                  print &_array->_offset_array[0x110600U],
>>>                 _array->_offset_array[0x10f600U >
>>>                 &_array->_offset_array[0x110600U] =
>>>                 0xffffffff65510600
>>>                 "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>>>                 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDDDDDDDDDDDDDDD
>>>                 DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>>>                 DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>>>                 DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>>>                 DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>>>                 DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>>>                 DDDDDDDDDDDDDDDDDDDDDDDDDD
>>>
>>>                 'D' is 0x44
>>>
>>>                 N_words is 0x40
>>>
>>>                 Jon
>>>
>>>
>>>                 On 05/12/2015 08:46 AM, Srinivas Ramakrishna wrote:
>>>>                 Thanks for that background Jon; I'll try and look
>>>>                 at the 8uxx code and run James's example when I get
>>>>                 a few spare cycles (which won't be soon,
>>>>                 unfortunately).
>>>>
>>>>                 -- ramki
>>>>
>>>>                 ysr1729
>>>>
>>>>                 On May 8, 2015, at 14:25, Jon Masamitsu
>>>>                 <jon.masamitsu at oracle.com
>>>>                 <mailto:jon.masamitsu at oracle.com>> wrote:
>>>>
>>>>>                 Ramki,
>>>>>
>>>>>                 I compared the memory/blockOffsetTable.* files
>>>>>                 between 6u35 and 7 and there
>>>>>                 were essentially no differences. The .cpp file had
>>>>>                 some changes in assertions.
>>>>>                 The other two were the same except for comments.
>>>>>
>>>>>                 Between 6 and 6u35 (didn't pin point it), we
>>>>>                 changed from the rather
>>>>>                 weak N_powers=3 :-) to N_powers=14.
>>>>>
>>>>>                 There seem to be two problematic cases.
>>>>>
>>>>>                 1)  Spike in ParNew times that lasts for 2 (or a
>>>>>                 few) GC's
>>>>>                 2) Continually growing ParNew times
>>>>>
>>>>>                 From just reading the CR
>>>>>
>>>>>                 https://bugs.openjdk.java.net/browse/JDK-8079274
>>>>>
>>>>>                 the test case seems to reproduce 1). The test case
>>>>>                 has a large object that lives till the application
>>>>>                 exits.
>>>>>                 If it were a broken logarithmic BOT, I would not
>>>>>                 expect
>>>>>                 the ParNew pauses to shrink again.
>>>>>
>>>>>                 Jon
>>>>>
>>>>>
>>>>>
>>>>>                 On 5/8/2015 11:49 AM, Srinivas Ramakrishna wrote:
>>>>>>                 This is indeed interesting...
>>>>>>
>>>>>>                 Jesper, did you figure out why (as James seems to
>>>>>>                 note) the problem is seen in JDK 8 but not in JDK
>>>>>>                 6 (he's using ParNew in both cases). Also, isn't
>>>>>>                 the BOT logarithmic, which will allow threads to
>>>>>>                 "quickly" traverse back to the start of even a
>>>>>>                 large object? Did something change in the BOT
>>>>>>                 that might have led to what James sees as a
>>>>>>                 regression, or did the range that he
>>>>>>                 characterizes as a "blind spot" merely move from
>>>>>>                 one place to another?
>>>>>>
>>>>>>                 Unfortunately, I have not been keeping up with
>>>>>>                 the changes in this area and am not working with
>>>>>>                 JDK 6 code much, so can't offer any clues, but am
>>>>>>                 surprised that this would be a regression. (I
>>>>>>                 would expect the same performance pathologies in
>>>>>>                 this code to exist in 6 and forward -- has it
>>>>>>                 been established conclusively that this is indeed
>>>>>>                 a regression?)
>>>>>>
>>>>>>                 thanks!
>>>>>>                 -- ramki
>>>>>>
>>>>>>                 On Fri, May 8, 2015 at 6:40 AM, James Lent
>>>>>>                 <jlent at tivo.com <mailto:jlent at tivo.com>> wrote:
>>>>>>
>>>>>>                     The explanation of the problem added by
>>>>>>                     Jesper Wilhelmsson on 2015-05-06 08:59 to
>>>>>>                     JDK-8079274 is interesting, but, not very
>>>>>>                     satisfying.
>>>>>>
>>>>>>                     1) The problem affects more than this one
>>>>>>                     specific breakpoint. It impacts objects in at
>>>>>>                     least the range of about 480M to 512M. The
>>>>>>                     comment "A larger object probably triggers
>>>>>>                     some higher level code that are able to skip
>>>>>>                     scanning of this part of the memory" does not
>>>>>>                     inspire confidence. It makes even more
>>>>>>                     concerned that we do not understand this
>>>>>>                     issue and that there could be other
>>>>>>                     triggers.  It seems like "the higher level
>>>>>>                     code" has a 32M blind spot. Further I can
>>>>>>                     create scenarios where the GC code takes
>>>>>>                     progressively longer to clear the issue.
>>>>>>                     Presumably it completes the lower block scan
>>>>>>                     during each GC cycle.  Why does it take
>>>>>>                     longer and longer each time (my guess is it
>>>>>>                     has something to do with the number of
>>>>>>                     objects moved to survivor space).  If I cache
>>>>>>                     1 in 100 instead of 1 in 10 then there are 11
>>>>>>                     progressively slower ParNewGC (after several
>>>>>>                     fast one) and the issue takes minutes to clear:
>>>>>>
>>>>>>                     $ /usr/lib/jvm/java-7-oracle/bin/java
>>>>>>                     -verbose:gc -XX:+PrintGCTimeStamps
>>>>>>                     -XX:+PrintGCDetails -XX:+UseParNewGC
>>>>>>                     -XX:+UseConcMarkSweepGC -Xmx4G -Xms4G
>>>>>>                     LongParNewPause 536870384 100 100
>>>>>>                     0.308: [GC0.308: [ParNew:
>>>>>>                     272640K->3029K(306688K), 0.0113180 secs]
>>>>>>                     796927K->527316K(4160256K), 0.0114040 secs]
>>>>>>                     [Times: user=0.05 sys=0.00, real=0.02 secs]
>>>>>>                     0.367: [GC0.367: [ParNew:
>>>>>>                     275669K->7766K(306688K), 0.0140840 secs]
>>>>>>                     799956K->532053K(4160256K), 0.0141630 secs]
>>>>>>                     [Times: user=0.04 sys=0.01, real=0.02 secs]
>>>>>>                     0.430: [GC0.430: [ParNew:
>>>>>>                     280406K->10057K(306688K), 0.0120810 secs]
>>>>>>                     804693K->534344K(4160256K), 0.0121690 secs]
>>>>>>                     [Times: user=0.05 sys=0.00, real=0.02 secs]
>>>>>>                     0.490: [GC0.490: [ParNew:
>>>>>>                     282697K->15383K(306688K), 0.0128540 secs]
>>>>>>                     806984K->539670K(4160256K), 0.0129390 secs]
>>>>>>                     [Times: user=0.05 sys=0.00, real=0.02 secs]
>>>>>>                     0.551: [GC0.551: [ParNew:
>>>>>>                     288023K->21006K(306688K), 0.0146480 secs]
>>>>>>                     812310K->545294K(4160256K), 0.0147650 secs]
>>>>>>                     [Times: user=0.05 sys=0.01, real=0.01 secs]
>>>>>>                     0.613: [GC0.613: [ParNew:
>>>>>>                     293646K->26805K(306688K), 0.0141250 secs]
>>>>>>                     817934K->551092K(4160256K), 0.0142050 secs]
>>>>>>                     [Times: user=0.05 sys=0.00, real=0.01 secs]
>>>>>>                     0.675: [GC0.675: [ParNew:
>>>>>>                     299445K->24258K(306688K), 0.4816840 secs]
>>>>>>                     823732K->551436K(4160256K), 0.4817620 secs]
>>>>>>                     [Times: user=1.90 sys=0.00, real=0.48 secs]
>>>>>>                     1.205: [GC1.205: [ParNew:
>>>>>>                     296898K->17914K(306688K), 3.4216550 secs]
>>>>>>                     824076K->547788K(4160256K), 3.4217180 secs]
>>>>>>                     [Times: user=13.50 sys=0.00, real=3.42 secs]
>>>>>>                     4.674: [GC4.674: [ParNew:
>>>>>>                     290554K->17433K(306688K), 6.3406820 secs]
>>>>>>                     820428K->550031K(4160256K), 6.3407410 secs]
>>>>>>                     [Times: user=25.00 sys=0.02, real=6.34 secs]
>>>>>>                     11.062: [GC11.062: [ParNew:
>>>>>>                     290073K->17315K(306688K), 10.9493130 secs]
>>>>>>                     822671K->552636K(4160256K), 10.9493890 secs]
>>>>>>                     [Times: user=42.98 sys=0.03, real=10.95 secs]
>>>>>>                     22.059: [GC22.059: [ParNew:
>>>>>>                     289955K->17758K(306688K), 16.0539650 secs]
>>>>>>                     825276K->555798K(4160256K), 16.0540250 secs]
>>>>>>                     [Times: user=62.44 sys=0.09, real=16.05 secs]
>>>>>>                     38.161: [GC38.161: [ParNew:
>>>>>>                     290398K->17745K(306688K), 20.6900820 secs]
>>>>>>                     828438K->558512K(4160256K), 20.6901420 secs]
>>>>>>                     [Times: user=80.79 sys=0.08, real=20.69 secs]
>>>>>>                     58.898: [GC58.898: [ParNew:
>>>>>>                     290385K->21826K(306688K), 24.6781700 secs]
>>>>>>                     831152K->565310K(4160256K), 24.6782710 secs]
>>>>>>                     [Times: user=95.58 sys=0.05, real=24.68 secs]
>>>>>>                     83.625: [GC83.625: [ParNew:
>>>>>>                     294466K->21834K(306688K), 30.3501800 secs]
>>>>>>                     837950K->568044K(4160256K), 30.3502610 secs]
>>>>>>                     [Times: user=115.80 sys=0.11, real=30.35 secs]
>>>>>>                     114.022: [GC114.022: [ParNew:
>>>>>>                     294474K->21836K(306688K), 33.6648800 secs]
>>>>>>                     840684K->570772K(4160256K), 33.6649390 secs]
>>>>>>                     [Times: user=127.38 sys=0.14, real=33.67 secs]
>>>>>>                     147.736: [GC147.736: [ParNew:
>>>>>>                     294476K->18454K(306688K), 38.9828430 secs]
>>>>>>                     843412K->570117K(4160256K), 38.9829040 secs]
>>>>>>                     [Times: user=149.14 sys=0.15, real=38.99 secs]
>>>>>>                     186.766: [GC186.766: [ParNew:
>>>>>>                     291094K->26602K(306688K), 39.8726730 secs]
>>>>>>                     842757K->580977K(4160256K), 39.8727330 secs]
>>>>>>                     [Times: user=156.35 sys=0.11, real=39.88 secs]
>>>>>>                     226.686: [GC226.686: [ParNew:
>>>>>>                     299242K->26612K(306688K), 0.0174470 secs]
>>>>>>                     853617K->583714K(4160256K), 0.0175300 secs]
>>>>>>                     [Times: user=0.06 sys=0.00, real=0.02 secs]
>>>>>>
>>>>>>                     2) The suggestion to use only 1 thread is not
>>>>>>                     acceptable to me.  Use parallel GC with one
>>>>>>                     thread? Further it did not work for me (in
>>>>>>                     fact it made things worse). "user" time does
>>>>>>                     now equal "real" time (and the new size is
>>>>>>                     1/4 as big) so I think the parameter took
>>>>>>                     effect:
>>>>>>
>>>>>>                     $ /usr/lib/jvm/java-8-oracle/bin/java
>>>>>>                     -verbose:gc -XX:+PrintGCTimeStamps
>>>>>>                     -XX:+PrintGCDetails -XX:+UseParNewGC
>>>>>>                     -XX:+UseConcMarkSweepGC -Xmx4G -Xms4G
>>>>>>                     -XX:ParallelGCThreads=1 LongParNewPause
>>>>>>                     536870384 100 10
>>>>>>                     0.226: [GC (Allocation Failure) 0.226:
>>>>>>                     [ParNew: 68160K->6980K(76672K), 0.0489756
>>>>>>                     secs] 592447K->531267K(4185792K), 0.0490718
>>>>>>                     secs] [Times: user=0.05 sys=0.00, real=0.05
>>>>>>                     secs]
>>>>>>                     0.291: [GC (Allocation Failure) 0.291:
>>>>>>                     [ParNew: 75140K->8512K(76672K), 29.8707281
>>>>>>                     secs] 599427K->539479K(4185792K), 29.8707954
>>>>>>                     secs] [Times: user=29.80 sys=0.00, real=29.88
>>>>>>                     secs]
>>>>>>                     30.180: [GC (Allocation Failure) 30.180:
>>>>>>                     [ParNew: 76672K->8093K(76672K), 30.6357699
>>>>>>                     secs] 607639K->545772K(4185792K), 30.6358257
>>>>>>                     secs] [Times: user=30.57 sys=0.00, real=30.63
>>>>>>                     secs]
>>>>>>                     60.837: [GC (Allocation Failure) 60.837:
>>>>>>                     [ParNew: 76253K->6830K(76672K), 69.2115255
>>>>>>                     secs] 613932K->552169K(4185792K), 69.2115783
>>>>>>                     secs] [Times: user=69.05 sys=0.01, real=69.22
>>>>>>                     secs]
>>>>>>                     130.068: [GC (Allocation Failure) 130.068:
>>>>>>                     [ParNew: 74990K->8226K(76672K), 132.5118903
>>>>>>                     secs] 620329K->560381K(4185792K), 132.5119459
>>>>>>                     secs] [Times: user=132.21 sys=0.00,
>>>>>>                     real=132.52 secs]
>>>>>>                     262.603: [GC (Allocation Failure) 262.603:
>>>>>>                     [ParNew: 76386K->8512K(76672K), 0.0400138
>>>>>>                     secs] 628541K->568748K(4185792K), 0.0400687
>>>>>>                     secs] [Times: user=0.04 sys=0.00, real=0.04
>>>>>>                     secs]
>>>>>>
>>>>>>                     3) Applications should not be expected to
>>>>>>                     tune the GC settings to account for the size
>>>>>>                     of one static object.  The connection between
>>>>>>                     the size of this object and the GC
>>>>>>                     performance was not easy to even find.
>>>>>>                     Further the parameter
>>>>>>                     ParGCCardsPerStrideChunk is pretty low level
>>>>>>                     and not one I think many application
>>>>>>                     developers are familiar with.
>>>>>>
>>>>>>                     Error: VM option 'ParGCCardsPerStrideChunk'
>>>>>>                     is diagnostic and must be enabled via
>>>>>>                     -XX:+UnlockDiagnosticVMOptions.
>>>>>>                     Error: Could not create the Java Virtual Machine.
>>>>>>                     Error: A fatal exception has occurred.
>>>>>>                     Program will exit.
>>>>>>
>>>>>>                     4) I can not reproduce this issue in Oracle
>>>>>>                     6. This appears to be a regression. How can
>>>>>>                     fixing a regression be called an enhancement
>>>>>>                     request?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>                     ------------------------------------------------------------------------
>>>>>>
>>>>>>                     This email and any attachments may contain
>>>>>>                     confidential and privileged material for the
>>>>>>                     sole use of the intended recipient. Any
>>>>>>                     review, copying, or distribution of this
>>>>>>                     email (or any attachments) by others is
>>>>>>                     prohibited. If you are not the intended
>>>>>>                     recipient, please contact the sender
>>>>>>                     immediately and permanently delete this email
>>>>>>                     and any attachments. No employee or agent of
>>>>>>                     TiVo Inc. is authorized to conclude any
>>>>>>                     binding agreement on behalf of TiVo Inc. by
>>>>>>                     email. Binding agreements with TiVo Inc. may
>>>>>>                     only be made by a signed written agreement.
>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>
>>
>>
>>
>>
>>     -- 
>>     Jungwoo Ha | Java Platform Team | jwha at google.com
>>     <mailto:jwha at google.com>
>>
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20150602/1db54c69/attachment.htm>