Strange Long ParNew GC Pauses (Sample Code Included)

Srinivas Ramakrishna ysr1729 at gmail.com
Tue Jun 9 19:29:04 UTC 2015


Sorry, this hid in my inbox. I'll take a look at this tonight and get back
to you.

thanks and sorry for the delay.
-- ramki

On Fri, Jun 5, 2015 at 9:56 AM, Jon Masamitsu <jon.masamitsu at oracle.com>
wrote:

>  Ramki,
>
> The reimplementation for split_block is here.
>
> http://cr.openjdk.java.net/~jmasa/8081629/webrev.01/
>
> When you get a chance, can you consider the premise of
> this simplification?  Whereas the original code based the
> fixup of the power blocks on the suffix (find the power
> block based on the suffix and fix backwards), the new
> implementation bases the fixup on the prefix (find the
> start of the prefix power block and fix forward).
> Does that sound correct and worth the change?
>
> The webrev.00 has the fix for the problem and we could
> just go with that to minimize the changes.
>
> Thanks.
>
> Jon
>
>
>
> On 6/2/2015 10:36 AM, Jon Masamitsu wrote:
>
> Ramki,
>
> The fix is here
>
> http://cr.openjdk.java.net/~jmasa/8081629/webrev.00/
>
> but is not out for review yet because I'm trying to see
> if I can simplify the block of code that does the power
> block fixup.  If you have any suggestions, send them my way.
> Mostly I'd like to get rid of the "more" variable.
>
> Jon
>
> On 06/01/2015 05:45 PM, Srinivas Ramakrishna wrote:
>
>
> Hi Jon --
>
>  I missed this earlier, but just saw the bug you opened for this issue.
> Out of curiosity, is the issue that the fix-up of the "suffix block"
> doesn't go sufficiently far back in that block, or is it something else?
>
>  thanks!
> -- ramki
>
>
> On Thu, May 14, 2015 at 6:35 PM, Jon Masamitsu <jon.masamitsu at oracle.com>
> wrote:
>
>>  Jungwoo,
>>
>> Thanks for the information.
>>
>> There's a bug in the current code which tries to patch up the BOT after
>> a block is split.  I don't understand yet what is wrong, but a simple
>> workaround (which is known to be slower than necessary) avoids much
>> of the long pauses.  After this has been fixed, we should compare
>> the performance of our two implementations.
>>
>> Jon
>>
>>
>>
>> On 5/14/2015 2:52 PM, Jungwoo Ha wrote:
>>
>> We have a faster implementation for card table scan, and with our
>> implementation the ~20s pause goes down to 1-2s in the above-mentioned
>> experiments. It doesn't solve the problem but it alleviates it (and
>> actually shows that the card table scan is faster).
>> If someone can sponsor this, I will create a patch for the review.
>>
>>
>> On Wed, May 13, 2015 at 6:36 PM, Srinivas Ramakrishna <ysr1729 at gmail.com>
>> wrote:
>>
>>> Hi Jon --
>>>
>>>  thanks, just a comment inline below....
>>>
>>> On Wed, May 13, 2015 at 4:06 PM, Jon Masamitsu <jon.masamitsu at oracle.com
>>> > wrote:
>>>
>>>>
>>>>
>>>> On 5/13/2015 11:40 AM, Srinivas Ramakrishna wrote:
>>>>
>>>> Hi Jon --
>>>>
>>>>  My expectation is that for any large block spanning multiple cards,
>>>> if you walk the BOT entries for that block from lower addresses to higher
>>>> addresses, the BOT entries should be monotone increasing (albeit not
>>>> strictly monotone) until the last card for that block (but one). In the
>>>> example you gave below, is D > C > N_words, with the BOT entries in some
>>>> interior window of the same block/object going (low to high addresses in
>>>> the window) like " .... DDDDCCCCDDDD...." ? In other words, there are
>>>> lower-valued entries C within a range of higher-valued entries D, with both
>>>> C and D denoting logarithmic entries for the block in question?
>>>>
>>>>
>>>>  Yes, that's what I'm seeing.
>>>>
>>>>
>>>>  If you look in the BOT methods, there was a parameter "reducing"
>>>> passed to some of the mark_block() methods which was supposed to do some
>>>> kind of assertion checking towards maintenance of this monotonicity, but
>>>> looking at it now, that may not be sufficient and the assert could be
>>>> strengthened at the "end" of the range being marked. More worryingly, I
>>>> believe I had turned off even that assert before I left (ca 6uXX?) because
>>>> it was failing at that time, but the corresponding bug hadn't been fixed,
>>>> and seems as though it still isn't. This might possibly  have some bearing
>>>> on the non-monotonicity in the entries you see.
>>>>
>>>>
>>>>  Thanks for that bit of information.   So what I'm seeing is not
>>>> necessarily the cause for the long pauses.  But see my next comment.
>>>>
>>>>
>>>>  I'll hopefully get some time to go over this code again. I was
>>>> looking at the code on the train ride into work this morning, and trying to
>>>> remember the details, but have not quite delved again into all of its
>>>> details yet or why the assert I mention above should be failing.
>>>>
>>>>
>>>>  After the logarithmic stride back to find the block containing the
>>>> start of the object, there is an object-by-object walk forward to
>>>> find the start of the object:
>>>>
>>>>    while (n <= addr) {
>>>>      debug_only(HeapWord* last = q);   // for debugging
>>>>      q = n;
>>>>      n += _sp->block_size(n);
>>>>
>>>> [...]
>>>>
>>>> From instrumentation I think the long pauses originate in this forward
>>>> walk (mostly in the block_size() method).  So maybe the BOT pattern
>>>> "DDDDDCCCCC...CDDDD" indicates that the BOT at the start of an object
>>>> is being trashed and we're going way too far back.  I'm trying to use
>>>> VerifyBlockOffsetArray but it seems to have bit rotted.  I'll keep
>>>> trying.
>>>>
>>>
>>>  Ah, that's what I was afraid of. This most likely means that when a
>>> large block is split, the suffix from the split (which is typically
>>> returned to the free list) isn't getting fixed up as was the intention,
>>> or that when the sweeper coalesces smaller blocks into a larger one,
>>> the new larger block isn't getting fixed up correctly (although that
>>> seems unlikely). Using VerifyBlockOffsetArray and asserting if the
>>> forward walk is more than a card's worth in length would seem to be
>>> one way to catch that badness closer to when it happens...
>>>
>>>  Thanks for the update, Jon!
>>> -- ramki
>>>
>>>
>>>>
>>>> Jon
>>>>
>>>>
>>>>
>>>>  It might be worthwhile (if it is possible) to do a bisect on whatever
>>>> the first known JDK (7uXX?) where the performance regression appears and
>>>> try to narrow down the change that may have caused the regression, using
>>>> James' test program. I have a sneaking suspicion (because of the elided
>>>> assert) that it might well be a change I had made towards closing some
>>>> BOT-walking/card-scanning cracks when initializing CMS objects ca 6uXX.
>>>>
>>>>  James, in case you have read thus far, can you provide the exact
>>>> version number of the 6uXX Oracle jdk you used that did _not_ show the
>>>> performance regression for your test program?
>>>>
>>>>  Thanks!
>>>> -- ramki
>>>>
>>>> On Tue, May 12, 2015 at 1:50 PM, Jon Masamitsu <
>>>> jon.masamitsu at oracle.com> wrote:
>>>>
>>>>>  Ramki,
>>>>>
>>>>> The logarithmic BOT is taking big steps back to
>>>>> find the start of a large object.  But something is odd.
>>>>>
>>>>> Is there a reason why the BOT entries would decrease for a while
>>>>> and then increase without a value that gives the start of an object?
>>>>>
>>>>> &_array->_offset_array[index] = 0xffffffff6550f600 "DDDDDDDDDDDDDD ..."
>>>>>
>>>>> and then down the road.
>>>>>
>>>>>  print &_array->_offset_array[0x110600U],
>>>>> _array->_offset_array[0x10f600U >
>>>>> &_array->_offset_array[0x110600U] = 0xffffffff65510600
>>>>> "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>>>>> CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDDDDDDDDDDDDDDD
>>>>> DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>>>>> DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>>>>> DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>>>>> DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>>>>> DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>>>>> DDDDDDDDDDDDDDDDDDDDDDDDDD
>>>>>
>>>>> 'D' is 0x44
>>>>>
>>>>> N_words is 0x40
>>>>>
>>>>> Jon
>>>>>
>>>>>
>>>>> On 05/12/2015 08:46 AM, Srinivas Ramakrishna wrote:
>>>>>
>>>>> Thanks for that background Jon; I'll try and look at the 8uxx code and
>>>>> run James's example when I get a few spare cycles (which won't be soon,
>>>>> unfortunately).
>>>>>
>>>>>  -- ramki
>>>>>
>>>>> ysr1729
>>>>>
>>>>> On May 8, 2015, at 14:25, Jon Masamitsu <jon.masamitsu at oracle.com>
>>>>> wrote:
>>>>>
>>>>>   Ramki,
>>>>>
>>>>> I compared the memory/blockOffsetTable.* files between 6u35 and 7 and
>>>>> there
>>>>> were essentially no differences.  The .cpp file had some changes in
>>>>> assertions.
>>>>> The other two were the same except for comments.
>>>>>
>>>>> Between 6 and 6u35 (didn't pinpoint it), we changed from the rather
>>>>> weak N_powers=3  :-) to N_powers=14.
>>>>>
>>>>> There seem to be two problematic cases.
>>>>>
>>>>> 1)  Spike in ParNew times that lasts for 2 (or a few) GC's
>>>>> 2) Continually growing ParNew times
>>>>>
>>>>> From just reading the CR
>>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8079274
>>>>>
>>>>> the test case seems to reproduce 1).  The test case
>>>>> has a large object that lives till the application exits.
>>>>> If it were a broken logarithmic BOT, I would not expect
>>>>> the ParNew pauses to shrink again.
>>>>>
>>>>> Jon
>>>>>
>>>>>
>>>>>
>>>>> On 5/8/2015 11:49 AM, Srinivas Ramakrishna wrote:
>>>>>
>>>>> This is indeed interesting...
>>>>>
>>>>>  Jesper, did you figure out why (as James seems to note) the problem
>>>>> is seen in JDK 8 but not in JDK 6 (he's using ParNew in both cases). Also,
>>>>> isn't the BOT logarithmic, which will allow threads to "quickly" traverse
>>>>> back to the start of even a large object? Did something change in the BOT
>>>>> that might have led to what James sees as a regression, or did the range
>>>>> that he characterizes as a "blind spot" merely move from one place to
>>>>> another?
>>>>>
>>>>>  Unfortunately, I have not been keeping up with the changes in this
>>>>> area and am not working with JDK 6 code much, so can't offer any clues, but
>>>>> am surprised that this would be a regression. (I would expect the same
>>>>> performance pathologies in this code to exist in 6 and forward -- has it
>>>>> been established conclusively that this is indeed a regression?)
>>>>>
>>>>>  thanks!
>>>>> -- ramki
>>>>>
>>>>> On Fri, May 8, 2015 at 6:40 AM, James Lent <jlent at tivo.com> wrote:
>>>>>
>>>>>>  The explanation of the problem added by Jesper Wilhelmsson on 2015-05-06
>>>>>> 08:59 to JDK-8079274 is interesting but not very satisfying.
>>>>>>
>>>>>> 1) The problem affects more than this one specific breakpoint.  It
>>>>>> impacts objects in at least the range of about 480M to 512M.  The comment
>>>>>> "A larger object probably triggers some higher level code that are able to
>>>>>> skip scanning of this part of the memory" does not inspire confidence. It
>>>>>> makes me even more concerned that we do not understand this issue and that
>>>>>> there could be other triggers.  It seems like "the higher level code" has a
>>>>>> 32M blind spot.  Further, I can create scenarios where the GC code takes
>>>>>> progressively longer to clear the issue.  Presumably it completes the lower
>>>>>> block scan during each GC cycle.  Why does it take longer and longer each
>>>>>> time?  (My guess is it has something to do with the number of objects moved
>>>>>> to survivor space.)  If I cache 1 in 100 instead of 1 in 10 then there are
>>>>>> 11 progressively slower ParNewGC pauses (after several fast ones) and the
>>>>>> issue takes minutes to clear:
>>>>>>
>>>>>> $ /usr/lib/jvm/java-7-oracle/bin/java -verbose:gc
>>>>>> -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseParNewGC
>>>>>> -XX:+UseConcMarkSweepGC -Xmx4G -Xms4G LongParNewPause 536870384 100 100
>>>>>> 0.308: [GC0.308: [ParNew: 272640K->3029K(306688K), 0.0113180 secs]
>>>>>> 796927K->527316K(4160256K), 0.0114040 secs] [Times: user=0.05 sys=0.00,
>>>>>> real=0.02 secs]
>>>>>> 0.367: [GC0.367: [ParNew: 275669K->7766K(306688K), 0.0140840 secs]
>>>>>> 799956K->532053K(4160256K), 0.0141630 secs] [Times: user=0.04 sys=0.01,
>>>>>> real=0.02 secs]
>>>>>> 0.430: [GC0.430: [ParNew: 280406K->10057K(306688K), 0.0120810 secs]
>>>>>> 804693K->534344K(4160256K), 0.0121690 secs] [Times: user=0.05 sys=0.00,
>>>>>> real=0.02 secs]
>>>>>> 0.490: [GC0.490: [ParNew: 282697K->15383K(306688K), 0.0128540 secs]
>>>>>> 806984K->539670K(4160256K), 0.0129390 secs] [Times: user=0.05 sys=0.00,
>>>>>> real=0.02 secs]
>>>>>> 0.551: [GC0.551: [ParNew: 288023K->21006K(306688K), 0.0146480 secs]
>>>>>> 812310K->545294K(4160256K), 0.0147650 secs] [Times: user=0.05 sys=0.01,
>>>>>> real=0.01 secs]
>>>>>> 0.613: [GC0.613: [ParNew: 293646K->26805K(306688K), 0.0141250 secs]
>>>>>> 817934K->551092K(4160256K), 0.0142050 secs] [Times: user=0.05 sys=0.00,
>>>>>> real=0.01 secs]
>>>>>> 0.675: [GC0.675: [ParNew: 299445K->24258K(306688K), 0.4816840 secs]
>>>>>> 823732K->551436K(4160256K), 0.4817620 secs] [Times: user=1.90 sys=0.00,
>>>>>> real=0.48 secs]
>>>>>> 1.205: [GC1.205: [ParNew: 296898K->17914K(306688K), 3.4216550 secs]
>>>>>> 824076K->547788K(4160256K), 3.4217180 secs] [Times: user=13.50 sys=0.00,
>>>>>> real=3.42 secs]
>>>>>> 4.674: [GC4.674: [ParNew: 290554K->17433K(306688K), 6.3406820 secs]
>>>>>> 820428K->550031K(4160256K), 6.3407410 secs] [Times: user=25.00 sys=0.02,
>>>>>> real=6.34 secs]
>>>>>> 11.062: [GC11.062: [ParNew: 290073K->17315K(306688K), 10.9493130
>>>>>> secs] 822671K->552636K(4160256K), 10.9493890 secs] [Times: user=42.98
>>>>>> sys=0.03, real=10.95 secs]
>>>>>> 22.059: [GC22.059: [ParNew: 289955K->17758K(306688K), 16.0539650
>>>>>> secs] 825276K->555798K(4160256K), 16.0540250 secs] [Times: user=62.44
>>>>>> sys=0.09, real=16.05 secs]
>>>>>> 38.161: [GC38.161: [ParNew: 290398K->17745K(306688K), 20.6900820
>>>>>> secs] 828438K->558512K(4160256K), 20.6901420 secs] [Times: user=80.79
>>>>>> sys=0.08, real=20.69 secs]
>>>>>> 58.898: [GC58.898: [ParNew: 290385K->21826K(306688K), 24.6781700
>>>>>> secs] 831152K->565310K(4160256K), 24.6782710 secs] [Times: user=95.58
>>>>>> sys=0.05, real=24.68 secs]
>>>>>> 83.625: [GC83.625: [ParNew: 294466K->21834K(306688K), 30.3501800
>>>>>> secs] 837950K->568044K(4160256K), 30.3502610 secs] [Times: user=115.80
>>>>>> sys=0.11, real=30.35 secs]
>>>>>> 114.022: [GC114.022: [ParNew: 294474K->21836K(306688K), 33.6648800
>>>>>> secs] 840684K->570772K(4160256K), 33.6649390 secs] [Times: user=127.38
>>>>>> sys=0.14, real=33.67 secs]
>>>>>> 147.736: [GC147.736: [ParNew: 294476K->18454K(306688K), 38.9828430
>>>>>> secs] 843412K->570117K(4160256K), 38.9829040 secs] [Times: user=149.14
>>>>>> sys=0.15, real=38.99 secs]
>>>>>> 186.766: [GC186.766: [ParNew: 291094K->26602K(306688K), 39.8726730
>>>>>> secs] 842757K->580977K(4160256K), 39.8727330 secs] [Times: user=156.35
>>>>>> sys=0.11, real=39.88 secs]
>>>>>> 226.686: [GC226.686: [ParNew: 299242K->26612K(306688K), 0.0174470
>>>>>> secs] 853617K->583714K(4160256K), 0.0175300 secs] [Times: user=0.06
>>>>>> sys=0.00, real=0.02 secs]
>>>>>>
>>>>>> 2) The suggestion to use only 1 thread is not acceptable to me.  Use
>>>>>> parallel GC with one thread?  Further it did not work for me (in fact it
>>>>>> made things worse).  "user" time does now equal "real" time (and the new
>>>>>> size is 1/4 as big) so I think the parameter took effect:
>>>>>>
>>>>>> $ /usr/lib/jvm/java-8-oracle/bin/java -verbose:gc
>>>>>> -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseParNewGC
>>>>>> -XX:+UseConcMarkSweepGC -Xmx4G -Xms4G -XX:ParallelGCThreads=1
>>>>>> LongParNewPause 536870384 100 10
>>>>>> 0.226: [GC (Allocation Failure) 0.226: [ParNew:
>>>>>> 68160K->6980K(76672K), 0.0489756 secs] 592447K->531267K(4185792K),
>>>>>> 0.0490718 secs] [Times: user=0.05 sys=0.00, real=0.05 secs]
>>>>>> 0.291: [GC (Allocation Failure) 0.291: [ParNew:
>>>>>> 75140K->8512K(76672K), 29.8707281 secs] 599427K->539479K(4185792K),
>>>>>> 29.8707954 secs] [Times: user=29.80 sys=0.00, real=29.88 secs]
>>>>>> 30.180: [GC (Allocation Failure) 30.180: [ParNew:
>>>>>> 76672K->8093K(76672K), 30.6357699 secs] 607639K->545772K(4185792K),
>>>>>> 30.6358257 secs] [Times: user=30.57 sys=0.00, real=30.63 secs]
>>>>>> 60.837: [GC (Allocation Failure) 60.837: [ParNew:
>>>>>> 76253K->6830K(76672K), 69.2115255 secs] 613932K->552169K(4185792K),
>>>>>> 69.2115783 secs] [Times: user=69.05 sys=0.01, real=69.22 secs]
>>>>>> 130.068: [GC (Allocation Failure) 130.068: [ParNew:
>>>>>> 74990K->8226K(76672K), 132.5118903 secs] 620329K->560381K(4185792K),
>>>>>> 132.5119459 secs] [Times: user=132.21 sys=0.00, real=132.52 secs]
>>>>>> 262.603: [GC (Allocation Failure) 262.603: [ParNew:
>>>>>> 76386K->8512K(76672K), 0.0400138 secs] 628541K->568748K(4185792K),
>>>>>> 0.0400687 secs] [Times: user=0.04 sys=0.00, real=0.04 secs]
>>>>>>
>>>>>> 3) Applications should not be expected to tune the GC settings to
>>>>>> account for the size of one static object.  The connection between the
>>>>>> size of this object and the GC performance was not even easy to find.
>>>>>> Further, the parameter ParGCCardsPerStrideChunk is pretty low-level and
>>>>>> not one I think many application developers are familiar with.
>>>>>>
>>>>>> Error: VM option 'ParGCCardsPerStrideChunk' is diagnostic and must be
>>>>>> enabled via -XX:+UnlockDiagnosticVMOptions.
>>>>>> Error: Could not create the Java Virtual Machine.
>>>>>> Error: A fatal exception has occurred. Program will exit.
>>>>>>
>>>>>>  4) I cannot reproduce this issue in Oracle 6.  This appears to be a
>>>>>> regression.  How can fixing a regression be called an enhancement request?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>>  --
>>  Jungwoo Ha | Java Platform Team | jwha at google.com
>>
>>
>>
>
>
>


More information about the hotspot-gc-dev mailing list