Strange Long ParNew GC Pauses (Sample Code Included)
Jungwoo Ha
jwha at google.com
Thu May 14 21:52:48 UTC 2015
We have a faster implementation of the card table scan, and with it the ~20s
pause goes down to 1-2s in the above-mentioned experiments. It doesn't solve
the underlying problem, but it alleviates it (and shows that the card table
scan itself can be made faster).
If someone can sponsor this, I will create a patch for review.
On Wed, May 13, 2015 at 6:36 PM, Srinivas Ramakrishna <ysr1729 at gmail.com>
wrote:
> Hi Jon --
>
> thanks, just a comment inline below....
>
> On Wed, May 13, 2015 at 4:06 PM, Jon Masamitsu <jon.masamitsu at oracle.com>
> wrote:
>
>>
>>
>> On 5/13/2015 11:40 AM, Srinivas Ramakrishna wrote:
>>
>> Hi Jon --
>>
>> My expectation is that for any large block spanning multiple cards, if
>> you walk the BOT entries for that block from lower addresses to higher
>> addresses, the entries should be monotone increasing (albeit not strictly
>> monotone) up to the last-but-one card of the block. In the example you
>> gave below, is D > C > N_words, with the BOT entries in some interior
>> window of the same block/object going (low to high addresses in the
>> window) like " .... DDDDCCCCDDDD...." ? In other words, are there
>> lower-valued entries C within a range of higher-valued entries D, with
>> both C and D denoting logarithmic entries for the block in question?
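>>
>> To make the invariant concrete, here is a rough sketch of the check I
>> have in mind (a model only, with made-up names -- not the actual
>> blockOffsetTable code):
>>
>>     class BotInvariantSketch {
>>         // entries[i] is the BOT entry for the i-th card spanned by one
>>         // large block, walking from lower to higher addresses. Ignoring
>>         // the last card, the values should never decrease (monotone,
>>         // though not strictly monotone).
>>         static boolean entriesMonotoneWithinBlock(int[] entries) {
>>             for (int i = 1; i < entries.length - 1; i++) {
>>                 if (entries[i] < entries[i - 1]) {
>>                     return false;  // a "....DDDDCCCCDDDD...." window fails here
>>                 }
>>             }
>>             return true;
>>         }
>>     }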
>>
>>
>> Yes, that's what I'm seeing.
>>
>>
>> If you look in the BOT methods, there was a parameter "reducing" passed
>> to some of the mark_block() methods which was supposed to do some
>> assertion checking to help maintain this monotonicity, but looking at it
>> now, that checking may not be sufficient, and the assert could be
>> strengthened at the "end" of the range being marked. More worryingly, I
>> believe I had turned off even that assert before I left (ca 6uXX?)
>> because it was failing at the time, but the corresponding bug hadn't been
>> fixed, and it seems it still hasn't. This might have some bearing on the
>> non-monotonicity in the entries you see.
>>
>>
>> Thanks for that bit of information. So what I'm seeing is not
>> necessarily the cause for the long pauses. But see my next comment.
>>
>>
>> I'll hopefully get some time to go over this code again. I was looking
>> at the code on the train ride into work this morning and trying to
>> remember how it works, but I have not yet delved back into all of its
>> details or into why the assert I mention above should be failing.
>>
>>
>> After the logarithmic stride back to find the block containing the
>> start of the object, there is an object-by-object walk forward to find
>> the start of the object:
>>
>> 558 while (n <= addr) {
>> 559 debug_only(HeapWord* last = q); // for debugging
>> 560 q = n;
>> 561 n += _sp->block_size(n);
>>
>> [...]
>>
>> From instrumentation I think the long pauses originate in this forward
>> walk (mostly in the block_size() method). So maybe the BOT
>> "DDDDDCCCCC...CDDDD" indicates that the BOT at the start of an object is
>> being trashed and we're walking back way too far. I'm trying to use
>> VerifyBlockOffsetArray, but it seems to have bit-rotted. I'll keep trying.
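>>
>> As a rough model of that forward walk (a sketch only -- blockSize() here
>> stands in for the _sp->block_size() call above, and none of these names
>> are the actual HotSpot ones):
>>
>>     class BotWalkModel {
>>         interface Space { long blockSize(long addr); }  // words in the block starting at addr
>>
>>         // q is the candidate block start produced by the logarithmic
>>         // back-stride; walk forward one block at a time until we step
>>         // past addr. If the BOT sent us back much too far, this loop
>>         // (and its blockSize() calls) is where the pause time goes.
>>         static long forwardToBlockContaining(Space sp, long q, long addr) {
>>             long n = q + sp.blockSize(q);
>>             while (n <= addr) {
>>                 q = n;
>>                 n += sp.blockSize(n);
>>             }
>>             return q;  // start of the block containing addr
>>         }
>>     }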
>>
>
> Ah, that's what I was afraid of. This most likely means that when a large
> block is split, the suffix from the split (which is typically returned to
> the free list) isn't getting fixed up as intended, or that when the
> sweeper coalesces smaller blocks into a larger one, the new larger block
> isn't getting fixed up correctly (although that seems unlikely). Using
> VerifyBlockOffsetArray and asserting if the forward walk is more than a
> card's worth in length would seem to be one way to catch that badness
> closer to when it happens...
>
> Thanks for the update, Jon!
> -- ramki
>
>
>>
>> Jon
>>
>>
>>
>> It might be worthwhile (if it is possible) to bisect back from the first
>> known JDK (7uXX?) in which the performance regression appears and try to
>> narrow down the change that caused it, using James' test program. I have
>> a sneaking suspicion (because of the elided assert) that it might well be
>> a change I had made towards closing some BOT-walking/card-scanning cracks
>> when initializing CMS objects ca 6uXX.
>>
>> James, in case you have read thus far, can you provide the exact
>> version number of the 6uXX Oracle jdk you used that did _not_ show the
>> performance regression for your test program?
>>
>> Thanks!
>> -- ramki
>>
>> On Tue, May 12, 2015 at 1:50 PM, Jon Masamitsu <jon.masamitsu at oracle.com>
>> wrote:
>>
>>> Ramki,
>>>
>>> The logarithmic BOT is taking big steps back to
>>> find the start of a large object. But something is odd.
>>>
>>> Is there a reason why the BOT entries would decrease for a while
>>> and then increase without a value that gives the start of an object?
>>>
>>> &_array->_offset_array[index] = 0xffffffff6550f600 "DDDDDDDDDDDDDD ..."
>>>
>>> and then down the road.
>>>
>>> print &_array->_offset_array[0x110600U],
>>> _array->_offset_array[0x10f600U >
>>> &_array->_offset_array[0x110600U] = 0xffffffff65510600
>>> "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>>> CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDDDDDDDDDDDDDDDD
>>> DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>>> DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>>> DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>>> DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>>> DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>>> DDDDDDDDDDDDDDDDDDDDDDDDDD
>>>
>>> 'D' is 0x44
>>>
>>> N_words is 0x40
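>>>
>>> Decoding those against N_words: 0x44 = N_words + 4 and 0x43 = N_words + 3,
>>> i.e. both are logarithmic back-skip entries, and the dump shows the values
>>> dropping from +4 to +3 and then climbing back to +4 -- exactly the
>>> non-monotone pattern discussed above. A trivial sketch of that decoding,
>>> with the caveat that the precise card stride each index maps to lives in
>>> the BOT's power table and is not spelled out here:
>>>
>>>     class BotEntrySketch {
>>>         static final int N_WORDS = 0x40;        // 64 words per card
>>>
>>>         // For entries above N_words (as both 'C' and 'D' are), the BOT
>>>         // stores a logarithmic entry; the difference below indexes the
>>>         // table of power-of-N card back-skips.
>>>         static int powerIndex(int entry) {
>>>             assert entry > N_WORDS : "not a logarithmic entry";
>>>             return entry - N_WORDS;             // 0x44 -> 4 ('D'), 0x43 -> 3 ('C')
>>>         }
>>>     }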
>>>
>>> Jon
>>>
>>>
>>> On 05/12/2015 08:46 AM, Srinivas Ramakrishna wrote:
>>>
>>> Thanks for that background Jon; I'll try and look at the 8uxx code and
>>> run James's example when I get a few spare cycles (which won't be soon,
>>> unfortunately).
>>>
>>> -- ramki
>>>
>>> ysr1729
>>>
>>> On May 8, 2015, at 14:25, Jon Masamitsu <jon.masamitsu at oracle.com>
>>> wrote:
>>>
>>> Ramki,
>>>
>>> I compared the memory/blockOffsetTable.* files between 6u35 and 7 and
>>> there
>>> were essentially no differences. The .cpp file had some changes in
>>> assertions.
>>> The other two were the same except for comments.
>>>
>>> Between 6 and 6u35 (I didn't pinpoint it), we changed from the rather
>>> weak N_powers=3 :-) to N_powers=14.
>>>
>>> There seem to be two problematic cases.
>>>
>>> 1) A spike in ParNew times that lasts for 2 (or a few) GCs
>>> 2) Continually growing ParNew times
>>>
>>> From just reading the CR
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8079274
>>>
>>> the test case seems to reproduce 1). The test case
>>> has a large object that lives until the application exits.
>>> If the cause were a broken logarithmic BOT, I would not expect
>>> the ParNew pauses to shrink again.
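>>>
>>> For anyone reading along without the CR open, the shape of the test, as
>>> I read it, is roughly the sketch below. This is only my paraphrase: the
>>> real LongParNewPause source and the meaning of its arguments are in
>>> JDK-8079274, and the argument interpretations here are guesses.
>>>
>>>     import java.util.ArrayList;
>>>     import java.util.List;
>>>
>>>     public class LongParNewPauseSketch {
>>>         static byte[] bigObject;                        // lives until the JVM exits
>>>         static final List<byte[]> cache = new ArrayList<>();
>>>
>>>         public static void main(String[] args) {
>>>             int bigSize   = Integer.parseInt(args[0]);  // e.g. 536870384 (just under 512M)
>>>             int smallSize = Integer.parseInt(args[1]);  // e.g. 100
>>>             int keepOneIn = Integer.parseInt(args[2]);  // e.g. cache 1 in 10 or 1 in 100
>>>
>>>             bigObject = new byte[bigSize];              // the single large long-lived object
>>>             long i = 0;
>>>             while (true) {
>>>                 byte[] b = new byte[smallSize];         // steady allocation churn
>>>                 if (i++ % keepOneIn == 0) {
>>>                     cache.add(b);                       // a fraction survives young GCs
>>>                 }
>>>             }
>>>         }
>>>     }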
>>>
>>> Jon
>>>
>>>
>>>
>>> On 5/8/2015 11:49 AM, Srinivas Ramakrishna wrote:
>>>
>>> This is indeed interesting...
>>>
>>> Jesper, did you figure out why (as James seems to note) the problem is
>>> seen in JDK 8 but not in JDK 6 (he's using ParNew in both cases)? Also,
>>> isn't the BOT logarithmic, which should allow threads to "quickly"
>>> traverse back to the start of even a large object? Did something change
>>> in the BOT that might have led to what James sees as a regression, or
>>> did the range that he characterizes as a "blind spot" merely move from
>>> one place to another?
>>>
>>> Unfortunately, I have not been keeping up with the changes in this
>>> area and am not working with JDK 6 code much, so can't offer any clues, but
>>> am surprised that this would be a regression. (I would expect the same
>>> performance pathologies in this code to exist in 6 and forward -- has it
>>> been established conclusively that this is indeed a regression?)
>>>
>>> thanks!
>>> -- ramki
>>>
>>> On Fri, May 8, 2015 at 6:40 AM, James Lent <jlent at tivo.com> wrote:
>>>
>>>> The explanation of the problem added by Jesper Wilhelmsson on 2015-05-06
>>>> 08:59 to JDK-8079274 is interesting, but not very satisfying.
>>>>
>>>> 1) The problem affects more than this one specific breakpoint. It
>>>> impacts objects in at least the range of about 480M to 512M. The comment
>>>> "A larger object probably triggers some higher level code that are able to
>>>> skip scanning of this part of the memory" does not inspire confidence. It
>>>> makes me even more concerned that we do not understand this issue and that
>>>> there could be other triggers. It seems like "the higher level code" has a
>>>> 32M blind spot. Further, I can create scenarios where the GC code takes
>>>> progressively longer to clear the issue. Presumably it completes the lower
>>>> block scan during each GC cycle. Why does it take longer and longer each
>>>> time? (My guess is it has something to do with the number of objects moved
>>>> to survivor space.) If I cache 1 in 100 instead of 1 in 10, then there are
>>>> 11 progressively slower ParNew GCs (after several fast ones) and the issue
>>>> takes minutes to clear:
>>>>
>>>> $ /usr/lib/jvm/java-7-oracle/bin/java -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmx4G -Xms4G LongParNewPause 536870384 100 100
>>>> 0.308: [GC0.308: [ParNew: 272640K->3029K(306688K), 0.0113180 secs] 796927K->527316K(4160256K), 0.0114040 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
>>>> 0.367: [GC0.367: [ParNew: 275669K->7766K(306688K), 0.0140840 secs] 799956K->532053K(4160256K), 0.0141630 secs] [Times: user=0.04 sys=0.01, real=0.02 secs]
>>>> 0.430: [GC0.430: [ParNew: 280406K->10057K(306688K), 0.0120810 secs] 804693K->534344K(4160256K), 0.0121690 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
>>>> 0.490: [GC0.490: [ParNew: 282697K->15383K(306688K), 0.0128540 secs] 806984K->539670K(4160256K), 0.0129390 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
>>>> 0.551: [GC0.551: [ParNew: 288023K->21006K(306688K), 0.0146480 secs] 812310K->545294K(4160256K), 0.0147650 secs] [Times: user=0.05 sys=0.01, real=0.01 secs]
>>>> 0.613: [GC0.613: [ParNew: 293646K->26805K(306688K), 0.0141250 secs] 817934K->551092K(4160256K), 0.0142050 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
>>>> 0.675: [GC0.675: [ParNew: 299445K->24258K(306688K), 0.4816840 secs] 823732K->551436K(4160256K), 0.4817620 secs] [Times: user=1.90 sys=0.00, real=0.48 secs]
>>>> 1.205: [GC1.205: [ParNew: 296898K->17914K(306688K), 3.4216550 secs] 824076K->547788K(4160256K), 3.4217180 secs] [Times: user=13.50 sys=0.00, real=3.42 secs]
>>>> 4.674: [GC4.674: [ParNew: 290554K->17433K(306688K), 6.3406820 secs] 820428K->550031K(4160256K), 6.3407410 secs] [Times: user=25.00 sys=0.02, real=6.34 secs]
>>>> 11.062: [GC11.062: [ParNew: 290073K->17315K(306688K), 10.9493130 secs] 822671K->552636K(4160256K), 10.9493890 secs] [Times: user=42.98 sys=0.03, real=10.95 secs]
>>>> 22.059: [GC22.059: [ParNew: 289955K->17758K(306688K), 16.0539650 secs] 825276K->555798K(4160256K), 16.0540250 secs] [Times: user=62.44 sys=0.09, real=16.05 secs]
>>>> 38.161: [GC38.161: [ParNew: 290398K->17745K(306688K), 20.6900820 secs] 828438K->558512K(4160256K), 20.6901420 secs] [Times: user=80.79 sys=0.08, real=20.69 secs]
>>>> 58.898: [GC58.898: [ParNew: 290385K->21826K(306688K), 24.6781700 secs] 831152K->565310K(4160256K), 24.6782710 secs] [Times: user=95.58 sys=0.05, real=24.68 secs]
>>>> 83.625: [GC83.625: [ParNew: 294466K->21834K(306688K), 30.3501800 secs] 837950K->568044K(4160256K), 30.3502610 secs] [Times: user=115.80 sys=0.11, real=30.35 secs]
>>>> 114.022: [GC114.022: [ParNew: 294474K->21836K(306688K), 33.6648800 secs] 840684K->570772K(4160256K), 33.6649390 secs] [Times: user=127.38 sys=0.14, real=33.67 secs]
>>>> 147.736: [GC147.736: [ParNew: 294476K->18454K(306688K), 38.9828430 secs] 843412K->570117K(4160256K), 38.9829040 secs] [Times: user=149.14 sys=0.15, real=38.99 secs]
>>>> 186.766: [GC186.766: [ParNew: 291094K->26602K(306688K), 39.8726730 secs] 842757K->580977K(4160256K), 39.8727330 secs] [Times: user=156.35 sys=0.11, real=39.88 secs]
>>>> 226.686: [GC226.686: [ParNew: 299242K->26612K(306688K), 0.0174470 secs] 853617K->583714K(4160256K), 0.0175300 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]
>>>>
>>>> 2) The suggestion to use only 1 thread is not acceptable to me. Use
>>>> parallel GC with one thread? Further, it did not work for me (in fact it
>>>> made things worse). "user" time now equals "real" time (and the new gen
>>>> is 1/4 as big), so I think the parameter took effect:
>>>>
>>>> $ /usr/lib/jvm/java-8-oracle/bin/java -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmx4G -Xms4G -XX:ParallelGCThreads=1 LongParNewPause 536870384 100 10
>>>> 0.226: [GC (Allocation Failure) 0.226: [ParNew: 68160K->6980K(76672K), 0.0489756 secs] 592447K->531267K(4185792K), 0.0490718 secs] [Times: user=0.05 sys=0.00, real=0.05 secs]
>>>> 0.291: [GC (Allocation Failure) 0.291: [ParNew: 75140K->8512K(76672K), 29.8707281 secs] 599427K->539479K(4185792K), 29.8707954 secs] [Times: user=29.80 sys=0.00, real=29.88 secs]
>>>> 30.180: [GC (Allocation Failure) 30.180: [ParNew: 76672K->8093K(76672K), 30.6357699 secs] 607639K->545772K(4185792K), 30.6358257 secs] [Times: user=30.57 sys=0.00, real=30.63 secs]
>>>> 60.837: [GC (Allocation Failure) 60.837: [ParNew: 76253K->6830K(76672K), 69.2115255 secs] 613932K->552169K(4185792K), 69.2115783 secs] [Times: user=69.05 sys=0.01, real=69.22 secs]
>>>> 130.068: [GC (Allocation Failure) 130.068: [ParNew: 74990K->8226K(76672K), 132.5118903 secs] 620329K->560381K(4185792K), 132.5119459 secs] [Times: user=132.21 sys=0.00, real=132.52 secs]
>>>> 262.603: [GC (Allocation Failure) 262.603: [ParNew: 76386K->8512K(76672K), 0.0400138 secs] 628541K->568748K(4185792K), 0.0400687 secs] [Times: user=0.04 sys=0.00, real=0.04 secs]
>>>>
>>>> 3) Applications should not be expected to tune the GC settings to
>>>> account for the size of one static object. The connection between the
>>>> size of this object and the GC performance was not easy to find in the
>>>> first place. Further, the parameter ParGCCardsPerStrideChunk is pretty
>>>> low level and not one I think many application developers are familiar
>>>> with (an example of actually enabling it follows the error output below).
>>>>
>>>> Error: VM option 'ParGCCardsPerStrideChunk' is diagnostic and must be
>>>> enabled via -XX:+UnlockDiagnosticVMOptions.
>>>> Error: Could not create the Java Virtual Machine.
>>>> Error: A fatal exception has occurred. Program will exit.
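>>>>
>>>> For completeness, the suggested workaround would also need the diagnostic
>>>> unlock, i.e. something like the command below (the stride value here is
>>>> only an example, not a recommendation):
>>>>
>>>> $ java -XX:+UnlockDiagnosticVMOptions -XX:ParGCCardsPerStrideChunk=32768
>>>> -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmx4G -Xms4G LongParNewPause
>>>> 536870384 100 10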
>>>>
>>>> 4) I cannot reproduce this issue in Oracle 6. This appears to be a
>>>> regression. How can fixing a regression be called an enhancement request?
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
--
Jungwoo Ha | Java Platform Team | jwha at google.com