RFR (trivial): 8214217: [TESTBUG] runtime/appcds/LotsOfClasses.java failed on solaris sparcv9
Jiangli Zhou
jiangli.zhou at oracle.com
Tue Nov 27 19:34:25 UTC 2018
Ioi and I had further discussions. Here is the updated webrev with the
error message also including the current InitialHeapSize setting:
http://cr.openjdk.java.net/~jiangli/8214217/webrev.02/
I filed an RFE, https://bugs.openjdk.java.net/browse/JDK-8214388, for
improving the fragmentation handling.
Thanks,
Jiangli
On 11/26/18 6:58 PM, Jiangli Zhou wrote:
> Hi Ioi,
>
> Here is the updated webrev with an improved object archiving error
> message and a modified test fix. Please let me know if you have other
> suggestions.
>
> http://cr.openjdk.java.net/~jiangli/8214217/webrev.01/
>
>
> On 11/26/18 5:49 PM, Ioi Lam wrote:
>> I still don’t understand why it’s necessary to have a 3gb heap to
>> archive 32mb of objects. I don’t even know if this is guaranteed to
>> work.
>
> I probably was not clear in my earlier reply. I think using a lower
> heap size instead of the 3G setting in the test is okay. 256M is
> probably not large enough (in case the allocation size changes in the
> future), so I changed it to 500M. Please let me know if you also think
> that's reasonable.
>
>>
>> You said “having a large enough” heap will guarantee free space. How
>> large is large enough?
>
> Please see above.
>>
>> We are dumping the default archive with a 128mb heap. Is that large
>> enough? What's the criterion for deciding that it's large enough?
>
> The default archive is created using the default class list, which
> loads about 1000 classes. When generating the default archive, we
> explicitly set the java heap size to 128M instead of relying on
> ergonomics. With the 128M java heap for generating the default
> archive, we have never run into the fragmentation issue. Different
> java heap sizes should be used to meet different usage requirements.
>
>>
>> How should users set their heap size to guarantee success in dumping
>> their own archives? This test case shows that you can get random
>> failures when dumping a large number of classes, so we need to prevent
>> that from happening for our users.
>
> The behavior is not random. If users run into the fragmentation error,
> they can try using a larger java heap.
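>
> For example, a user dumping their own application archive could rerun
> the dump with an explicitly larger heap (the class list, archive name,
> and heap size below are just placeholders):
>
>     java -Xmx1G -Xshare:dump \
>         -XX:SharedClassListFile=app.classlist \
>         -XX:SharedArchiveFile=app.jsa \
>         -cp app.jar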
>
>>
>> Printing a more elaborate error message is not enough. If the error
>> is random, it may not happen during regular testing by the users and
>> only show up in deployment.
>
> Could you please explain why you think it is random?
>>
>> Silently ignoring the error and continuing to dump without the
>> archived heap is also suboptimal. The user may randomly lose the
>> benefit of a feature without even knowing it.
>
> Please let me know your suggestion.
>>
>> And you didn’t answer my question whether the problem is worse on
>> Solaris than Linux.
>
> On Solaris, I can also force it to fail with the fragmentation error
> with a 200M java heap.
>
> Without seeing the actual gc region logging from the failed run that
> didn't set the java heap size explicitly, my best guess is that the
> workload is different and causes Solaris to appear worse. That's why I
> think it is a test bug not to set the heap size explicitly in this
> case.
>
> Thanks,
> Jiangli
>>
>> Thanks
>> Ioi
>>
>> On Nov 26, 2018, at 5:28 PM, Jiangli Zhou <jiangli.zhou at oracle.com> wrote:
>>
>>> Hi Ioi,
>>>
>>>
>>> On 11/26/18 4:42 PM, Ioi Lam wrote:
>>>> The purpose of the stress test is not to tweak the parameters so
>>>> that the test will pass. It’s to understand what the limitations of
>>>> our system are and why they exist.
>>>
>>> Totally agree with the above.
>>>> As I mentioned in the bug report, why would we run into
>>>> fragmentation when we have 96mb of free space and we need only 32mb?
>>>> That’s the question that we need to answer, not “let’s just give a
>>>> huge amount of heap”.
>>>
>>> During object archiving, we allocate from the highest free regions.
>>> The allocated regions must be *consecutive* regions. Those were the
>>> design decisions made in the early days when I worked with Thomas and
>>> others in the GC team on object archiving support.
>>>
>>> The determining factor is not the total free space in the heap; it is
>>> the number of consecutive free regions available (starting from the
>>> highest free one) for archiving. GC activity might cause some regions
>>> at higher addresses to be used. As we start from the highest free
>>> region, if we run into an already used region during allocation for
>>> archiving, we need to bail out.
>>>
>>> rn: Free Region
>>> r(n-1): Used Region
>>> r(n-2): Free Region
>>> ...
>>> Free Region
>>> Used Region
>>> ...
>>> r0: Used Region
>>>
>>> For example, if we want 3 regions during archiving, we allocate
>>> starting from rn. Since r(n-1) is already used, we can't use it for
>>> archiving. Certainly, the design could be improved. One approach
>>> that I've already discussed with Thomas is to use a temporary buffer
>>> instead of allocating from the heap directly. References need to be
>>> adjusted during copying. With that, we can lift the consecutive-region
>>> requirement. Since object archiving is only supported for static
>>> archiving, and with a large enough java heap it is guaranteed to
>>> successfully allocate the top free regions, changing the current
>>> design is not a high-priority task.
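>>>
>>> To make the bail-out condition concrete, here is a minimal sketch in
>>> plain Java (not the actual HotSpot/G1 code; the region bitmap and the
>>> method name are made up for illustration):
>>>
>>>     // free[i] == true means heap region i is free; the last index is
>>>     // the highest-addressed region. Returns the lowest index of the
>>>     // block of 'needed' consecutive free regions at the top, or -1 if
>>>     // a used region gets in the way (the bail-out case).
>>>     static int pickArchiveRegions(boolean[] free, int needed) {
>>>         int top = free.length - 1;
>>>         while (top >= 0 && !free[top]) top--;  // highest free region
>>>         for (int i = 0; i < needed; i++) {
>>>             int r = top - i;
>>>             if (r < 0 || !free[r]) {
>>>                 return -1;  // fragmentation: bail out of heap archiving
>>>             }
>>>         }
>>>         return top - needed + 1;
>>>     }
>>>
>>> With the layout above (rn free but r(n-1) used), a request for 3
>>> regions fails immediately, regardless of how much free space exists
>>> lower in the heap.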
>>>> If, in the end, the conclusion is that we need a heap 8x the size
>>>> of the archived objects (256mb vs 32mb), and we understand the
>>>> reason why, that’s fine. But I think we should go through that
>>>> analysis process first. In doing so we may be able to improve GC to
>>>> make fragmentation less likely.
>>>
>>> I think the situation is well understood. Please let me know if you
>>> have any additional questions; I'll try to add more information.
>>>
>>> Thanks,
>>> Jiangli
>>>
>>>> Also, do we know if Linux and Solaris have the exact same failure
>>>> mode? Or will Solaris fail more frequently than Linux with the same
>>>> heap size?
>>>>
>>>> Thanks
>>>> Ioi
>>>>
>>>>
>>>>> On Nov 26, 2018, at 3:55 PM, Jiangli Zhou <jiangli.zhou at oracle.com> wrote:
>>>>>
>>>>> Hi Ioi,
>>>>>
>>>>>> On 11/26/18 3:35 PM, Ioi Lam wrote:
>>>>>>
>>>>>> As I commented on the bug report, we should improve the error
>>>>>> message. Also, maybe we can force GC to allow the test to run
>>>>>> with less heap.
>>>>> Updating the error message sounds good to me.
>>>>>> A 3GB heap seems excessive. I was able to run the test with
>>>>>> -Xmx256M on Linux.
>>>>> Using a small heap (with only a little extra space) might still run
>>>>> into the issue in the future. As I pointed out, alignment and GC
>>>>> activity are also factors. The allocation size might also change in
>>>>> the future.
>>>>>
>>>>> An alternative approach is to fix the test to recognize the
>>>>> fragmentation issue and not report a failure in that case. I'm now
>>>>> in favor of that approach since it's more flexible. We can also
>>>>> safely set a smaller heap size (such as 256M) in the test.
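>>>>>
>>>>> Roughly, the dump check in the test could tolerate that one failure
>>>>> mode. A sketch using the test library's OutputAnalyzer (the message
>>>>> text below is only a placeholder, not the VM's actual wording):
>>>>>
>>>>>     import jdk.test.lib.process.OutputAnalyzer;
>>>>>
>>>>>     static void checkDump(OutputAnalyzer out) {
>>>>>         // Placeholder text; match whatever the improved error
>>>>>         // message actually says.
>>>>>         if (out.getOutput().contains("GC region fragmentation")) {
>>>>>             System.out.println("Dump hit heap fragmentation; tolerated.");
>>>>>             return;                  // not treated as a test failure
>>>>>         }
>>>>>         out.shouldHaveExitValue(0);  // otherwise require a clean dump
>>>>>     }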
>>>>>> Also, I don't understand what you mean by "all observed
>>>>>> allocations were done in the lower 2G range.". Why would heap
>>>>>> fragmentation be related to the location of the heap?
>>>>> In my test run, only the heap regions in the lower 2G heap range
>>>>> were used for object allocations. It's not related to the heap
>>>>> location.
>>>>>
>>>>> Thanks,
>>>>> Jiangli
>>>>>> Thanks
>>>>>>
>>>>>> - Ioi
>>>>>>
>>>>>>
>>>>>>> On 11/26/18 3:23 PM, Jiangli Zhou wrote:
>>>>>>> Hi Ioi,
>>>>>>>
>>>>>>>
>>>>>>>> On 11/26/18 2:00 PM, Ioi Lam wrote:
>>>>>>>> Hi Jiangli,
>>>>>>>>
>>>>>>>> -Xms3G will most likely fail on 32-bit platforms.
>>>>>>> We can make the change for 64-bit platforms only, since it's a
>>>>>>> 64-bit-only problem. We do not archive java objects on 32-bit
>>>>>>> platforms.
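>>>>>>>
>>>>>>> For instance, the heap-size option could be added only on 64-bit
>>>>>>> VMs (a sketch; the helper name and option list are illustrative,
>>>>>>> jdk.test.lib.Platform is the standard test-library check):
>>>>>>>
>>>>>>>     import java.util.ArrayList;
>>>>>>>     import java.util.List;
>>>>>>>     import jdk.test.lib.Platform;
>>>>>>>
>>>>>>>     static List<String> extraDumpOptions() {
>>>>>>>         List<String> opts = new ArrayList<>();
>>>>>>>         if (Platform.is64bit()) {
>>>>>>>             // Only 64-bit VMs archive java objects at dump time.
>>>>>>>             opts.add("-Xms3G");
>>>>>>>         }
>>>>>>>         return opts;
>>>>>>>     }
>>>>>>>
>>>>>>> (A 64-bit-only test could also be expressed with jtreg's
>>>>>>> "@requires vm.bits == 64" tag.)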
>>>>>>>> BTW, why would this test fail only on Solaris and not Linux?
>>>>>>>> The test doesn't specify heap size, so the initial heap size
>>>>>>>> setting is picked by Ergonomics. Can you reproduce the failure
>>>>>>>> on Linux by using the same heap size settings used by the
>>>>>>>> failed Solaris runs?
>>>>>>> The failed Solaris run didn't set the heap size explicitly; the
>>>>>>> heap size was determined by GC ergonomics, as you pointed out above.
>>>>>>> I ran the test this morning on the same Solaris SPARC machine
>>>>>>> using the same binary from the reported failure. In my test run,
>>>>>>> a very large heap (>26G) was used according to the gc region
>>>>>>> logging output, so the test didn't run into the heap fragmentation
>>>>>>> issue. All observed allocations were done in the lower 2G range.
>>>>>>>
>>>>>>> I don't think it is a Solaris-only issue. If the heap size is
>>>>>>> small enough, you can run into the issue on all supported
>>>>>>> platforms. The issue can appear intermittent due to alignment and
>>>>>>> GC activity, even with the same heap size with which the failure
>>>>>>> was reported.
>>>>>>>
>>>>>>> On a Linux x64 machine, I can force the test to fail with the
>>>>>>> fragmentation error with a 200M java heap.
>>>>>>>> I think it's better to find out the root cause than just to
>>>>>>>> mask it. The purpose of LotsOfClasses.java is to stress the
>>>>>>>> system to find out potential bugs.
>>>>>>> I think this is a test issue, not a CDS/GC issue. The test
>>>>>>> loads >20000 classes but doesn't set the java heap size. Relying
>>>>>>> on GC ergonomics to determine the 'right' heap size is incorrect
>>>>>>> in this case, since dumping objects requires consecutive gc
>>>>>>> regions. Specifying the heap size explicitly doesn't 'mask' the
>>>>>>> issue; it is the right thing to do. :)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jiangli
>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> - Ioi
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 11/26/18 1:41 PM, Jiangli Zhou wrote:
>>>>>>>>> Please review the following test fix, which sets the java heap
>>>>>>>>> size to 3G for dumping with a large number of classes.
>>>>>>>>>
>>>>>>>>> webrev: http://cr.openjdk.java.net/~jiangli/8214217/webrev.00/
>>>>>>>>>
>>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8214217
>>>>>>>>>
>>>>>>>>> Tested with tier1 and tier3. Also ran the test 100 times on
>>>>>>>>> solaris-sparcv9 via mach5.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Jiangli
>>>>>>>>>
>>>
>