RFR (trivial): 8214217: [TESTBUG] runtime/appcds/LotsOfClasses.java failed on solaris sparcv9

Jiangli Zhou jiangli.zhou at oracle.com
Tue Nov 27 21:48:21 UTC 2018


Thanks, Calvin. I'll push after Ioi also confirms, so we can clear it in 
tier3 testing.

Thanks,

Jiangli


On 11/27/18 1:30 PM, Calvin Cheung wrote:
> Looks okay to me.
>
> thanks,
> Calvin
>
> On 11/27/18, 11:34 AM, Jiangli Zhou wrote:
>> Ioi and I had further discussions. Here is the updated webrev with 
>> the error message also including the current InitialHeapSize setting:
>>
>>   http://cr.openjdk.java.net/~jiangli/8214217/webrev.02/
>>
>> I filed an RFE, https://bugs.openjdk.java.net/browse/JDK-8214388, for
>> improving the fragmentation handling.
>>
>> Thanks,
>> Jiangli
>>
>> On 11/26/18 6:58 PM, Jiangli Zhou wrote:
>>> Hi Ioi,
>>>
>>> Here is the updated webrev with improved object archiving error 
>>> message and modified test fix. Please let me know if you have other 
>>> suggestions.
>>>
>>> http://cr.openjdk.java.net/~jiangli/8214217/webrev.01/
>>>
>>>
>>> On 11/26/18 5:49 PM, Ioi Lam wrote:
>>>> I still don’t understand why it’s necessary to have a 3gb heap to 
>>>> archive 32mb of objects. I don’t even know if this is guaranteed to 
>>>> work.
>>>
>>> I probably was not clear in my earlier reply. I think using a
>>> lower heap size instead of the 3G setting in the test is okay. 256M
>>> is probably not large enough (in case the allocation size changes in
>>> the future), so I changed it to 500M. Please let me know if you also
>>> think that's reasonable.
>>>
>>>>
>>>> You said “having a large enough” heap will guarantee free space. 
>>>> How large is large enough?
>>>
>>> Please see above.
>>>>
>>>> We are dumping the default archive with a 128mb heap. Is that large 
>>>> enough? What's the criterion for deciding that it's large enough?
>>>
>>> The default archive is created using the default class list, which 
>>> loads about 1000 classes. When generating the default archive, we 
>>> explicitly set the java heap size to 128M instead of relying on 
>>> ergonomics. With the 128M java heap for generating the default 
>>> archive, we have never run into the fragmentation issue. Different
>>> java heap sizes should be used to meet different usage requirements.
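>>>
>>> For illustration (the class list, archive and jar names below are just
>>> placeholders), a user dumping their own archive can pin the dump-time
>>> heap size explicitly instead of relying on ergonomics, e.g.:
>>>
>>>   java -Xmx500m -Xshare:dump \
>>>        -XX:SharedClassListFile=app.lst \
>>>        -XX:SharedArchiveFile=app.jsa \
>>>        -cp app.jar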
>>>
>>>>
>>>> How should users set their heap size to guarantee success in 
>>>> dumping their own archives? This test case shows that you can get 
>>>> random failures when dumping a large number of classes, so we need to 
>>>> prevent that from happening for our users.
>>>
>>> The behavior is not random. If users run into the fragmentation
>>> error, they can try using a larger java heap.
>>>
>>>>
>>>> Printing a more elaborate error message is not enough. If the 
>>>> error is random, it may not happen during regular testing by the 
>>>> users, and only happens in deployment.
>>>
>>> Could you please explain why you think it is random?
>>>>
>>>> Silently ignoring the error and continuing to dump without an archived 
>>>> heap is also suboptimal. The user may randomly lose the benefit of a 
>>>> feature without even knowing it.
>>>
>>> Please let me know your suggestion.
>>>>
>>>> And you didn’t answer my question whether the problem is worse on 
>>>> Solaris than Linux.
>>>
>>> On Solaris, I can also force it to fail with the fragmentation error
>>> with a 200M java heap.
>>>
>>> Without seeing the actual gc region logging from the failed run, which
>>> didn't set the java heap size explicitly, my best guess is that the
>>> workload is different and causes Solaris to appear worse. That's why I
>>> think it is a test bug not to set the heap size explicitly in this
>>> case.
>>>
>>> Thanks,
>>> Jiangli
>>>>
>>>> Thanks
>>>> Ioi
>>>>
>>>> On Nov 26, 2018, at 5:28 PM, Jiangli Zhou <jiangli.zhou at oracle.com> wrote:
>>>>
>>>>> Hi Ioi,
>>>>>
>>>>>
>>>>> On 11/26/18 4:42 PM, Ioi Lam wrote:
>>>>>> The purpose of the stress test is not to tweak the parameters so 
>>>>>> that the test will pass. It’s to understand what the 
>>>>>> limitations of our system are and why they exist.
>>>>>
>>>>> Totally agree with the above.
>>>>>> As I mentioned in the bug report, why would we run into 
>>>>>> fragmentation when we have 96mb free space and we need only 32mb? 
>>>>>> That’s the question we need to answer, not “let’s just give a 
>>>>>> huge amount of heap”.
>>>>>
>>>>> During object archiving, we allocate from the highest free
>>>>> regions. The allocated regions must be *consecutive* regions.
>>>>> Those were design decisions made in the early days when I worked
>>>>> with Thomas and others in the GC team on object archiving support.
>>>>>
>>>>> The determining factor is not the total free space in the heap; it
>>>>> is the number of consecutive free regions available (starting from
>>>>> the highest free one) for archiving. GC activities might cause
>>>>> some regions at higher addresses to be used. As we start from the
>>>>> highest free region, if we run into an already-used region during
>>>>> allocation for archiving, we need to bail out.
>>>>>
>>>>> rn:      Free Region
>>>>> r(n-1):  Used Region
>>>>> r(n-2):  Free Region
>>>>>          ...
>>>>>          Free Region
>>>>>          Used Region
>>>>>          ...
>>>>> r0:      Used Region
>>>>>
>>>>> For example, if we want 3 regions during archiving, we allocate
>>>>> starting from rn. Since r(n-1) is already used, we can't use it
>>>>> for archiving. Certainly, the design could be improved. One
>>>>> approach that I've discussed with Thomas already is to use a
>>>>> temporary buffer instead of allocating from the heap directly.
>>>>> References need to be adjusted during copying. With that, we can
>>>>> lift the consecutive-region requirement. Since object archiving is
>>>>> only supported for static archiving, and with a large enough java
>>>>> heap it is guaranteed to successfully allocate the top free
>>>>> regions, changing the current design is not a high-priority task.
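>>>>>
>>>>> A minimal sketch of that allocation rule (hypothetical names, not
>>>>> the actual HotSpot code), assuming a simple per-region "used" flag:
>>>>>
>>>>>   class ArchiveRegionSketch {
>>>>>       // Pick 'needed' consecutive free regions, starting from the
>>>>>       // highest free region and walking downward. Returns null if
>>>>>       // a used region is hit first, i.e. the heap is too
>>>>>       // fragmented and object archiving has to bail out.
>>>>>       static int[] pickArchiveRegions(boolean[] used, int needed) {
>>>>>           int top = used.length - 1;
>>>>>           while (top >= 0 && used[top]) {
>>>>>               top--;                 // find the highest free region
>>>>>           }
>>>>>           int[] picked = new int[needed];
>>>>>           for (int i = 0; i < needed; i++) {
>>>>>               int r = top - i;
>>>>>               if (r < 0 || used[r]) {
>>>>>                   return null;       // fragmentation: bail out
>>>>>               }
>>>>>               picked[i] = r;         // claim region r for the archive
>>>>>           }
>>>>>           return picked;
>>>>>       }
>>>>>   }
>>>>>
>>>>> With more free space at the top of the heap (i.e. a larger heap at
>>>>> dump time), the chance of hitting a used region before 'needed'
>>>>> regions are claimed goes down, which is why a large enough heap
>>>>> makes the bail-out unlikely.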
>>>>>> If at the end, the conclusion is that we need to have 8x the heap 
>>>>>> size of the archived object size (256mb vs 32mb), and we 
>>>>>> understand the reason why, that’s fine. But I think we should go 
>>>>>> through that analysis process first. In doing so we may be able 
>>>>>> to improve GC to make fragmentation less likely.
>>>>>
>>>>> I think the situation is well understood. Please let me know if
>>>>> you have any additional questions; I'll try to add more information.
>>>>>
>>>>> Thanks,
>>>>> Jiangli
>>>>>
>>>>>> Also, do we know if Linux and Solaris have the exact same failure 
>>>>>> mode? Or will Solaris fail more frequently than Linux with the 
>>>>>> same heap size?
>>>>>>
>>>>>> Thanks
>>>>>> Ioi
>>>>>>
>>>>>>
>>>>>>> On Nov 26, 2018, at 3:55 PM, Jiangli Zhou <jiangli.zhou at oracle.com> wrote:
>>>>>>>
>>>>>>> Hi Ioi,
>>>>>>>
>>>>>>>> On 11/26/18 3:35 PM, Ioi Lam wrote:
>>>>>>>>
>>>>>>>> As I commented on the bug report, we should improve the error 
>>>>>>>> message. Also, maybe we can force GC to allow the test to run 
>>>>>>>> with less heap.
>>>>>>> Updating the error message sounds good to me.
>>>>>>>> A 3GB heap seems excessive. I was able to run the test with 
>>>>>>>> -Xmx256M on Linux.
>>>>>>> Using a small heap (with only a little extra space) might still
>>>>>>> run into the issue in the future. As I pointed out, alignment
>>>>>>> and GC activities are also factors. The allocation size might
>>>>>>> also change in the future.
>>>>>>>
>>>>>>> An alternative approach is to fix the test to recognize the
>>>>>>> fragmentation issue and not report a failure in that case. I'm
>>>>>>> now in favor of that approach since it's more flexible. We can
>>>>>>> also set a smaller heap size (such as 256M) in the test safely.
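>>>>>>>
>>>>>>> Roughly what I have in mind (only a sketch, using
>>>>>>> jdk.test.lib.process.ProcessTools / OutputAnalyzer; the matched
>>>>>>> message text is a placeholder rather than the exact wording, and
>>>>>>> 'classList' stands in for the path of the list the test generates):
>>>>>>>
>>>>>>>   ProcessBuilder pb = ProcessTools.createJavaProcessBuilder(
>>>>>>>       "-Xmx256m", "-Xshare:dump",
>>>>>>>       "-XX:SharedClassListFile=" + classList);
>>>>>>>   OutputAnalyzer out = new OutputAnalyzer(pb.start());
>>>>>>>   if (out.getExitValue() != 0
>>>>>>>       && out.getOutput().contains("fragmentation")) {
>>>>>>>       // Dumping failed only because the heap was too fragmented
>>>>>>>       // for object archiving; don't report it as a test failure.
>>>>>>>       System.out.println("Fragmentation hit, skipping result check");
>>>>>>>       return;
>>>>>>>   }
>>>>>>>   out.shouldHaveExitValue(0);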
>>>>>>>> Also, I don't understand what you mean by "all observed 
>>>>>>>> allocations were done in the lower 2G range". Why would heap 
>>>>>>>> fragmentation be related to the location of the heap?
>>>>>>> In my test run, only the heap regions in the lower 2G heap range 
>>>>>>> were used for object allocations. It's not related to the heap 
>>>>>>> location.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jiangli
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> - Ioi
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 11/26/18 3:23 PM, Jiangli Zhou wrote:
>>>>>>>>> Hi Ioi,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 11/26/18 2:00 PM, Ioi Lam wrote:
>>>>>>>>>> Hi Jiangli,
>>>>>>>>>>
>>>>>>>>>> -Xms3G will most likely fail on 32-bit platforms.
>>>>>>>>> We can make the change for 64-bit platforms only, since it's a
>>>>>>>>> 64-bit-only problem. We do not archive java objects on
>>>>>>>>> 32-bit platforms.
>>>>>>>>>> BTW, why would this test fail only on Solaris and not linux? 
>>>>>>>>>> The test doesn't specify heap size, so the initial heap size 
>>>>>>>>>> setting is picked by Ergonomics. Can you reproduce the 
>>>>>>>>>> failure on Linux by using the same heap size settings used by 
>>>>>>>>>> the failed Solaris runs?
>>>>>>>>> The failed Solaris run didn't set heap size explicitly. The 
>>>>>>>>> heap size was determined by GC ergonomics, as you pointed out 
>>>>>>>>> above. I ran the test this morning on the same solaris sparc 
>>>>>>>>> machine, using the same binary from the reported failure. 
>>>>>>>>> In my test run, a very large heap (>26G) was used according to 
>>>>>>>>> the gc region logging output. So the test didn't run into the 
>>>>>>>>> heap fragmentation issue. All observed allocations were done 
>>>>>>>>> in the lower 2G range.
>>>>>>>>>
>>>>>>>>> I don't think it is a Solaris-only issue. If the heap size is
>>>>>>>>> small enough, you could run into the issue on any supported
>>>>>>>>> platform. The issue could appear to be intermittent, due to
>>>>>>>>> alignment and GC activities, even with the same heap size with
>>>>>>>>> which the failure was reported.
>>>>>>>>>
>>>>>>>>> On a linux x64 machine, I can force the test to fail with the
>>>>>>>>> fragmentation error with a 200M java heap.
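>>>>>>>>>
>>>>>>>>> That is, roughly the same dump the test performs, just with the
>>>>>>>>> heap capped at 200M (file names below are placeholders):
>>>>>>>>>
>>>>>>>>>   java -Xmx200m -Xshare:dump \
>>>>>>>>>        -XX:SharedClassListFile=lots-of-classes.lst \
>>>>>>>>>        -XX:SharedArchiveFile=lots.jsa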
>>>>>>>>>> I think it's better to find out the root cause than just to 
>>>>>>>>>> mask it. The purpose of LotsOfClasses.java is to stress the 
>>>>>>>>>> system to find out potential bugs.
>>>>>>>>> I think this is a test issue, but not a CDS/GC issue. The test 
>>>>>>>>> loads >20000 classes, but doesn't set the java heap size. Relying 
>>>>>>>>> on GC ergonomics to determine the 'right' heap size is 
>>>>>>>>> incorrect in this case since dumping objects requires 
>>>>>>>>> consecutive gc regions. Specifying the GC heap size explicitly 
>>>>>>>>> doesn't 'mask' the issue, but is the right thing to do. :)
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jiangli
>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> - Ioi
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 11/26/18 1:41 PM, Jiangli Zhou wrote:
>>>>>>>>>>> Please review the following test fix, which sets the java 
>>>>>>>>>>> heap size to 3G for dumping with a large number of classes.
>>>>>>>>>>>
>>>>>>>>>>> webrev: http://cr.openjdk.java.net/~jiangli/8214217/webrev.00/
>>>>>>>>>>>
>>>>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8214217
>>>>>>>>>>>
>>>>>>>>>>> Tested with tier1 and tier3. Also ran the test 100 times on 
>>>>>>>>>>> solaris-sparcv9 via mach5.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Jiangli
>>>>>>>>>>>
>>>>>
>>>
>>


