RFR (trivial): 8214217: [TESTBUG] runtime/appcds/LotsOfClasses.java failed on solaris sparcv9

Jiangli Zhou jiangli.zhou at oracle.com
Tue Nov 27 22:34:04 UTC 2018


Hi Ioi,


On 11/27/18 1:57 PM, Ioi Lam wrote:
> Hi Jiangli,
>
> This patch is OK, but that's just for the purpose of removing the 
> noise in our testing. That way, we can keep running this stress test 
> for other areas (such as class loading and metadata relocation) while 
> JDK-8214388 is being fixed.
Thanks for the confirmation. Will push now.
>
> I don't think JDK-8214388 is just a minor RFE. It's a bug that affects
> reliability and scalability. I like your suggested fixes. I think we
> should try to implement and test them as soon as possible. I
> understand the risk of rushing the fix into JDK 12, but we can
> implement it now and push to jdk/jdk after the JDK 12 branch has been
> taken. If all testing goes well, I think there will be enough time to
> backport it to JDK 12 before the JDK 12 initial release.
>
> Once JDK-8214388 is fixed, all the changes in this patch can be reverted.

No objection to looking into the enhancement/fix now. Let's investigate
the usage requirements and understand all the edge cases and limitations
well, and then decide on the right solution. Since it touches a core
piece of object archiving, caution should be taken when changing that
area. I would also like to see how common it is to dump a large number
of classes (>20000 or 30000) and what the typical Java heap usage is for
those types of large applications. Understanding that will help us make
the right decision on how to enhance the system.

As I also raised in the offline discussion, for users who want to do
static archiving with a customized class list, using the runtime GC
settings at dump time is recommended, since the system is most optimal
in that case. When a different Java heap size is chosen at dump time,
runtime relocation might be needed and the sharing benefit is lost for
archived objects in the closed archive regions. Even with the second
approach I proposed in the new RFE, not all edge cases would be covered.
If the dump-time heap is not large enough and an OOM occurs during class
loading, we need to abort dumping in that case. Flexibility is
desirable, but we should also call out the limitations.
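
For illustration only (the flags below are real, but the class list
name, archive name, and 500M value are placeholders rather than anything
from the webrev), dumping with an explicitly chosen heap size would look
roughly like:

   java -Xshare:dump -Xmx500m \
        -XX:SharedClassListFile=app.classlist \
        -XX:SharedArchiveFile=app.jsa \
        -cp app.jar

Using the same heap/GC settings again at run time is what avoids the
runtime relocation mentioned above.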

Thanks,
Jiangli

>
> Thanks
>
> - Ioi
>
>
> On 11/27/18 11:34 AM, Jiangli Zhou wrote:
>> Ioi and I had further discussions. Here is the updated webrev with 
>> the error message also including the current InitialHeapSize setting:
>>
>>   http://cr.openjdk.java.net/~jiangli/8214217/webrev.02/
>>
>> I filed an RFE, https://bugs.openjdk.java.net/browse/JDK-8214388, for
>> improving the fragmentation handling.
>>
>> Thanks,
>> Jiangli
>>
>> On 11/26/18 6:58 PM, Jiangli Zhou wrote:
>>> Hi Ioi,
>>>
>>> Here is the updated webrev with an improved object archiving error
>>> message and a modified test fix. Please let me know if you have
>>> other suggestions.
>>>
>>> http://cr.openjdk.java.net/~jiangli/8214217/webrev.01/
>>>
>>>
>>> On 11/26/18 5:49 PM, Ioi Lam wrote:
>>>> I still don’t understand why it’s necessary to have a 3gb heap to 
>>>> archive 32mb of objects. I don’t even know if this is guaranteed to 
>>>> work.
>>>
>>> I probably was not clear in my earlier reply. I think using a lower
>>> heap size instead of the 3G setting in the test is okay. 256M
>>> probably is not large enough (in case the allocation size changes in
>>> the future), so I changed it to 500M. Please let me know if you also
>>> think that's reasonable.
>>>
>>>>
>>>> You said “having a large enough” heap will guarantee free space. 
>>>> How large is large enough?
>>>
>>> Please see above.
>>>>
>>>> We are dumping the default archive with a 128mb heap. Is that large
>>>> enough? What's the criterion for deciding that it's large enough?
>>>
>>> The default archive is created using the default class list, which
>>> loads about 1000 classes. When generating the default archive, we
>>> explicitly set the Java heap size to 128M instead of relying on
>>> ergonomics. With the 128M Java heap for generating the default
>>> archive, we have never run into the fragmentation issue. A different
>>> Java heap size should be used to meet different usage requirements.
>>>
>>>>
>>>> How should users set their heap size to guarantee success in
>>>> dumping their own archives? This test case shows that you can get
>>>> random failures when dumping a large number of classes, so we need
>>>> to prevent that from happening for our users.
>>>
>>> The behavior is not random. If users run into the fragmentation
>>> error, they can try using a larger Java heap.
>>>
>>>>
>>>> Printing a more elaborate error message is not enough. If the
>>>> error is random, it may not happen during regular testing by the
>>>> users, and only happen in deployment.
>>>
>>> Could you please explain why you think it is random?
>>>>
>>>> Silently ignoring the error and continuing to dump without the
>>>> archived heap is also suboptimal. The user may randomly lose the
>>>> benefit of a feature without even knowing it.
>>>
>>> Please let me know your suggestion.
>>>>
>>>> And you didn't answer my question about whether the problem is
>>>> worse on Solaris than on Linux.
>>>
>>> On Solaris, I can also force it to fail with the fragmentation error
>>> with a 200M Java heap.
>>>
>>> Without seeing the actual GC region logging from the failed run,
>>> which didn't set the Java heap size explicitly, my best guess is that
>>> the workload is different and causes Solaris to appear worse. That's
>>> why I think it is a test bug not to set the heap size explicitly in
>>> this case.
>>>
>>> Thanks,
>>> Jiangli
>>>>
>>>> Thanks
>>>> Ioi
>>>>
>>>> On Nov 26, 2018, at 5:28 PM, Jiangli Zhou
>>>> <jiangli.zhou at oracle.com> wrote:
>>>>
>>>>> Hi Ioi,
>>>>>
>>>>>
>>>>> On 11/26/18 4:42 PM, Ioi Lam wrote:
>>>>>> The purpose of the stress test is not to tweak the parameters so
>>>>>> that the test will pass. It's to understand what the limitations
>>>>>> of our system are and why they exist.
>>>>>
>>>>> Totally agree with the above.
>>>>>> As I mentioned in the bug report, why would we run into
>>>>>> fragmentation when we have 96mb of free space and we need only
>>>>>> 32mb? That's the question we need to answer, not "let's just give
>>>>>> a huge amount of heap".
>>>>>
>>>>> During object archiving, we allocate from the highest free
>>>>> regions. The allocated regions must be *consecutive* regions.
>>>>> Those were the design decisions made in the early days when I
>>>>> worked with Thomas and others on the GC team on object archiving
>>>>> support.
>>>>>
>>>>> The determining factor is not the total free space in the heap; it
>>>>> is the number of consecutive free regions available (starting from
>>>>> the highest free one) for archiving. GC activity might cause some
>>>>> regions at higher addresses to be used. Since we start from the
>>>>> highest free region, if we run into an already-used region while
>>>>> allocating for archiving, we need to bail out.
>>>>>
>>>>> rn:      Free Region
>>>>> r(n-1):  Used Region
>>>>> r(n-2):  Free Region
>>>>>          ...
>>>>>          Free Region
>>>>>          Used Region
>>>>>          ...
>>>>> r0:      Used Region
>>>>>
>>>>> For example, if we want 3 regions during archiving, we allocate
>>>>> starting from rn. Since r(n-1) is already used, we can't use it
>>>>> for archiving. Certainly, the design could be improved. One
>>>>> approach that I've already discussed with Thomas is to use a
>>>>> temporary buffer instead of allocating from the heap directly.
>>>>> References need to be adjusted during copying. With that, we can
>>>>> lift the consecutive-region requirement. Since object archiving is
>>>>> only supported for static archiving, and with a large enough Java
>>>>> heap it is guaranteed to successfully allocate the top free
>>>>> regions, changing the current design has not been a high-priority
>>>>> task.
>>>>>> If, in the end, the conclusion is that we need a heap 8x the size
>>>>>> of the archived objects (256mb vs 32mb), and we understand the
>>>>>> reason why, that's fine. But I think we should go through that
>>>>>> analysis process first. In doing so we may be able to improve GC
>>>>>> to make fragmentation less likely.
>>>>>
>>>>> I think the situation is well understood. Please let me know if
>>>>> you have any additional questions; I'll try to add more
>>>>> information.
>>>>>
>>>>> Thanks,
>>>>> Jiangli
>>>>>
>>>>>> Also, do we know if Linux and Solaris have the exact same failure
>>>>>> mode? Or will Solaris fail more frequently than Linux with the
>>>>>> same heap size?
>>>>>>
>>>>>> Thanks
>>>>>> Ioi
>>>>>>
>>>>>>
>>>>>>> On Nov 26, 2018, at 3:55 PM, Jiangli
>>>>>>> Zhou <jiangli.zhou at oracle.com> wrote:
>>>>>>>
>>>>>>> Hi Ioi,
>>>>>>>
>>>>>>>> On 11/26/18 3:35 PM, Ioi Lam wrote:
>>>>>>>>
>>>>>>>> As I commented on the bug report, we should improve the error 
>>>>>>>> message. Also, maybe we can force GC to allow the test to run 
>>>>>>>> with less heap.
>>>>>>> Updating the error message sounds good to me.
>>>>>>>> A 3GB heap seems excessive. I was able to run the test with 
>>>>>>>> -Xmx256M on Linux.
>>>>>>> Using a small heap (with only a little extra space) might still
>>>>>>> run into the issue in the future. As I pointed out, alignment
>>>>>>> and GC activity are also factors. The allocation size might also
>>>>>>> change in the future.
>>>>>>>
>>>>>>> An alternative approach is to fix the test to recognize the
>>>>>>> fragmentation issue and not report a failure in that case. I'm
>>>>>>> now in favor of that approach since it's more flexible. We can
>>>>>>> then also safely set a smaller heap size (such as 256M) in the
>>>>>>> test.
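>>>>>>>
>>>>>>> Conceptually, the check in the test could look something like
>>>>>>> this (a sketch only; the message text and method name below are
>>>>>>> placeholders, not the actual VM output or test code):
>>>>>>>
>>>>>>>     static void checkDumpResult(String dumpOutput, int exitValue) {
>>>>>>>         // Placeholder for whatever the improved error message
>>>>>>>         // from the updated webrev ends up saying.
>>>>>>>         if (dumpOutput.contains("Unable to archive heap objects")) {
>>>>>>>             System.out.println("Heap fragmentation at dump time; " +
>>>>>>>                                "not treated as a test failure");
>>>>>>>             return;
>>>>>>>         }
>>>>>>>         if (exitValue != 0) {
>>>>>>>             throw new RuntimeException("Dumping failed unexpectedly");
>>>>>>>         }
>>>>>>>     }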
>>>>>>>> Also, I don't understand what you mean by "all observed
>>>>>>>> allocations were done in the lower 2G range." Why would heap
>>>>>>>> fragmentation be related to the location of the heap?
>>>>>>> In my test run, only the heap regions in the lower 2G heap range 
>>>>>>> were used for object allocations. It's not related to the heap 
>>>>>>> location.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jiangli
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> - Ioi
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 11/26/18 3:23 PM, Jiangli Zhou wrote:
>>>>>>>>> Hi Ioi,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 11/26/18 2:00 PM, Ioi Lam wrote:
>>>>>>>>>> Hi Jiangli,
>>>>>>>>>>
>>>>>>>>>> -Xms3G will most likely fail on 32-bit platforms.
>>>>>>>>> We can make the change for 64-bit platforms only, since it's a
>>>>>>>>> 64-bit-only problem. We do not archive Java objects on 32-bit
>>>>>>>>> platforms.
>>>>>>>>>> BTW, why would this test fail only on Solaris and not Linux?
>>>>>>>>>> The test doesn't specify a heap size, so the initial heap
>>>>>>>>>> size setting is picked by ergonomics. Can you reproduce the
>>>>>>>>>> failure on Linux by using the same heap size settings used by
>>>>>>>>>> the failed Solaris runs?
>>>>>>>>> The failed Solaris run didn't set the heap size explicitly.
>>>>>>>>> The heap size was determined by GC ergonomics, as you pointed
>>>>>>>>> out above. I ran the test this morning on the same Solaris
>>>>>>>>> SPARC machine, using the same binary with which the issue was
>>>>>>>>> reported. In my test run, a very large heap (>26G) was used,
>>>>>>>>> according to the GC region logging output, so the test didn't
>>>>>>>>> run into the heap fragmentation issue. All observed allocations
>>>>>>>>> were done in the lower 2G range.
>>>>>>>>>
>>>>>>>>> I don't think it is a Solaris-only issue. If the heap size is
>>>>>>>>> small enough, you could run into the issue on any supported
>>>>>>>>> platform. The issue can appear to be intermittent due to
>>>>>>>>> alignment and GC activity, even with the same heap size with
>>>>>>>>> which the failure was reported.
>>>>>>>>>
>>>>>>>>> On a Linux x64 machine, I can force the test to fail with the
>>>>>>>>> fragmentation error with a 200M Java heap.
>>>>>>>>>> I think it's better to find out the root cause than just to 
>>>>>>>>>> mask it. The purpose of LotsOfClasses.java is to stress the 
>>>>>>>>>> system to find out potential bugs.
>>>>>>>>> I think this is a test issue, not a CDS/GC issue. The test
>>>>>>>>> loads >20000 classes but doesn't set the Java heap size.
>>>>>>>>> Relying on GC ergonomics to determine the 'right' heap size is
>>>>>>>>> incorrect in this case, since dumping objects requires
>>>>>>>>> consecutive GC regions. Specifying the heap size explicitly
>>>>>>>>> doesn't 'mask' the issue; it is the right thing to do. :)
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Jiangli
>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> - Ioi
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 11/26/18 1:41 PM, Jiangli Zhou wrote:
>>>>>>>>>>> Please review the following test fix, which sets the Java
>>>>>>>>>>> heap size to 3G for dumping with a large number of classes.
>>>>>>>>>>>
>>>>>>>>>>> webrev: http://cr.openjdk.java.net/~jiangli/8214217/webrev.00/
>>>>>>>>>>>
>>>>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8214217
>>>>>>>>>>>
>>>>>>>>>>> Tested with tier1 and tier3. Also ran the test 100 times on 
>>>>>>>>>>> solaris-sparcv9 via mach5.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Jiangli
>>>>>>>>>>>
>>>>>
>>>
>>


