RFR (trivial): 8214217: [TESTBUG] runtime/appcds/LotsOfClasses.java failed on solaris sparcv9

Jiangli Zhou jiangli.zhou at oracle.com
Tue Nov 27 23:16:23 UTC 2018


On 11/27/18 2:34 PM, Jiangli Zhou wrote:

> Hi Ioi,
>
>
> On 11/27/18 1:57 PM, Ioi Lam wrote:
>> Hi Jiangli,
>>
>> This patch is OK, but that's just for the purpose of removing the 
>> noise in our testing. That way, we can keep running this stress test 
>> for other areas (such as class loading and metadata relocation) while 
>> JDK-8214388 is being fixed.
> Thanks for the confirmation. Will push now.
>>
>> I don't think JDK-8214388 is just a minor RFE. It's a bug that affects 
>> reliability and scalability. I like your suggested fixes. I think we 
>> should try to implement and test them as soon as possible. I 
>> understand the risk of rushing the fix into JDK 12, but we can 
>> implement it now and push it to jdk/jdk after the JDK 12 branch has 
>> been taken. If all testing goes well, I think there will be enough 
>> time to backport it to JDK 12 before the JDK 12 initial release.
>>
>> Once JDK-8214388 is fixed, all the changes in this patch can be 
>> reverted.
>
> No objection to looking into the enhancement/fix now. Let's 
> investigate the usage requirements and understand all the edge cases 
> and limitations well, then decide on the right solution. Since it 
> touches the core of object archiving, caution should be taken when 
> changing that area. I would also like to see how commonly a large 
> number of classes (>20000 or 30000) is dumped, and the typical java 
> heap usage for those types of large applications. Understanding those 
> will help us make the right decision on how to enhance the system.
>
> As I also raised in the offline discussion, for users who want to do 
> static archiving using a customized class list, using the runtime GC 
> settings at dump time is recommended, since the system is most optimal 
> in that case. When a different java heap size is chosen at dump time, 
> relocation might be needed at runtime and the sharing benefit is lost 
> for archived objects in the closed archive regions. Even with the 
> second approach I proposed in the new RFE, not all edge cases would be 
> covered. If the dump-time heap is not large enough and an OOM occurs 
> during class loading, we need to abort dumping in that case. 
> Flexibility is desirable, but we should also call out the limitations.

Another edge case where we need to bail out:
If the dump-time java heap has only enough space for the allocations 
done during loading/linking, but not enough space left for archiving, 
then even with proposed solution #2 object archiving cannot be 
performed. A larger java heap would be needed in that case.

I will record all of the above cases in the RFE and we can redirect the 
discussions there.

Thanks,
Jiangli
>
> Thanks,
> Jiangli
>
>>
>> Thanks
>>
>> - Ioi
>>
>>
>> On 11/27/18 11:34 AM, Jiangli Zhou wrote:
>>> Ioi and I had further discussions. Here is the updated webrev with 
>>> the error message also including the current InitialHeapSize setting:
>>>
>>>   http://cr.openjdk.java.net/~jiangli/8214217/webrev.02/
>>>
>>> I filed an RFE, https://bugs.openjdk.java.net/browse/JDK-8214388, 
>>> for improving the fragmentation handling.
>>>
>>> Thanks,
>>> Jiangli
>>>
>>> On 11/26/18 6:58 PM, Jiangli Zhou wrote:
>>>> Hi Ioi,
>>>>
>>>> Here is the updated webrev with an improved object archiving error 
>>>> message and a modified test fix. Please let me know if you have 
>>>> other suggestions.
>>>>
>>>> http://cr.openjdk.java.net/~jiangli/8214217/webrev.01/
>>>>
>>>>
>>>> On 11/26/18 5:49 PM, Ioi Lam wrote:
>>>>> I still don’t understand why it’s necessary to have a 3gb heap to 
>>>>> archive 32mb of objects. I don’t even know if this is guaranteed 
>>>>> to work.
>>>>
>>>> I probably was not clear in my earlier reply. I think using a 
>>>> lower heap size instead of the 3G setting in the test is okay. 256M 
>>>> probably is not large enough (in case the allocation size changes 
>>>> in the future), so I changed it to 500M. Please let me know if you 
>>>> also think that's reasonable.
>>>>
>>>>>
>>>>> You said “having a large enough heap” will guarantee free space. 
>>>>> How large is large enough?
>>>>
>>>> Please see above.
>>>>>
>>>>> We are dumping the default archive with a 128mb heap. Is that large 
>>>>> enough? What are the criteria for deciding that it’s large enough?
>>>>
>>>> The default archive is created using the default class list, which 
>>>> loads about 1000 classes. When generating the default archive, we 
>>>> explicitly set the java heap size to 128M instead of relying on 
>>>> ergonomics. With the 128M java heap for generating the default 
>>>> archive, we have never run into the fragmentation issue. Different 
>>>> java heap sizes should be used to meet different usage requirements.
>>>>
>>>>>
>>>>> How should users set their heap size to guarantee success in 
>>>>> dumping their own archives? This test case shows that you can get 
>>>>> random failures when dumping a large number of classes, so we need 
>>>>> to prevent that from happening for our users.
>>>>
>>>> The behavior is not random. If users run into the fragmentation 
>>>> error, they can try using a larger java heap.
>>>>
>>>>>
>>>>> Printing a more elaborate error message is not enough. If the 
>>>>> error is random, it may not happen during regular testing by the 
>>>>> users, and may only show up in deployment.
>>>>
>>>> Could you please explain why you think it is random?
>>>>>
>>>>> Silently ignoring the error and continuing to dump without an 
>>>>> archived heap is also suboptimal. The user may randomly lose the 
>>>>> benefit of a feature without even knowing it.
>>>>
>>>> Please let me know your suggestion.
>>>>>
>>>>> And you didn’t answer my question whether the problem is worse on 
>>>>> Solaris than Linux.
>>>>
>>>> On Solaris, I can also force it to fail with the fragmentation 
>>>> error with a 200M java heap.
>>>>
>>>> Without seeing the actual gc region logging from the failed run 
>>>> that didn't set the java heap size explicitly, my best guess is 
>>>> that the workload is different and causes Solaris to appear worse. 
>>>> That's why I think it is a test bug not to set the heap size 
>>>> explicitly in this case.
>>>>
>>>> Thanks,
>>>> Jiangli
>>>>>
>>>>> Thanks
>>>>> Ioi
>>>>>
>>>>> On Nov 26, 2018, at 5:28 PM, Jiangli Zhou 
>>>>> <jiangli.zhou at oracle.com> wrote:
>>>>>
>>>>>> Hi Ioi,
>>>>>>
>>>>>>
>>>>>> On 11/26/18 4:42 PM, Ioi Lam wrote:
>>>>>>> The purpose of the stress test is not to tweak the parameters so 
>>>>>>> that the test will pass. It’s to understand what the 
>>>>>>> limitations of our system are and why they exist.
>>>>>>
>>>>>> Totally agree with the above.
>>>>>>> As I mentioned in the bug report, why would we run into 
>>>>>>> fragmentation when we have 96mb of free space and we need only 
>>>>>>> 32mb? That’s the question we need to answer, not “let’s just 
>>>>>>> give a huge amount of heap”.
>>>>>>
>>>>>> During object archiving, we allocate from the highest free 
>>>>>> regions. The allocated regions must be *consecutive* regions. 
>>>>>> Those were the design decisions made in the early days when I 
>>>>>> worked with Thomas and others in the GC team on object archiving 
>>>>>> support.
>>>>>>
>>>>>> The determining factor is not the total free space in the heap; 
>>>>>> it is the number of consecutive free regions available (starting 
>>>>>> from the highest free one) for archiving. GC activity might cause 
>>>>>> some regions at higher addresses to be used. Since we start from 
>>>>>> the highest free region, if we run into an already used region 
>>>>>> while allocating for archiving, we need to bail out.
>>>>>>
>>>>>> rn:      Free Region
>>>>>> r(n-1):  Used Region
>>>>>> r(n-2):  Free Region
>>>>>>          ...
>>>>>>          Free Region
>>>>>>          Used Region
>>>>>>          ...
>>>>>> r0:      Used Region
>>>>>>
>>>>>> For example, if we want 3 regions during archiving, we allocate 
>>>>>> starting from rn. Since r(n-1) is already used, we can't use it 
>>>>>> for archiving. Certainly, the design could be improved. One 
>>>>>> approach that I've already discussed with Thomas is to use a 
>>>>>> temporary buffer instead of allocating from the heap directly. 
>>>>>> References need to be adjusted during copying. With that, we can 
>>>>>> lift the consecutive-region requirement. Since object archiving 
>>>>>> is only supported for static archiving, and with a large enough 
>>>>>> java heap it is guaranteed to successfully allocate the top free 
>>>>>> regions, changing the current design is not a high-priority 
>>>>>> task.
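>>>>>>
>>>>>> To make the constraint concrete, here is a small toy model in 
>>>>>> Java (purely illustrative; the region array and method names are 
>>>>>> invented and this is not the actual G1/HotSpot code). Starting at 
>>>>>> the highest free region and walking downward, every region in the 
>>>>>> requested range must be free, otherwise the dump has to bail out:
>>>>>>
>>>>>>     /** Toy model of the "consecutive free regions from the top" constraint. */
>>>>>>     public class ConsecutiveRegionSketch {
>>>>>>
>>>>>>         // Returns the lowest index of the reserved range, or -1 if a used
>>>>>>         // region is hit before 'needed' consecutive free regions are found.
>>>>>>         static int reserveTopRegions(boolean[] used, int needed) {
>>>>>>             int top = used.length - 1;
>>>>>>             while (top >= 0 && used[top]) {
>>>>>>                 top--;             // start from the highest *free* region
>>>>>>             }
>>>>>>             for (int i = 0; i < needed; i++) {
>>>>>>                 int idx = top - i;
>>>>>>                 if (idx < 0 || used[idx]) {
>>>>>>                     return -1;     // hit a used region: bail out (fragmentation)
>>>>>>                 }
>>>>>>             }
>>>>>>             return top - needed + 1;
>>>>>>         }
>>>>>>
>>>>>>         public static void main(String[] args) {
>>>>>>             // Layout from the diagram above: rn free, r(n-1) used, r(n-2) free.
>>>>>>             boolean[] used = { true, true, false, true, false };
>>>>>>             System.out.println(reserveTopRegions(used, 3));  // -1: r(n-1) is used
>>>>>>             System.out.println(reserveTopRegions(used, 1));  // 4: rn alone suffices
>>>>>>         }
>>>>>>     }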
>>>>>>> If, at the end, the conclusion is that we need a heap 8x the 
>>>>>>> size of the archived objects (256mb vs 32mb), and we understand 
>>>>>>> the reason why, that’s fine. But I think we should go through 
>>>>>>> that analysis process first. In doing so we may be able to 
>>>>>>> improve the GC to make fragmentation less likely.
>>>>>>
>>>>>> I think the situation is well understood. Please let me know if 
>>>>>> you have any additional questions; I'll try to add more information.
>>>>>>
>>>>>> Thanks,
>>>>>> Jiangli
>>>>>>
>>>>>>> Also, do we know if Linux and Solaris have the exact same 
>>>>>>> failure mode? Or will Solaris fail more frequently than Linux 
>>>>>>> with the same heap size?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ioi
>>>>>>>
>>>>>>>
>>>>>>>> On Nov 26, 2018, at 3:55 PM, Jiangli 
>>>>>>>> Zhou <jiangli.zhou at oracle.com> wrote:
>>>>>>>>
>>>>>>>> Hi Ioi,
>>>>>>>>
>>>>>>>>> On 11/26/18 3:35 PM, Ioi Lam wrote:
>>>>>>>>>
>>>>>>>>> As I commented on the bug report, we should improve the error 
>>>>>>>>> message. Also, maybe we can force GC to allow the test to run 
>>>>>>>>> with less heap.
>>>>>>>> Updating the error message sounds good to me.
>>>>>>>>> A 3GB heap seems excessive. I was able to run the test with 
>>>>>>>>> -Xmx256M on Linux.
>>>>>>>> Using a small heap (with only a little extra space) might still 
>>>>>>>> run into the issue in the future. As I pointed out, alignment 
>>>>>>>> and GC activity are also factors, and the allocation size might 
>>>>>>>> also change in the future.
>>>>>>>>
>>>>>>>> An alternative approach is to fix the test to recognize the 
>>>>>>>> fragmentation issue and not report a failure in that case. I'm 
>>>>>>>> now in favor of that approach since it's more flexible. We could 
>>>>>>>> also safely set a smaller heap size (such as 256M) in the test.
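>>>>>>>>
>>>>>>>> As a rough sketch of that alternative (assuming the jtreg test 
>>>>>>>> library's OutputAnalyzer; the helper class name and the quoted 
>>>>>>>> error text below are placeholders, not the actual VM message or 
>>>>>>>> the code in any webrev):
>>>>>>>>
>>>>>>>>     import jdk.test.lib.process.OutputAnalyzer;
>>>>>>>>
>>>>>>>>     public class FragmentationAwareCheckSketch {
>>>>>>>>         // Hypothetical helper: treat a dump that failed only because the
>>>>>>>>         // java heap was too fragmented for archiving as an acceptable
>>>>>>>>         // outcome rather than a test failure.
>>>>>>>>         static void checkDumpOutput(OutputAnalyzer output) {
>>>>>>>>             String fragmented = "Unable to allocate at the top of the heap"; // assumed text
>>>>>>>>             if (output.getExitValue() != 0
>>>>>>>>                     && output.getOutput().contains(fragmented)) {
>>>>>>>>                 System.out.println("Dump-time heap too fragmented for " +
>>>>>>>>                                    "object archiving; treating run as skipped.");
>>>>>>>>                 return;
>>>>>>>>             }
>>>>>>>>             output.shouldHaveExitValue(0);
>>>>>>>>         }
>>>>>>>>     }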
>>>>>>>>> Also, I don't understand what you mean by "all observed 
>>>>>>>>> allocations were done in the lower 2G range". Why would heap 
>>>>>>>>> fragmentation be related to the location of the heap?
>>>>>>>> In my test run, only the heap regions in the lower 2G heap 
>>>>>>>> range were used for object allocations. It's not related to the 
>>>>>>>> heap location.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jiangli
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> - Ioi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 11/26/18 3:23 PM, Jiangli Zhou wrote:
>>>>>>>>>> Hi Ioi,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 11/26/18 2:00 PM, Ioi Lam wrote:
>>>>>>>>>>> Hi Jiangli,
>>>>>>>>>>>
>>>>>>>>>>> -Xms3G will most likely fail on 32-bit platforms.
>>>>>>>>>> We can make the change for 64-bit platforms only, since it's 
>>>>>>>>>> a 64-bit-only problem. We do not archive java objects on 
>>>>>>>>>> 32-bit platforms.
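>>>>>>>>>>
>>>>>>>>>> For reference, a jtreg test is usually limited to 64-bit VMs 
>>>>>>>>>> with an @requires tag; a minimal sketch (illustrative only, 
>>>>>>>>>> with a made-up test class name, not taken from the webrev):
>>>>>>>>>>
>>>>>>>>>>     /*
>>>>>>>>>>      * @test
>>>>>>>>>>      * @summary illustrative sketch: restrict a CDS dump test to 64-bit VMs
>>>>>>>>>>      * @requires vm.bits == "64"
>>>>>>>>>>      * @run main/othervm Only64BitDumpSketch
>>>>>>>>>>      */
>>>>>>>>>>     public class Only64BitDumpSketch {
>>>>>>>>>>         public static void main(String[] args) {
>>>>>>>>>>             // The actual dump/run logic of the test would go here.
>>>>>>>>>>         }
>>>>>>>>>>     }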
>>>>>>>>>>> BTW, why would this test fail only on Solaris and not Linux? 
>>>>>>>>>>> The test doesn't specify a heap size, so the initial heap 
>>>>>>>>>>> size is picked by ergonomics. Can you reproduce the failure 
>>>>>>>>>>> on Linux by using the same heap size settings used by the 
>>>>>>>>>>> failed Solaris runs?
>>>>>>>>>> The failed Solaris run didn't set the heap size explicitly; 
>>>>>>>>>> the heap size was determined by GC ergonomics, as you pointed 
>>>>>>>>>> out above. I ran the test this morning on the same solaris 
>>>>>>>>>> sparc machine using the same binary that was reported for the 
>>>>>>>>>> issue. In my test run, a very large heap (>26G) was used 
>>>>>>>>>> according to the gc region logging output, so the test didn't 
>>>>>>>>>> run into the heap fragmentation issue. All observed 
>>>>>>>>>> allocations were done in the lower 2G range.
>>>>>>>>>>
>>>>>>>>>> I don't think it is a Solaris-only issue. If the heap size is 
>>>>>>>>>> small enough, you can run into the issue on all supported 
>>>>>>>>>> platforms. The issue can appear to be intermittent, due to 
>>>>>>>>>> alignment and GC activity, even with the same heap size with 
>>>>>>>>>> which the failure was reported.
>>>>>>>>>>
>>>>>>>>>> On a linux x64 machine, I can force the test to fail with the 
>>>>>>>>>> fragmentation error with a 200M java heap.
>>>>>>>>>>> I think it's better to find out the root cause than just to 
>>>>>>>>>>> mask it. The purpose of LotsOfClasses.java is to stress the 
>>>>>>>>>>> system to uncover potential bugs.
>>>>>>>>>> I think this is a test issue, not a CDS/GC issue. The test 
>>>>>>>>>> loads >20000 classes but doesn't set the java heap size. 
>>>>>>>>>> Relying on GC ergonomics to determine the 'right' heap size 
>>>>>>>>>> is incorrect in this case, since dumping objects requires 
>>>>>>>>>> consecutive gc regions. Specifying the heap size explicitly 
>>>>>>>>>> doesn't 'mask' the issue; it is the right thing to do. :)
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Jiangli
>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> - Ioi
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> On 11/26/18 1:41 PM, Jiangli Zhou wrote:
>>>>>>>>>>>> Please review the following test fix, which sets the java 
>>>>>>>>>>>> heap size to 3G for dumping with a large number of classes.
>>>>>>>>>>>>
>>>>>>>>>>>> webrev: http://cr.openjdk.java.net/~jiangli/8214217/webrev.00/
>>>>>>>>>>>>
>>>>>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8214217
>>>>>>>>>>>>
>>>>>>>>>>>> Tested with tier1 and tier3. Also ran the test 100 times on 
>>>>>>>>>>>> solaris-sparcv9 via mach5.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Jiangli
>>>>>>>>>>>>
>>>>>>
>>>>
>>>
>


