RFR (trivial): 8214217: [TESTBUG] runtime/appcds/LotsOfClasses.java failed on solaris sparcv9
Ioi Lam
ioi.lam at oracle.com
Tue Nov 27 21:57:19 UTC 2018
Hi Jiangli,
This patch is OK, but that's just for the purpose of removing the noise
in our testing. That way, we can keep running this stress test for other
areas (such as class loading and metadata relocation) while JDK-8214388
is being fixed.
I don't think JDK-8214388 is just a minor RFE. It's a bug affecting
reliability and scalability. I like your suggested fixes. I think we
should try to implement and test them as soon as possible. I understand
the risk of rushing the fix into JDK 12, but we can implement it now and
push to jdk/jdk after the JDK 12 branch has been taken. If all testing
goes well, I think there will be enough time to backport it to JDK 12
before the JDK 12 initial release.
Once JDK-8214388 is fixed, all the changes in this patch can be reverted.
Thanks
- Ioi
On 11/27/18 11:34 AM, Jiangli Zhou wrote:
> Ioi and I had further discussions. Here is the updated webrev with the
> error message also including the current InitialHeapSize setting:
>
> http://cr.openjdk.java.net/~jiangli/8214217/webrev.02/
>
> I filed an RFE, https://bugs.openjdk.java.net/browse/JDK-8214388, for
> improving the fragmentation handling.
>
> Thanks,
> Jiangli
>
> On 11/26/18 6:58 PM, Jiangli Zhou wrote:
>> Hi Ioi,
>>
>> Here is the updated webrev with improved object archiving error
>> message and modified test fix. Please let me know if you have other
>> suggestions.
>>
>> http://cr.openjdk.java.net/~jiangli/8214217/webrev.01/
>>
>>
>> On 11/26/18 5:49 PM, Ioi Lam wrote:
>>> I still don’t understand why it’s necessary to have a 3gb heap to
>>> archive 32mb of objects. I don’t even know if this is guaranteed to
>>> work.
>>
>> I probably was not clear in my earlier reply. I think using a lower
>> heap size instead of the 3G setting in the test is okay. 256M is
>> probably not large enough (in case the allocation size changes in the
>> future), so I changed it to 500M. Please let me know if you also think
>> that's reasonable.
>>
>>>
>>> You said “having a large enough” heap will guarantee free space. How
>>> large is large enough?
>>
>> Please see above.
>>>
>>> We are dumping the default archive with 128mb heap. Is that large
>>> enough? What’s the criteria to decide that it’s large enough?
>>
>> The default archive is created using the default class list, which
>> loads about 1000 classes. When generating the default archive, we
>> explicitly set the java heap size to 128M instead of relying on
>> ergonomics. With the 128M java heap for generating the default
>> archive, we have never run into the fragmentation issue. Different
>> java heap sizes should be used to meet different usage requirements.
>>
>>>
>>> How should users set their heap size to guarantee success in dumping
>>> their own archives? This test case shows that you can get random
>>> failures when dumping a large number of classes, so we need to prevent
>>> that from happening for our users.
>>
>> The behavior is not random. If users run into the fragmentation error,
>> they can try using a larger java heap.
>>
>>>
>>> Printing a more elaborate error message is not enough. If the error
>>> is random, it may not happen during regular testing by the users,
>>> and only happens in deployment.
>>
>> Could you please explain why you think it is random?
>>>
>>> Silently ignoring the error and continuing to dump without an
>>> archived heap is also suboptimal. The user may randomly lose the
>>> benefit of a feature without even knowing it.
>>
>> Please let me know your suggestion.
>>>
>>> And you didn’t answer my question whether the problem is worse on
>>> Solaris than Linux.
>>
>> On Solaris, I can also force it to fail with the fragmentation error
>> with 200M java heap.
>>
>> Without seeing the actual gc region logging from the failed run that
>> didn't set the java heap size explicitly, my best guess is that the
>> workload is different and causes Solaris to appear worse. That's why I
>> think it is a test bug for not setting the heap size explicitly in
>> this case.
>>
>> Thanks,
>> Jiangli
>>>
>>> Thanks
>>> Ioi
>>>
>>> On Nov 26, 2018, at 5:28 PM, Jiangli Zhou <jiangli.zhou at oracle.com
>>> <mailto:jiangli.zhou at oracle.com>> wrote:
>>>
>>>> Hi Ioi,
>>>>
>>>>
>>>> On 11/26/18 4:42 PM, Ioi Lam wrote:
>>>>> The purpose of the stress test is not to tweak the parameters so
>>>>> that the test will pass. It's to understand what the
>>>>> limitations of our system are and why they exist.
>>>>
>>>> Totally agree with the above.
>>>>> As I mentioned in the bug report, why would we run into
>>>>> fragmentation when we have 96mb free space and we need only 32mb?
>>>>> That's the question that we need to answer, not "let's just give a
>>>>> huge amount of heap".
>>>>
>>>> During object archiving, we allocate from the highest free regions.
>>>> The allocated regions must be *consecutive* regions. Those were the
>>>> design decisions made in early days when I worked with Thomas and
>>>> others in GC team for object archiving support.
>>>>
>>>> The determining factor is not the total free space in the heap; it
>>>> is the number of consecutive free regions available (starting from
>>>> the highest free one) for archiving. GC activities might cause some
>>>> regions at higher addresses to be used. As we start from the highest
>>>> free region, if we run into an already-used region during
>>>> allocation for archiving, we need to bail out.
>>>>
>>>> rn: Free Region
>>>> r(n-1): Used Region
>>>> r(n-2): Free Region
>>>> ...
>>>> Free Region
>>>> Used Region
>>>> ...
>>>> r0: Used Region
>>>>
>>>> For example, if we want 3 regions during archiving, we allocate
>>>> starting from rn. Since r(n-1) is already used, we can't use it for
>>>> archiving. Certainly, the design could be improved. One approach
>>>> that I've already discussed with Thomas is to use a temporary
>>>> buffer instead of allocating from the heap directly. References
>>>> need to be adjusted during copying. With that, we can lift the
>>>> consecutive-region requirement. Since object archiving is only
>>>> supported for static archiving, and with a large enough java heap it
>>>> is guaranteed to successfully allocate the top free regions, changing
>>>> the current design is not a high-priority task.
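The bail-out behavior described above can be sketched as follows. This is an illustrative model only, not the actual HotSpot G1/CDS code; the region representation and method name are invented for the example:

```java
// Illustrative model of the consecutive-region requirement described
// above; not the actual HotSpot G1/CDS implementation.
public class ArchiveRegionSketch {
    // true = free, false = already used; index 0 is the lowest region
    // (r0), the last index is the highest region (rn) in the heap.
    static int highestConsecutiveFree(boolean[] free, int needed) {
        int start = free.length - 1;
        // Skip any used regions at the very top to find the highest
        // free region.
        while (start >= 0 && !free[start]) start--;
        // Count consecutive free regions going downward from there.
        int count = 0;
        int i = start;
        while (i >= 0 && free[i]) { count++; i--; }
        // Bail out (return -1) if we hit a used region before getting
        // the number of consecutive regions the archive needs;
        // otherwise return the index of the lowest region used.
        return count >= needed ? start - needed + 1 : -1;
    }

    public static void main(String[] args) {
        // r0..r4 with r3 used: the top free run is only r4, so asking
        // for 3 consecutive regions bails out even though 4 of the 5
        // regions are free in total...
        boolean[] fragmented = { true, true, true, false, true };
        System.out.println(highestConsecutiveFree(fragmented, 3)); // -1
        // ...but with r3 free, the top three regions (r2..r4) suffice.
        boolean[] roomy = { true, true, true, true, true };
        System.out.println(highestConsecutiveFree(roomy, 3)); // 2
    }
}
```

This mirrors why total free space is not the determining factor: a single used region near the top of the heap forces the bail-out regardless of how much free space remains below it.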
>>>>> If at the end, the conclusion is that we need to have 8x the heap
>>>>> size of the archived object size (256mb vs 32mb), and we
>>>>> understand the reason why, that’s fine. But I think we should go
>>>>> through that analysis process first. In doing so we may be able to
>>>>> improve GC to make fragmentation less likely.
>>>>
>>>> I think the situation is well understood. Please let me know if you
>>>> have any additional questions, I'll try to add more information.
>>>>
>>>> Thanks,
>>>> Jiangli
>>>>
>>>>> Also, do we know if Linux and Solaris have the exact same failure
>>>>> mode? Or will Solaris fail more frequently than Linux with the same
>>>>> heap size?
>>>>>
>>>>> Thanks
>>>>> Ioi
>>>>>
>>>>>
>>>>>> On Nov 26, 2018, at 3:55 PM, Jiangli
>>>>>> Zhou <jiangli.zhou at oracle.com> wrote:
>>>>>>
>>>>>> Hi Ioi,
>>>>>>
>>>>>>> On 11/26/18 3:35 PM, Ioi Lam wrote:
>>>>>>>
>>>>>>> As I commented on the bug report, we should improve the error
>>>>>>> message. Also, maybe we can force GC to allow the test to run
>>>>>>> with less heap.
>>>>>> Updating the error message sounds good to me.
>>>>>>> A 3GB heap seems excessive. I was able to run the test with
>>>>>>> -Xmx256M on Linux.
>>>>>> Using a small heap (with only a little extra space) might still
>>>>>> run into the issue in the future. As I pointed out, alignment and
>>>>>> GC activities are also factors. The allocation size might also
>>>>>> change in the future.
>>>>>>
>>>>>> An alternative approach is to fix the test to recognize the
>>>>>> fragmentation issue and not report failure in that case. I'm
>>>>>> now in favor of that approach since it's more flexible. We can
>>>>>> also safely set a smaller heap size (such as 256M) in the test.
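The alternative approach could be sketched as below. The error-message text and helper name are assumptions for illustration; the real test would match the actual VM message printed when dump-time heap allocation fails:

```java
// Sketch of how the test could treat a fragmentation failure as a
// skip rather than an error. FRAGMENTATION_MSG is an assumed message
// text for this example; the real test would use the VM's actual
// error message.
public class FragmentationCheck {
    static final String FRAGMENTATION_MSG =
        "Unable to allocate consecutive heap regions for archiving";

    // Returns true when dumping failed only because of heap
    // fragmentation, in which case the test should not report failure.
    static boolean isFragmentationFailure(int exitCode, String output) {
        return exitCode != 0 && output.contains(FRAGMENTATION_MSG);
    }

    public static void main(String[] args) {
        String failed = "Error occurred during initialization of VM\n"
            + FRAGMENTATION_MSG + "\n";
        System.out.println(isFragmentationFailure(1, failed));  // true
        System.out.println(isFragmentationFailure(0, "dump OK")); // false
    }
}
```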
>>>>>>> Also, I don't understand what you mean by "all observed
>>>>>>> allocations were done in the lower 2G range.". Why would heap
>>>>>>> fragmentation be related to the location of the heap?
>>>>>> In my test run, only the heap regions in the lower 2G heap range
>>>>>> were used for object allocations. It's not related to the heap
>>>>>> location.
>>>>>>
>>>>>> Thanks,
>>>>>> Jiangli
>>>>>>> Thanks
>>>>>>>
>>>>>>> - Ioi
>>>>>>>
>>>>>>>
>>>>>>>> On 11/26/18 3:23 PM, Jiangli Zhou wrote:
>>>>>>>> Hi Ioi,
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 11/26/18 2:00 PM, Ioi Lam wrote:
>>>>>>>>> Hi Jiangli,
>>>>>>>>>
>>>>>>>>> -Xms3G will most likely fail on 32-bit platforms.
>>>>>>>> We can make the change for 64-bit platform only since it's a
>>>>>>>> 64-bit problem only. We do not archive java objects with 32-bit
>>>>>>>> platform.
>>>>>>>>> BTW, why would this test fail only on Solaris and not linux?
>>>>>>>>> The test doesn't specify heap size, so the initial heap size
>>>>>>>>> setting is picked by Ergonomics. Can you reproduce the failure
>>>>>>>>> on Linux by using the same heap size settings used by the
>>>>>>>>> failed Solaris runs?
>>>>>>>> The failed Solaris run didn't set heap size explicitly. The
>>>>>>>> heap size was determined by GC ergonomics, as you pointed out
>>>>>>>> above. I ran the test this morning on the same solaris sparc
>>>>>>>> machine using the same binary with which the issue was reported.
>>>>>>>> In my test run, a very large heap (>26G) was used according to
>>>>>>>> the gc region logging output. So the test didn't run into the
>>>>>>>> heap fragmentation issue. All observed allocations were done in
>>>>>>>> the lower 2G range.
>>>>>>>>
>>>>>>>> I don't think it is a Solaris-only issue. If the heap size is
>>>>>>>> small enough, you could run into the issue on all supported
>>>>>>>> platforms. The issue could appear to be intermittent due to
>>>>>>>> alignment and GC activities, even with the same heap size with
>>>>>>>> which the failure was reported.
>>>>>>>>
>>>>>>>> On a linux x64 machine, I can force the test to fail with the
>>>>>>>> fragmentation error with a 200M java heap.
>>>>>>>>> I think it's better to find out the root cause than just to
>>>>>>>>> mask it. The purpose of LotsOfClasses.java is to stress the
>>>>>>>>> system to find out potential bugs.
>>>>>>>> I think this is a test issue, not a CDS/GC issue. The test
>>>>>>>> loads >20000 classes but doesn't set the java heap size. Relying
>>>>>>>> on GC ergonomics to determine the 'right' heap size is
>>>>>>>> incorrect in this case since dumping objects requires
>>>>>>>> consecutive gc regions. Specifying the GC heap size explicitly
>>>>>>>> doesn't 'mask' the issue, but is the right thing to do. :)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jiangli
>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> - Ioi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 11/26/18 1:41 PM, Jiangli Zhou wrote:
>>>>>>>>>> Please review the following test fix, which sets the java
>>>>>>>>>> heap size to 3G for dumping with a large number of classes.
>>>>>>>>>>
>>>>>>>>>> webrev: http://cr.openjdk.java.net/~jiangli/8214217/webrev.00/
>>>>>>>>>>
>>>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8214217
>>>>>>>>>>
>>>>>>>>>> Tested with tier1 and tier3. Also ran the test 100 times on
>>>>>>>>>> solaris-sparcv9 via mach5.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Jiangli
>>>>>>>>>>
>>>>
>>
>