RFR (trivial): 8214217: [TESTBUG] runtime/appcds/LotsOfClasses.java failed on solaris sparcv9

Calvin Cheung calvin.cheung at oracle.com
Tue Nov 27 21:30:20 UTC 2018


Looks okay to me.

thanks,
Calvin

On 11/27/18, 11:34 AM, Jiangli Zhou wrote:
> Ioi and I had further discussions. Here is the updated webrev; the 
> error message now also includes the current InitialHeapSize setting:
>
>   http://cr.openjdk.java.net/~jiangli/8214217/webrev.02/
>
> I filed an RFE, https://bugs.openjdk.java.net/browse/JDK-8214388, for 
> improving the fragmentation handling.
>
> Thanks,
> Jiangli
>
> On 11/26/18 6:58 PM, Jiangli Zhou wrote:
>> Hi Ioi,
>>
>> Here is the updated webrev with an improved object-archiving error 
>> message and a modified test fix. Please let me know if you have other 
>> suggestions.
>>
>> http://cr.openjdk.java.net/~jiangli/8214217/webrev.01/
>>
>>
>> On 11/26/18 5:49 PM, Ioi Lam wrote:
>>> I still don’t understand why it’s necessary to have a 3GB heap to 
>>> archive 32MB of objects. I don’t even know if this is guaranteed to 
>>> work.
>>
>> I probably was not clear in my earlier reply. I think using a lowered 
>> heap size instead of the 3G setting in the test is okay. 256M probably 
>> is not large enough (in case the allocation size changes in the 
>> future), so I changed it to 500M. Please let me know if you also think 
>> that's reasonable.
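>>
>> As a rough sketch (assuming the jtreg CDS test library helpers 
>> CDSOptions and CDSTestUtils; the actual test code may differ), the 
>> dump step would pass the heap size explicitly instead of relying on 
>> ergonomics:
>>
>>     import jdk.test.lib.cds.CDSOptions;
>>     import jdk.test.lib.cds.CDSTestUtils;
>>     import jdk.test.lib.process.OutputAnalyzer;
>>
>>     public class DumpWithExplicitHeap {
>>         public static void main(String[] args) throws Exception {
>>             CDSOptions opts = new CDSOptions();
>>             opts.addSuffix("-Xmx500m");              // explicit heap, not ergonomics
>>             opts.addSuffix("-Xlog:gc+region=trace"); // log region allocations
>>             OutputAnalyzer out = CDSTestUtils.createArchive(opts);
>>             out.shouldHaveExitValue(0);
>>         }
>>     }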
>>
>>>
>>> You said “having a large enough heap” will guarantee free space. How 
>>> large is large enough?
>>
>> Please see above.
>>>
>>> We are dumping the default archive with a 128MB heap. Is that large 
>>> enough? What’s the criterion to decide that it’s large enough?
>>
>> The default archive is created using the default class list, which 
>> loads about 1000 classes. When generating the default archive, we 
>> explicitly set the java heap size to 128M instead of relying on 
>> ergonomics. With the 128M java heap for generating the default 
>> archive, we have never run into the fragmentation issue. Different 
>> java heap sizes should be used to meet different usage requirements.
>>
>>>
>>> How should users set their heap size to guarantee success in dumping 
>>> their own archives? This test case shows that you can get random 
>>> failures when dumping a large number of classes, so we need to 
>>> prevent that from happening for our users.
>>
>> The behavior is not random. If users run into the fragmentation error, 
>> they can retry with a larger java heap.
>>
>>>
>>> Printing a more elaborate error message is not enough. If the error 
>>> is random, it may not happen during regular testing by the users, 
>>> and only show up in deployment.
>>
>> Could you please explain why you think it is random?
>>>
>>> Silently ignoring the error and continuing to dump without the 
>>> archived heap is also suboptimal. The user may randomly lose the 
>>> benefit of a feature without even knowing it.
>>
>> Please let me know your suggestion.
>>>
>>> And you didn’t answer my question whether the problem is worse on 
>>> Solaris than Linux.
>>
>> On Solaris, I can also force it to fail with the fragmentation error 
>> with a 200M java heap.
>>
>> Without seeing the actual gc region logging from the failed run that 
>> didn't set the java heap size explicitly, my best guess is that the 
>> workload is different, which causes Solaris to appear worse. That's 
>> why I think it is a test bug not to set the heap size explicitly in 
>> this case.
>>
>> Thanks,
>> Jiangli
>>>
>>> Thanks
>>> Ioi
>>>
>>> On Nov 26, 2018, at 5:28 PM, Jiangli Zhou 
>>> <jiangli.zhou at oracle.com> wrote:
>>>
>>>> Hi Ioi,
>>>>
>>>>
>>>> On 11/26/18 4:42 PM, Ioi Lam wrote:
>>>>> The purpose of the stress test is not to tweak the parameters so 
>>>>> that the test will pass. It’s to understand what the limitations 
>>>>> of our system are and why they exist.
>>>>
>>>> Totally agree with the above.
>>>>> As I mentioned in the bug report, why would we run into 
>>>>> fragmentation when we have 96MB of free space and need only 32MB? 
>>>>> That’s the question that we need to answer, not “let’s just give a 
>>>>> huge amount of heap”.
>>>>
>>>> During object archiving, we allocate from the highest free regions. 
>>>> The allocated regions must be *consecutive* regions. Those were the 
>>>> design decisions made in the early days when I worked with Thomas and 
>>>> others on the GC team on object archiving support.
>>>>
>>>> The determining factor is not the total free space in the heap; it is 
>>>> the number of consecutive free regions available (starting from the 
>>>> highest free one) for archiving. GC activities might cause some 
>>>> regions at higher addresses to be used. As we start from the highest 
>>>> free region, if we run into an already used region during 
>>>> allocation for archiving, we need to bail out.
>>>>
>>>> rn:     Free Region
>>>> r(n-1): Used Region
>>>> r(n-2): Free Region
>>>>         ...
>>>>         Free Region
>>>>         Used Region
>>>>         ...
>>>> r0:     Used Region
>>>>
>>>> For example, if we want 3 regions during archiving, we allocate 
>>>> starting from rn. Since r(n-1) is already used, we can't use it for 
>>>> archiving. Certainly, the design could be improved. One approach 
>>>> that I've already discussed with Thomas is to use a temporary 
>>>> buffer instead of allocating from the heap directly. References 
>>>> would need to be adjusted during copying. With that, we can lift the 
>>>> consecutive-region requirement. Since object archiving is only 
>>>> supported for static archiving, and a large enough java heap 
>>>> guarantees that the top free regions can be allocated, changing 
>>>> the current design is not a high-priority task.
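>>>>
>>>> To make the constraint concrete, here is a minimal sketch (plain 
>>>> Java, not the actual HotSpot G1 code) of the scan described above: 
>>>> archiving needs N *consecutive* free regions at the top of the 
>>>> heap, and a single used region in that range forces a bail-out no 
>>>> matter how much total free space remains:
>>>>
>>>>     public class ConsecutiveRegionScan {
>>>>         // used[i] is true if heap region i is in use; indices grow
>>>>         // with address, so the highest region is the last element.
>>>>         // Returns the lowest index of the reserved range, or -1 if
>>>>         // a used region is hit before 'needed' consecutive free
>>>>         // regions are found (the fragmentation bail-out).
>>>>         static int reserveTopRegions(boolean[] used, int needed) {
>>>>             int i = used.length - 1;
>>>>             while (i >= 0 && used[i]) i--;  // find highest free region
>>>>             for (int free = 0; i >= 0 && !used[i]; i--) {
>>>>                 if (++free == needed) return i;
>>>>             }
>>>>             return -1;
>>>>         }
>>>>
>>>>         public static void main(String[] args) {
>>>>             // The example above: r(n-1) is used, so a request for 3
>>>>             // regions fails although 4 of the 6 regions are free.
>>>>             boolean[] used = {true, false, false, false, true, false};
>>>>             System.out.println(reserveTopRegions(used, 3)); // -1
>>>>         }
>>>>     }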
>>>>> If, at the end, the conclusion is that we need 8x the heap size 
>>>>> of the archived object size (256MB vs. 32MB), and we understand 
>>>>> the reason why, that’s fine. But I think we should go through 
>>>>> that analysis process first. In doing so we may be able to 
>>>>> improve GC to make fragmentation less likely.
>>>>
>>>> I think the situation is well understood. Please let me know if you 
>>>> have any additional questions; I'll try to add more information.
>>>>
>>>> Thanks,
>>>> Jiangli
>>>>
>>>>> Also, do we know if Linux and Solaris have the exact same failure 
>>>>> mode? Or will Solaris fail more frequently than Linux with the same 
>>>>> heap size?
>>>>>
>>>>> Thanks
>>>>> Ioi
>>>>>
>>>>>
>>>>>> On Nov 26, 2018, at 3:55 PM, Jiangli Zhou 
>>>>>> <jiangli.zhou at oracle.com> wrote:
>>>>>>
>>>>>> Hi Ioi,
>>>>>>
>>>>>>> On 11/26/18 3:35 PM, Ioi Lam wrote:
>>>>>>>
>>>>>>> As I commented on the bug report, we should improve the error 
>>>>>>> message. Also, maybe we can force GC to allow the test to run 
>>>>>>> with less heap.
>>>>>> Updating the error message sounds good to me.
>>>>>>> A 3GB heap seems excessive. I was able to run the test with 
>>>>>>> -Xmx256M on Linux.
>>>>>> Using a small heap (with only a little extra space) might still 
>>>>>> run into the issue. As I pointed out, alignment and GC activities 
>>>>>> are also factors, and the allocation size might change in the 
>>>>>> future.
>>>>>>
>>>>>> An alternative approach is to fix the test to recognize the 
>>>>>> fragmentation issue and not report a failure in that case. I'm 
>>>>>> now in favor of that approach since it's more flexible. We can 
>>>>>> also safely set a smaller heap size (such as 256M) in the test.
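>>>>>>
>>>>>> A hypothetical sketch of that test-side check, using the jtreg 
>>>>>> test library (the exact wording of the fragmentation message is 
>>>>>> an assumption, not the real string):
>>>>>>
>>>>>>     OutputAnalyzer out = CDSTestUtils.createArchive(opts);
>>>>>>     String msg = "Unable to allocate";  // assumed message prefix
>>>>>>     if (out.getExitValue() != 0 && out.getOutput().contains(msg)) {
>>>>>>         // Dump bailed out due to heap fragmentation; tolerate it.
>>>>>>         System.out.println("Skipped: heap fragmentation at dump time");
>>>>>>     } else {
>>>>>>         out.shouldHaveExitValue(0);
>>>>>>     }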
>>>>>>> Also, I don't understand what you mean by "all observed 
>>>>>>> allocations were done in the lower 2G range.". Why would heap 
>>>>>>> fragmentation be related to the location of the heap?
>>>>>> In my test run, only the heap regions in the lower 2G heap range 
>>>>>> were used for object allocations. It's not related to the heap 
>>>>>> location.
>>>>>>
>>>>>> Thanks,
>>>>>> Jiangli
>>>>>>> Thanks
>>>>>>>
>>>>>>> - Ioi
>>>>>>>
>>>>>>>
>>>>>>>> On 11/26/18 3:23 PM, Jiangli Zhou wrote:
>>>>>>>> Hi Ioi,
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 11/26/18 2:00 PM, Ioi Lam wrote:
>>>>>>>>> Hi Jiangli,
>>>>>>>>>
>>>>>>>>> -Xms3G will most likely fail on 32-bit platforms.
>>>>>>>> We can make the change for 64-bit platforms only, since it's a 
>>>>>>>> 64-bit-only problem. We do not archive java objects on 32-bit 
>>>>>>>> platforms.
>>>>>>>>> BTW, why would this test fail only on Solaris and not linux? 
>>>>>>>>> The test doesn't specify heap size, so the initial heap size 
>>>>>>>>> setting is picked by Ergonomics. Can you reproduce the failure 
>>>>>>>>> on Linux by using the same heap size settings used by the 
>>>>>>>>> failed Solaris runs?
>>>>>>>> The failed Solaris run didn't set the heap size explicitly. The 
>>>>>>>> heap size was determined by GC ergonomics, as you pointed out 
>>>>>>>> above. I ran the test this morning on the same solaris sparc 
>>>>>>>> machine, using the same binary with which the issue was reported. 
>>>>>>>> In my test run, a very large heap (>26G) was used according to 
>>>>>>>> the gc region logging output, so the test didn't run into the 
>>>>>>>> heap fragmentation issue. All observed allocations were done in 
>>>>>>>> the lower 2G range.
>>>>>>>>
>>>>>>>> I don't think it is a Solaris-only issue. If the heap size is 
>>>>>>>> small enough, you could run into the issue on all supported 
>>>>>>>> platforms. The issue could appear to be intermittent, due to 
>>>>>>>> alignment and GC activities, even with the same heap size with 
>>>>>>>> which the failure was reported.
>>>>>>>>
>>>>>>>> On a linux x64 machine, I can force the test to fail with the 
>>>>>>>> fragmentation error with a 200M java heap.
>>>>>>>>> I think it's better to find out the root cause than just to 
>>>>>>>>> mask it. The purpose of LotsOfClasses.java is to stress the 
>>>>>>>>> system to find out potential bugs.
>>>>>>>> I think this is a test issue, not a CDS/GC issue. The test 
>>>>>>>> loads >20000 classes but doesn't set the java heap size. Relying 
>>>>>>>> on GC ergonomics to determine the 'right' heap size is 
>>>>>>>> incorrect in this case, since dumping objects requires 
>>>>>>>> consecutive gc regions. Specifying the GC heap size explicitly 
>>>>>>>> doesn't 'mask' the issue; it is the right thing to do. :)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jiangli
>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> - Ioi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 11/26/18 1:41 PM, Jiangli Zhou wrote:
>>>>>>>>>> Please review the following test fix, which sets the java 
>>>>>>>>>> heap size to 3G for dumping with a large number of classes.
>>>>>>>>>>
>>>>>>>>>> webrev: http://cr.openjdk.java.net/~jiangli/8214217/webrev.00/
>>>>>>>>>>
>>>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8214217
>>>>>>>>>>
>>>>>>>>>> Tested with tier1 and tier3. Also ran the test 100 times on 
>>>>>>>>>> solaris-sparcv9 via mach5.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Jiangli
>>>>>>>>>>
>>>>
>>
>

