RFR (trivial): 8214217: [TESTBUG] runtime/appcds/LotsOfClasses.java failed on solaris sparcv9
Jiangli Zhou
jiangli.zhou at oracle.com
Tue Nov 27 19:34:25 UTC 2018
Ioi and I had further discussions. Here is the updated webrev with the
error message also including the current InitialHeapSize setting:
http://cr.openjdk.java.net/~jiangli/8214217/webrev.02/
I filed an RFE, https://bugs.openjdk.java.net/browse/JDK-8214388, for
improving the fragmentation handling.
Thanks,
Jiangli
On 11/26/18 6:58 PM, Jiangli Zhou wrote:
> Hi Ioi,
>
> Here is the updated webrev with an improved object archiving error
> message and a modified test fix. Please let me know if you have other
> suggestions.
>
> http://cr.openjdk.java.net/~jiangli/8214217/webrev.01/
>
>
> On 11/26/18 5:49 PM, Ioi Lam wrote:
>> I still don’t understand why it’s necessary to have a 3gb heap to
>> archive 32mb of objects. I don’t even know if this is guaranteed to
>> work.
>
> I probably was not clear in my earlier reply. I think using a lower
> heap size instead of the 3G setting in the test is okay. 256M is
> probably not large enough (in case the allocation size changes in the
> future), so I changed it to 500M. Please let me know if you also think
> that's reasonable.
>
>>
>> You said “having a large enough” heap will guarantee free space. How
>> large is large enough?
>
> Please see above.
>>
>> We are dumping the default archive with a 128mb heap. Is that large
>> enough? What's the criterion for deciding that it's large enough?
>
> The default archive is created using the default class list, which
> loads about 1000 classes. When generating the default archive, we
> explicitly set the java heap size to 128M instead of relying on
> ergonomics. With the 128M java heap for generating the default
> archive, we have never run into the fragmentation issue. Different
> java heap sizes should be used to meet different usage requirements.
>
>>
>> How should users set their heap size to guarantee success in dumping
>> their own archives? This test case shows that you can get random
>> failures when dumping a large number of classes, so we need to prevent
>> that from happening for our users.
>
> The behavior is not random. If users run into the fragmentation error,
> they can try using a larger java heap.
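>
> For example, a user dumping their own application archive could rerun
> the dump with an explicitly larger heap (the class list, archive name,
> and heap size below are just placeholders):
>
>     java -Xmx1G -Xshare:dump \
>         -XX:SharedClassListFile=app.classlist \
>         -XX:SharedArchiveFile=app.jsa \
>         -cp app.jar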
>
>>
>> Printing a more elaborate error message is not enough. If the error
>> is random, it may not happen during regular testing by the users and
>> only show up in deployment.
>
> Could you please explain why you think it is random?
>>
>> Silently ignoring the error and continuing to dump without the
>> archived heap is also suboptimal. The user may randomly lose the
>> benefit of a feature without even knowing it.
>
> Please let me know your suggestion.
>>
>> And you didn’t answer my question whether the problem is worse on
>> Solaris than Linux.
>
> On Solaris, I can also force it to fail with the fragmentation error
> with a 200M java heap.
>
> Without seeing the actual gc region logging from the failed run that
> didn't set the java heap size explicitly, my best guess is that the
> workload is different and causes Solaris to appear worse. That's why I
> think it is a test bug not to set the heap size explicitly in this
> case.
>
> Thanks,
> Jiangli
>>
>> Thanks
>> Ioi
>>
>> On Nov 26, 2018, at 5:28 PM, Jiangli Zhou <jiangli.zhou at oracle.com> wrote:
>>
>>> Hi Ioi,
>>>
>>>
>>> On 11/26/18 4:42 PM, Ioi Lam wrote:
>>>> The purpose of the stress test is not to tweak the parameters so
>>>> that the test will pass. It’s to understand what the limitations of
>>>> our system are and why they exist.
>>>
>>> Totally agree with the above.
>>>> As I mentioned in the bug report, why would we run into
>>>> fragmentation when we have 96mb of free space and we need only 32mb?
>>>> That’s the question that we need to answer, not “let’s just give a
>>>> huge amount of heap”.
>>>
>>> During object archiving, we allocate from the highest free regions.
>>> The allocated regions must be *consecutive* regions. Those were the
>>> design decisions made in the early days when I worked with Thomas and
>>> others in the GC team on object archiving support.
>>>
>>> The determining factor is not the total free space in the heap; it is
>>> the number of consecutive free regions available (starting from the
>>> highest free one) for archiving. GC activity might cause some regions
>>> at higher addresses to be used. As we start from the highest free
>>> region, if we run into an already used region during allocation for
>>> archiving, we need to bail out.
>>>
>>> rn: Free Region
>>> r(n-1): Used Region
>>> r(n-2): Free Region
>>> ...
>>> Free Region
>>> Used Region
>>> ...
>>> r0: Used Region
>>>
>>> For example, if we want 3 regions during archiving, we allocate
>>> starting from rn. Since r(n-1) is already used, we can't use it for
>>> archiving. Certainly, the design could be improved. One approach
>>> that I've already discussed with Thomas is to use a temporary buffer
>>> instead of allocating from the heap directly. References need to be
>>> adjusted during copying. With that, we can lift the consecutive-region
>>> requirement. Since object archiving is only supported for static
>>> archiving, and with a large enough java heap it is guaranteed to
>>> successfully allocate the top free regions, changing the current
>>> design is not a high-priority task.
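>>>
>>> To make the bail-out condition concrete, here is a minimal sketch in
>>> plain Java (not the actual HotSpot/G1 code; the region bitmap and the
>>> method name are made up for illustration):
>>>
>>>     // free[i] == true means heap region i is free; the last index is
>>>     // the highest-addressed region. Returns the lowest index of the
>>>     // block of 'needed' consecutive free regions at the top, or -1 if
>>>     // a used region gets in the way (the bail-out case).
>>>     static int pickArchiveRegions(boolean[] free, int needed) {
>>>         int top = free.length - 1;
>>>         while (top >= 0 && !free[top]) top--;  // highest free region
>>>         for (int i = 0; i < needed; i++) {
>>>             int r = top - i;
>>>             if (r < 0 || !free[r]) {
>>>                 return -1;  // fragmentation: bail out of heap archiving
>>>             }
>>>         }
>>>         return top - needed + 1;
>>>     }
>>>
>>> With the layout above (rn free but r(n-1) used), a request for 3
>>> regions fails immediately, regardless of how much free space exists
>>> lower in the heap.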
>>>> If, in the end, the conclusion is that we need a heap 8x the size
>>>> of the archived objects (256mb vs 32mb), and we understand the
>>>> reason why, that’s fine. But I think we should go through that
>>>> analysis process first. In doing so we may be able to improve GC to
>>>> make fragmentation less likely.
>>>
>>> I think the situation is well understood. Please let me know if you
>>> have any additional questions; I'll try to add more information.
>>>
>>> Thanks,
>>> Jiangli
>>>
>>>> Also, do we know if Linux and Solaris have the exact same failure
>>>> mode? Or will Solaris fail more frequently than Linux with the same
>>>> heap size?
>>>>
>>>> Thanks
>>>> Ioi
>>>>
>>>>
>>>>> On Nov 26, 2018, at 3:55 PM, Jiangli Zhou <jiangli.zhou at oracle.com> wrote:
>>>>>
>>>>> Hi Ioi,
>>>>>
>>>>>> On 11/26/18 3:35 PM, Ioi Lam wrote:
>>>>>>
>>>>>> As I commented on the bug report, we should improve the error
>>>>>> message. Also, maybe we can force GC to allow the test to run
>>>>>> with less heap.
>>>>> Updating the error message sounds good to me.
>>>>>> A 3GB heap seems excessive. I was able to run the test with
>>>>>> -Xmx256M on Linux.
>>>>> Using a small heap (with only a little extra space) might still run
>>>>> into the issue in the future. As I pointed out, alignment and GC
>>>>> activity are also factors. The allocation size might also change in
>>>>> the future.
>>>>>
>>>>> An alternative approach is to fix the test to recognize the
>>>>> fragmentation issue and not report a failure in that case. I'm now
>>>>> in favor of that approach since it's more flexible. We can also
>>>>> safely set a smaller heap size (such as 256M) in the test.
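>>>>>
>>>>> Roughly, the dump check in the test could tolerate that one failure
>>>>> mode. A sketch using the test library's OutputAnalyzer (the message
>>>>> text below is only a placeholder, not the VM's actual wording):
>>>>>
>>>>>     import jdk.test.lib.process.OutputAnalyzer;
>>>>>
>>>>>     static void checkDump(OutputAnalyzer out) {
>>>>>         // Placeholder text; match whatever the improved error
>>>>>         // message actually says.
>>>>>         if (out.getOutput().contains("GC region fragmentation")) {
>>>>>             System.out.println("Dump hit heap fragmentation; tolerated.");
>>>>>             return;                  // not treated as a test failure
>>>>>         }
>>>>>         out.shouldHaveExitValue(0);  // otherwise require a clean dump
>>>>>     }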
>>>>>> Also, I don't understand what you mean by "all observed
>>>>>> allocations were done in the lower 2G range.". Why would heap
>>>>>> fragmentation be related to the location of the heap?
>>>>> In my test run, only the heap regions in the lower 2G heap range
>>>>> were used for object allocations. It's not related to the heap
>>>>> location.
>>>>>
>>>>> Thanks,
>>>>> Jiangli
>>>>>> Thanks
>>>>>>
>>>>>> - Ioi
>>>>>>
>>>>>>
>>>>>>> On 11/26/18 3:23 PM, Jiangli Zhou wrote:
>>>>>>> Hi Ioi,
>>>>>>>
>>>>>>>
>>>>>>>> On 11/26/18 2:00 PM, Ioi Lam wrote:
>>>>>>>> Hi Jiangli,
>>>>>>>>
>>>>>>>> -Xms3G will most likely fail on 32-bit platforms.
>>>>>>> We can make the change for 64-bit platforms only, since it's a
>>>>>>> 64-bit-only problem. We do not archive java objects on 32-bit
>>>>>>> platforms.
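>>>>>>>
>>>>>>> For instance, the heap-size option could be added only on 64-bit
>>>>>>> VMs (a sketch; the helper name and option list are illustrative,
>>>>>>> jdk.test.lib.Platform is the standard test-library check):
>>>>>>>
>>>>>>>     import java.util.ArrayList;
>>>>>>>     import java.util.List;
>>>>>>>     import jdk.test.lib.Platform;
>>>>>>>
>>>>>>>     static List<String> extraDumpOptions() {
>>>>>>>         List<String> opts = new ArrayList<>();
>>>>>>>         if (Platform.is64bit()) {
>>>>>>>             // Only 64-bit VMs archive java objects at dump time.
>>>>>>>             opts.add("-Xms3G");
>>>>>>>         }
>>>>>>>         return opts;
>>>>>>>     }
>>>>>>>
>>>>>>> (A 64-bit-only test could also be expressed with jtreg's
>>>>>>> "@requires vm.bits == 64" tag.)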
>>>>>>>> BTW, why would this test fail only on Solaris and not Linux?
>>>>>>>> The test doesn't specify heap size, so the initial heap size
>>>>>>>> setting is picked by Ergonomics. Can you reproduce the failure
>>>>>>>> on Linux by using the same heap size settings used by the
>>>>>>>> failed Solaris runs?
>>>>>>> The failed Solaris run didn't set the heap size explicitly; the
>>>>>>> heap size was determined by GC ergonomics, as you pointed out above.
>>>>>>> I ran the test this morning on the same Solaris SPARC machine
>>>>>>> using the same binary from the reported failure. In my test run,
>>>>>>> a very large heap (>26G) was used according to the gc region
>>>>>>> logging output, so the test didn't run into the heap fragmentation
>>>>>>> issue. All observed allocations were done in the lower 2G range.
>>>>>>>
>>>>>>> I don't think it is a Solaris-only issue. If the heap size is
>>>>>>> small enough, you can run into the issue on all supported
>>>>>>> platforms. The issue can appear intermittent due to alignment and
>>>>>>> GC activity, even with the same heap size with which the failure
>>>>>>> was reported.
>>>>>>>
>>>>>>> On a Linux x64 machine, I can force the test to fail with the
>>>>>>> fragmentation error with a 200M java heap.
>>>>>>>> I think it's better to find out the root cause than just to
>>>>>>>> mask it. The purpose of LotsOfClasses.java is to stress the
>>>>>>>> system to find out potential bugs.
>>>>>>> I think this is a test issue, not a CDS/GC issue. The test
>>>>>>> loads >20000 classes but doesn't set the java heap size. Relying
>>>>>>> on GC ergonomics to determine the 'right' heap size is incorrect
>>>>>>> in this case, since dumping objects requires consecutive gc
>>>>>>> regions. Specifying the heap size explicitly doesn't 'mask' the
>>>>>>> issue; it is the right thing to do. :)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jiangli
>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> - Ioi
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 11/26/18 1:41 PM, Jiangli Zhou wrote:
>>>>>>>>> Please review the following test fix, which sets the java heap
>>>>>>>>> size to 3G for dumping with a large number of classes.
>>>>>>>>>
>>>>>>>>> webrev: http://cr.openjdk.java.net/~jiangli/8214217/webrev.00/
>>>>>>>>>
>>>>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8214217
>>>>>>>>>
>>>>>>>>> Tested with tier1 and tier3. Also ran the test 100 times on
>>>>>>>>> solaris-sparcv9 via mach5.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Jiangli
>>>>>>>>>
>>>
>