RFR (trivial): 8214217: [TESTBUG] runtime/appcds/LotsOfClasses.java failed on solaris sparcv9
Jiangli Zhou
jiangli.zhou at oracle.com
Tue Nov 27 01:28:06 UTC 2018
Hi Ioi,
On 11/26/18 4:42 PM, Ioi Lam wrote:
> The purpose of the stress test is not to tweak the parameters so that the test will pass. It's to understand what the limitations of our system are and why they exist.
Totally agree with the above.
>
> As I mentioned in the bug report, why would we run into fragmentation when we have 96MB of free space and need only 32MB? That's the question we need to answer, not "let's just give a huge amount of heap".
During object archiving, we allocate from the highest free regions. The
allocated regions must be *consecutive*. Those design decisions were
made in the early days, when I worked with Thomas and others on the GC
team on object archiving support.
The determining factor is not the total free space in the heap, but the
number of consecutive free regions available (starting from the highest
free one) for archiving. GC activity might cause some regions at
higher addresses to be used. Since we start from the highest free region,
if we run into an already-used region while allocating for archiving, we
need to bail out.
rn: Free Region
r(n-1): Used Region
r(n-2): Free Region
...
Free Region
Used Region
...
r0: Used Region
For example, if we need 3 regions during archiving, we start allocating
from rn. Since r(n-1) is already used, we can't use it for archiving, so
we bail out.
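The bail-out behavior above can be sketched as a small, self-contained simulation. The method name and region layout below are invented for illustration; this is not the actual G1/CDS code:

```java
// Simulation of the constraint described above: archiving needs N
// *consecutive* free regions, counted downward from the highest free
// region; an already-used region ends the run. Illustration only --
// not the real HotSpot implementation.
public class ConsecutiveRegions {
    // free[i] == true means region i is free; index 0 is the lowest region.
    static int usableTopRegions(boolean[] free) {
        int i = free.length - 1;
        while (i >= 0 && !free[i]) i--;              // find the highest free region
        int count = 0;
        while (i >= 0 && free[i]) { count++; i--; }  // count the consecutive free run
        return count;
    }

    public static void main(String[] args) {
        // Layout from the example: rn free, r(n-1) used, more free
        // regions below. 4 of 6 regions are free, but only 1 is usable.
        boolean[] regions = { false, true, true, true, false, true };
        int needed = 3;
        int usable = usableTopRegions(regions);
        System.out.println("usable = " + usable);                      // usable = 1
        System.out.println(usable >= needed ? "archive" : "bail out"); // bail out
    }
}
```

This is why total free space (96MB in the bug report) is not the right measure; only the free run at the top of the heap counts.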
Certainly, the design could be improved. One approach that I've
already discussed with Thomas is to use a temporary buffer instead of
allocating from the heap directly. References need to be adjusted during
copying. With that, we can lift the consecutive-region requirement.
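A toy sketch of that idea, assuming fixed-size objects and modeling addresses as plain integers (all names here are invented for illustration, not HotSpot code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the buffer approach: objects scattered at arbitrary
// "addresses" are packed contiguously into a buffer, and every reference
// is rewritten to its referent's new offset during the copy.
public class BufferCopy {
    // Assign each old address a new, contiguous offset in the buffer.
    static Map<Integer, Integer> pack(int[] oldAddrs, int objSize) {
        Map<Integer, Integer> newAddr = new LinkedHashMap<>();
        int cursor = 0;
        for (int a : oldAddrs) {
            newAddr.put(a, cursor);
            cursor += objSize;   // fixed toy object size
        }
        return newAddr;
    }

    public static void main(String[] args) {
        int[] oldAddrs = { 100, 300, 700 };  // scattered old locations
        int[] refTo    = { 300, 700, 100 };  // each object references another
        Map<Integer, Integer> newAddr = pack(oldAddrs, 8);
        for (int i = 0; i < oldAddrs.length; i++) {
            System.out.println(oldAddrs[i] + " -> " + newAddr.get(oldAddrs[i])
                    + ", ref " + refTo[i] + " -> " + newAddr.get(refTo[i]));
        }
    }
}
```

With the objects packed into one buffer, the consecutive-region requirement disappears; the buffer can later be written out as a single contiguous range.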
Since object archiving is only supported for static archiving, and
with a large enough Java heap it is guaranteed to successfully allocate
the top free regions, changing the current design is not a high-priority task.
>
> If, at the end, the conclusion is that we need a heap 8x the size of the archived objects (256MB vs 32MB), and we understand why, that's fine. But I think we should go through that analysis first. In doing so, we may be able to improve the GC to make fragmentation less likely.
I think the situation is well understood. Please let me know if you have
any additional questions, I'll try to add more information.
Thanks,
Jiangli
>
> Also, do we know if Linux and Solaris have the exact same failure mode? Or will Solaris fail more frequently than Linux with the same heap size?
>
> Thanks
> Ioi
>
>
>> On Nov 26, 2018, at 3:55 PM, Jiangli Zhou <jiangli.zhou at oracle.com> wrote:
>>
>> Hi Ioi,
>>
>>> On 11/26/18 3:35 PM, Ioi Lam wrote:
>>>
>>> As I commented on the bug report, we should improve the error message. Also, maybe we can force a GC to allow the test to run with less heap.
>> Updating the error message sounds good to me.
>>> A 3GB heap seems excessive. I was able to run the test with -Xmx256M on Linux.
>> Using a small heap (with only a little extra space) might still run into the issue in the future. As I pointed out, alignment and GC activity are also factors. The allocation size might also change in the future.
>>
>> An alternative approach is to fix the test to recognize the fragmentation issue and not report a failure in that case. I'm now in favor of that approach since it's more flexible. We could also safely set a smaller heap size (such as 256M) in the test.
>>> Also, I don't understand what you mean by "all observed allocations were done in the lower 2G range.". Why would heap fragmentation be related to the location of the heap?
>> In my test run, only the heap regions in the lower 2G of the heap were used for object allocation. It's not related to the heap's location.
>>
>> Thanks,
>> Jiangli
>>> Thanks
>>>
>>> - Ioi
>>>
>>>
>>>> On 11/26/18 3:23 PM, Jiangli Zhou wrote:
>>>> Hi Ioi,
>>>>
>>>>
>>>>> On 11/26/18 2:00 PM, Ioi Lam wrote:
>>>>> Hi Jiangli,
>>>>>
>>>>> -Xms3G will most likely fail on 32-bit platforms.
>>>> We can make the change for 64-bit platforms only, since this is a 64-bit-only problem. We do not archive Java objects on 32-bit platforms.
>>>>> BTW, why would this test fail only on Solaris and not linux? The test doesn't specify heap size, so the initial heap size setting is picked by Ergonomics. Can you reproduce the failure on Linux by using the same heap size settings used by the failed Solaris runs?
>>>> The failed Solaris run didn't set the heap size explicitly; the heap size was determined by GC ergonomics, as you pointed out above. I ran the test this morning on the same Solaris SPARC machine, using the same binary from the reported failure. In my run, a very large heap (>26G) was used according to the GC region logging output, so the test didn't run into the heap fragmentation issue. All observed allocations were done in the lower 2G range.
>>>>
>>>> I don't think it is a Solaris-only issue. If the heap size is small enough, you could run into the issue on any supported platform. The issue can appear intermittent, due to alignment and GC activity, even with the same heap size with which the failure was reported.
>>>>
>>>> On a Linux x64 machine, I can force the test to fail with the fragmentation error using a 200M Java heap.
>>>>> I think it's better to find out the root cause than just to mask it. The purpose of LotsOfClasses.java is to stress the system to find out potential bugs.
>>>> I think this is a test issue, not a CDS/GC issue. The test loads >20000 classes but doesn't set the Java heap size. Relying on GC ergonomics to determine the 'right' heap size is incorrect in this case, since dumping objects requires consecutive GC regions. Specifying the heap size explicitly doesn't 'mask' the issue; it is the right thing to do. :)
>>>>
>>>> Thanks,
>>>> Jiangli
>>>>
>>>>> Thanks
>>>>>
>>>>> - Ioi
>>>>>
>>>>>
>>>>>> On 11/26/18 1:41 PM, Jiangli Zhou wrote:
>>>>>> Please review the following test fix, which sets the Java heap size to 3G for dumping with a large number of classes.
>>>>>>
>>>>>> webrev: http://cr.openjdk.java.net/~jiangli/8214217/webrev.00/
>>>>>>
>>>>>> bug: https://bugs.openjdk.java.net/browse/JDK-8214217
>>>>>>
>>>>>> Tested with tier1 and tier3. Also ran the test 100 times on solaris-sparcv9 via mach5.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jiangli
>>>>>>