RFR(S) 8214388 CDS dumping fails with java heap fragmentation

Jiangli Zhou jiangli.zhou at oracle.com
Mon Dec 3 18:11:45 UTC 2018


On 11/30/18 11:06 PM, Ioi Lam wrote:
>
> On 11/30/18 11:01 PM, Ioi Lam wrote:
>>
>> On 11/30/18 9:18 PM, Jiangli Zhou wrote:
>>>
>>> Hi Ioi,
>>>
>>> On 11/30/18 2:41 PM, Ioi Lam wrote:
>>>
>>>> (Resending since my last mail seems to have been mangled)
>>>>
>>>> Hi Jiangli,
>>>>
>>>> Thanks for the review!
>>>>
>>>> Updated webrev: 
>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v03-delta/
>>>>
>>>>
>>>> On 11/30/18 1:04 PM, Jiangli Zhou wrote:
>>>>> Hi Ioi,
>>>>>
>>>>> - src/hotspot/share/gc/g1/g1Arguments.cpp
>>>>>
>>>>> You missed FLAG_SET_ERGO(bool, ExplicitGCInvokesConcurrent, false);
>>>>
>>>>
>>>> As Stefan mentioned below, doing this is not necessary because I 
>>>> have added a new GC cause.
>>>>
>>>
>>> Ok. I see G1CollectedHeap::is_user_requested_concurrent_full_gc 
>>> returns false by default.
>>>>
>>>>
>>>>>
>>>>> - src/hotspot/share/gc/g1/g1CollectedHeap.cpp
>>>>>
>>>>> - src/hotspot/share/gc/shared/adaptiveSizePolicy.cpp
>>>>>
>>>>> The above files have no changes.
>>>>>
>>>>
>>>> I had multiple MQ patches that cancel each other out. Webrev still 
>>>> generates an empty diff for those files. Please ignore them.
>>>>
>>>>
>>>>> - src/hotspot/share/gc/g1/g1HeapVerifier.cpp
>>>>>
>>>>>  320     log_warning(gc, verify)("All free regions should be at the top end of the heap, but "
>>>>>  321                             " we found holes. This is probably caused by (unmovable) humongous"
>>>>>  322                             " allocations, and may lead to fragmentation while"
>>>>>  323                             " writing archive heap memory regions.");
>>>>>
>>>>> This warning is not very clear and can raise false alarms. A hole 
>>>>> in the middle of the heap causes heap fragmentation, but not 
>>>>> necessarily in the archived regions. I'd leave the warning out and 
>>>>> let the existing code in FileMapInfo::write_archive_heap_regions 
>>>>> detect archive fragmentation and report the error.
>>>>>
>>>>
>>>> This warning is about a *possibility* of fragmentation. It gives 
>>>> the user a chance to check what's going on, without having to wait 
>>>> for an actual failure to happen.
>>>>
>>>> In practice, this means if the JDK start-up code is changed in the 
>>>> future to allocate a humongous array, someone will discover this 
>>>> while building the JDK and dumping the default archive. That way, 
>>>> we can fix the JDK to avoid the use of humongous arrays.
>>>>
>>>> In fact, I just modified the code to warn against humongous regions 
>>>> even if no holes are found.
>>>>
>>>
>>> A warning implies something wrong, which is not always the case here.
>>
>>
>>> That's why it can raise false alarms. To avoid that, you could 
>>> change log_warning to log_info or log_debug, and also add a note to 
>>> the FileMapInfo::write_archive_heap_regions error output suggesting 
>>> that the user enable the logging.
>>
>>
>> If I change it to log_info, no one will see it, because no one dumps 
>> with -Xlog:cds.
>>
>> Our current implementation doesn't handle humongous allocations 
>> properly, and may lead to fragmentation. NO ONE will see this warning 
>> unless the JDK is modified to allocate and store humongous arrays 
>> during bootstrapping. If that happens we want to know it immediately.
>>
>> I fail to see how this can be considered falsely alarming. I'll 
>> gladly provide personal 24-hour customer support if someone comes and 
>> claims that they have been falsely alarmed :-)

I was suggesting changing the FileMapInfo::write_archive_heap_regions 
error to suggest that users enable -Xlog if they want to find out what 
causes the error. As this is a temporary workaround and humongous 
allocations during dump time are rare (they probably would not happen at 
all for most usages), I'm okay with what you have. Let's get this in so 
we can move on to other tasks.
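
For anyone hitting that error: more detail is already available at dump 
time by enabling logging, e.g. (the exact tag selection here is just 
illustrative):

    java -Xshare:dump -Xlog:cds=debug,gc+region=trace ...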

>>
>>
>>
>>>
>>>>
>>>>
>>>>> - src/hotspot/share/gc/shared/gcCause.cpp
>>>>>
>>>>> I'd suggest renaming _cds_dump to _archive_time_gc. The word 
>>>>> 'archive' has been adopted for use in GC code.
>>>>>
>>>>> - src/hotspot/share/memory/filemap.cpp
>>>>>
>>>>
>>>> Done
>>>>
>>>>
>>>>> I see no change for line 686?
>>>>>
>>>>> - src/hotspot/share/memory/heapShared.cpp
>>>>>
>>>>
>>>> That's the fix for line 686: a space character was missing after 
>>>> the word "fragmentation."
>>>>
>>>>
>>>>> Please revert the renaming of 
>>>>> HeapShared::check_closed_archive_heap_region_object. The name 
>>>>> HeapShared::check_closed_archive_heap_region_object_class is not 
>>>>> correct: we check the objects in the graph, not the classes. It's 
>>>>> better to use a separate bug and changeset for the bug fix in this 
>>>>> function, so it's not mixed with the change for 8214388, for better 
>>>>> back-tracking.
>>>>>
>>>>
>>>> I'll revert the whole change and create a new bug.
>>>>
>>>> I found the name check_closed_archive_heap_region_object confusing 
>>>> because it says "object" but passes a class. I think it's better to 
>>>> pass the object, and then have this function read the class from 
>>>> the object. I'll send a separate RFR.
>>>>
>>>>
>>>>> - src/hotspot/share/memory/metaspaceShared.cpp
>>>>>
>>>>> The G1-specific code needs to be wrapped with #if INCLUDE_G1GC.
>>>>>
>>>>> 391           log_warning(cds)("[line %d] extra interned string ignored; size too large: %d",
>>>>> 392                            reader.last_line_no(), utf8_length);
>>>>>
>>>>> You can use cds+heap for object archiving related logging.
>>>>>
>>>>
>>>> Done
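
FWIW, HotSpot's Unified Logging macros take a list of tags, so the 
cds+heap suggestion amounts to changing the tag set on the call quoted 
above, roughly:

    // Tagged with both cds and heap, so it is selected by -Xlog:cds+heap
    // as well as by either tag alone.
    log_warning(cds, heap)("[line %d] extra interned string ignored; size too large: %d",
                           reader.last_line_no(), utf8_length);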
>>>>
>>>>
>>>>> - 1718     if (UseG1GC && HeapShared::is_heap_object_archiving_allowed()) {
>>>>>
>>>>> As Stefan also pointed out, you don't need UseG1GC check here. 
>>>>> HeapShared::is_heap_object_archiving_allowed already includes it.
>>>>>
>>>>
>>>> Fixed
>>>>
>>>>
>>>>> G1HeapVerifier::verify_ready_for_archiving() is called from two 
>>>>> places. One is in HeapShared::archive_java_heap_objects and the 
>>>>> other is in metaspaceShared.cpp. The duplication should be 
>>>>> eliminated.
>>>>
>>>>
>>>> I removed the one inside metaspaceShared.cpp
>>>>
>>>>
>>>>> I also question the need for 
>>>>> G1HeapVerifier::verify_ready_for_archiving(). As there is already 
>>>>> a more efficient check in FileMapInfo::write_archive_heap_regions 
>>>>> to detect fragmented archive regions, verify_ready_for_archiving 
>>>>> does not add additional value. It's also a more expensive check.
>>>>>
>>>>
>>>> It's supposed to provide better logging for diagnosing future 
>>>> fragmentation (much easier to read than tracing all the gc+region 
>>>> logs). I don't know about you, but for me, trying to find out why 
>>>> fragmentation happened in JDK-8214217 was pretty damn tedious!
>>>>
>>>> It's not expensive at all since it just scans the list of regions 
>>>> once.
>>>>
>>>
>>> Please see above suggestion if more info for fragmentation is needed.
>>>
>>> After JDK 12, approach #2 described in the RFE can be 
>>> implemented. It has no issues even when there are humongous regions.
>>>
>>
>> We can remove this logging AFTER fixing the problems with humongous 
>> regions.
>>
>>
>>> I'm still pondering the question that Stefan raised in the Slack 
>>> channel about how much GC work goes on during dump time. 
>>> Although only Strings, mirrors and a few other types of objects are 
>>> allocated currently, if the number of allocated objects is large and 
>>> the Java heap is inadequately small, what are the side-effects of 
>>> this change (how much slowdown would be introduced)?
>>>
>>
>> I ran the LotsOfClasses test with and without the change. Over 30000 
>> classes are loaded, and there are over 32MB of archived objects. The 
>> whole execution time went from 25 seconds to 26 seconds, which is 
>> barely noticeable.
>>
>
> Just to add one thing:
>
> What IS noticeable is -- before the fix, LotsOfClasses would fail with 
> fragmentation even with 1308MB of heap when 24 GC threads were used. 
> Now LotsOfClasses can run successfully with only about 70MB of heap.

Ok, thanks for confirming.

Thanks,
Jiangli
>
>
>
>> Thanks
>>
>> - Ioi
>>
>>
>>
>>>
>>> Thanks,
>>> Jiangli
>>>>
>>>>
>>>> Thanks
>>>>
>>>> - Ioi
>>>>
>>>>
>>>>> Thanks,
>>>>> Jiangli
>>>>>
>>>>> On 11/30/18 11:02 AM, Ioi Lam wrote:
>>>>>> Hi Stefan,
>>>>>>
>>>>>> Thanks for the review. Here's an updated patch:
>>>>>>
>>>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02/ 
>>>>>>
>>>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02-delta/ 
>>>>>>
>>>>>>
>>>>>> Please see responses to your comments below.
>>>>>>
>>>>>> I also added code to handle humongous strings that can be
>>>>>> allocated during dump time, and to make sure they are GC'ed (so
>>>>>> that no humongous regions remain) before we start dumping the heap.
>>>>>>
>>>>>> I'll add one more test for the handling of humongous regions
>>>>>> that exist while the heap is being dumped, but that will require
>>>>>> a little hacking with JVMTI. I'll post that later :-)
>>>>>>
>>>>>>
>>>>>> On 11/30/18 3:17 AM, Stefan Johansson wrote:
>>>>>>> Hi Ioi,
>>>>>>>
>>>>>>> I mostly looked at the GC interaction; some comments below.
>>>>>>>
>>>>>>> On 2018-11-30 02:07, Ioi Lam wrote:
>>>>>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v01/ 
>>>>>>>>
>>>>>>> src/hotspot/share/gc/g1/g1CollectedHeap.cpp
>>>>>>> -------------------------------------------
>>>>>>>  325 #if INCLUDE_CDS
>>>>>>>  326   if (DumpSharedSpaces && HeapShared::is_heap_object_archiving_allowed()) {
>>>>>>>  327     // See G1HeapVerifier::verify_ready_for_archiving()
>>>>>>>  328     // This should not happen during normal operation of -Xshare:dump, so let's give a warning.
>>>>>>>  329     log_warning(gc)("Avoid allocating humongous objects during -Xshare:dump (" SIZE_FORMAT
>>>>>>>  330                     " bytes) - may cause fragmentation",
>>>>>>>  331                     word_size * HeapWordSize);
>>>>>>>  332   }
>>>>>>>  333 #endif
>>>>>>>
>>>>>>> I think this warning should be left out; you still have output 
>>>>>>> in the verification that tells the user what is problematic.
>>>>>>>
>>>>>>
>>>>>> Removed.
>>>>>>
>>>>>>> /src/hotspot/share/gc/g1/g1HeapVerifier.cpp
>>>>>>> -------------------------------------------
>>>>>>> What do you think about extending the verification with one more 
>>>>>>> state, maybe _fatal_hole? When iterating the regions you can 
>>>>>>> then look for holes not caused by humongous regions. If such 
>>>>>>> holes are found, log an error and fail.
>>>>>>>
>>>>>>> You can then turn the current log_error into a log_warning, 
>>>>>>> letting the user know there have been humongous allocations 
>>>>>>> which can cause problems during dump time.
>>>>>>>
>>>>>>>  314   // turn on the following assert only if we disallow humongous allocations during
>>>>>>>  315   // dump time.
>>>>>>>  316   //assert(!cl.has_holes(), "must not have holes");
>>>>>>>
>>>>>>> I think you should remove these lines.
>>>>>>>
>>>>>>
>>>>>> I modified the verification loop as you suggested, and changed 
>>>>>> the assert to check that all holes must be caused by humongous 
>>>>>> regions.
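
FWIW, the verification loop being described would look roughly like the 
following sketch (the closure and accessor names are illustrative, not 
necessarily the exact patch; context is g1HeapVerifier.cpp):

    // Walk all regions from bottom to top. Any occupied region that
    // appears above a free one is a "hole"; the assert tolerates holes
    // only when a humongous region can explain them.
    class VerifyReadyForArchivingRegionClosure : public HeapRegionClosure {
      bool _seen_free;
      bool _has_holes;
      bool _has_humongous;
    public:
      VerifyReadyForArchivingRegionClosure()
        : _seen_free(false), _has_holes(false), _has_humongous(false) {}
      virtual bool do_heap_region(HeapRegion* hr) {
        if (hr->is_empty()) {
          _seen_free = true;
        } else {
          if (_seen_free) {
            _has_holes = true;       // occupied region above a free one
          }
          if (hr->is_humongous()) {
            _has_humongous = true;   // humongous regions cannot move
          }
        }
        return false;                // false means: keep iterating
      }
      bool has_holes()     const { return _has_holes; }
      bool has_humongous() const { return _has_humongous; }
    };

    // In G1HeapVerifier::verify_ready_for_archiving():
    VerifyReadyForArchivingRegionClosure cl;
    G1CollectedHeap::heap()->heap_region_iterate(&cl);
    assert(!cl.has_holes() || cl.has_humongous(),
           "holes must be caused by humongous regions");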
>>>>>>
>>>>>>> src/hotspot/share/gc/shared/adaptiveSizePolicy.cpp
>>>>>>> --------------------------------------------------
>>>>>>>  184 #if INCLUDE_CDS
>>>>>>>  185   if (DumpSharedSpaces && HeapShared::is_heap_object_archiving_allowed()) {
>>>>>>>  186     // See G1HeapVerifier::verify_ready_for_archiving()
>>>>>>>  187     return 1;
>>>>>>>  188   }
>>>>>>>  189 #endif
>>>>>>>  190
>>>>>>>
>>>>>>> I think this decision should be moved to the g1 argument 
>>>>>>> parsing. In G1Arguments::initialize() you can add this after the 
>>>>>>> initial setting of ParallelGCThreads.
>>>>>>>
>>>>>>>   // When dumping the CDS archive we want to reduce fragmentation by
>>>>>>>   // triggering a full collection. To get as low fragmentation as
>>>>>>>   // possible we only use one worker thread.
>>>>>>>   if (DumpSharedSpaces) {
>>>>>>>     FLAG_SET_ERGO(uint, ParallelGCThreads, 1);
>>>>>>>     FLAG_SET_ERGO(bool, ExplicitGCInvokesConcurrent, false);
>>>>>>>   }
>>>>>>>
>>>>>>
>>>>>> Done.
>>>>>>
>>>>>>> As you see, I also include setting ExplicitGCInvokesConcurrent 
>>>>>>> to false, because if that flag is set a full GC will be turned 
>>>>>>> into a concurrent one.
>>>>>>>
>>>>>>> src/hotspot/share/memory/metaspaceShared.cpp
>>>>>>> --------------------------------------------
>>>>>>> 1706 #if INCLUDE_CDS_JAVA_HEAP
>>>>>>> 1707     if (UseG1GC && HeapShared::is_heap_object_archiving_allowed()) {
>>>>>>> 1708       // Avoid fragmentation while archiving heap objects.
>>>>>>> 1709       Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(true);
>>>>>>> 1710       Universe::heap()->collect(GCCause::_java_lang_system_gc);
>>>>>>> 1711       Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(false);
>>>>>>> 1712       G1HeapVerifier::verify_ready_for_archiving();
>>>>>>> 1713     }
>>>>>>> 1714 #endif
>>>>>>>
>>>>>>> Do you need the #if here? is_heap_object_archiving_allowed() 
>>>>>>> only returns true when INCLUDE_CDS_JAVA_HEAP is true, and the 
>>>>>>> check for UseG1GC is also done within 
>>>>>>> is_heap_object_archiving_allowed().
>>>>>>>
>>>>>>> Regarding triggering the GC, the calls to soft_ref_policy are 
>>>>>>> not needed unless you care about clearing soft references.
>>>>>>
>>>>>> I removed the #if. I kept the clearing of soft refs -- soft refs 
>>>>>> are not archived, so clearing them would give more free space for 
>>>>>> archiving.
>>>>>>
>>>>>>> An even nicer solution would be to add a new GCCause that is 
>>>>>>> only used for CDS dumping. That way you get a better message in 
>>>>>>> the GC log than "System.gc()", and you can also skip the setting 
>>>>>>> of ExplicitGCInvokesConcurrent that I mentioned above, since the 
>>>>>>> new GCCause won't be affected by that flag.
>>>>>>>
>>>>>>
>>>>>> I added GCCause::_cds_dump and skipped the setting of 
>>>>>> ExplicitGCInvokesConcurrent.
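
FWIW, assembled from the snippets above, the final dump-time trigger 
would look roughly like this (note the cause was later renamed to 
_archive_time_gc per the review above; the cause string is illustrative):

    // gcCause.hpp: a dedicated cause for the CDS dump-time full GC.
    _archive_time_gc,

    // gcCause.cpp: its log message, so the GC log doesn't say "System.gc()".
    case _archive_time_gc:
      return "Full GC for -Xshare:dump";

    // metaspaceShared.cpp, before writing the archive heap regions.
    // Soft refs are not archived, so clear them to maximize free space,
    // then compact with a full GC. Using a dedicated cause also means
    // ExplicitGCInvokesConcurrent is never consulted.
    if (HeapShared::is_heap_object_archiving_allowed()) {
      Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(true);
      Universe::heap()->collect(GCCause::_archive_time_gc);
      Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(false);
      G1HeapVerifier::verify_ready_for_archiving();
    }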
>>>>>>
>>>>>> Thanks
>>>>>> - Ioi
>>>>>>
>>>>>>> Thanks,
>>>>>>> Stefan
>>>>>>>
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8214388
>>>>>>>>
>>>>>>>>
>>>>>>>> Symptom:
>>>>>>>> ========
>>>>>>>>
>>>>>>>> "java -Xshare:dump" would intermittently fail with
>>>>>>>>
>>>>>>>>     Unable to write archive heap ... due to fragmentation.
>>>>>>>>
>>>>>>>> This usually happens when you try to dump many classes (e.g., 
>>>>>>>> 10000) with a relatively small heap (e.g., 1g) and a lot of GC 
>>>>>>>> threads (e.g., 24).
>>>>>>>>
>>>>>>>> (Example use case -- Eclipse IDE loads 15,000 classes with 
>>>>>>>> 512MB heap.)
>>>>>>>>
>>>>>>>> When GC happens during class loading, some old G1 regions may 
>>>>>>>> be placed at the top end of the heap (due to the large number 
>>>>>>>> of GC threads).
>>>>>>>>
>>>>>>>> Later, when writing the archived heap, G1 tries to allocate 
>>>>>>>> contiguous regions from the top end of the heap. This would 
>>>>>>>> fail due to the presence of those old regions.
>>>>>>>>
>>>>>>>>
>>>>>>>> Fix:
>>>>>>>> ====
>>>>>>>>
>>>>>>>> As suggested by Stefan Johansson, we run a full GC with a 
>>>>>>>> single GC thread. This guarantees that all old blocks will be 
>>>>>>>> moved to the bottom end of the heap (with multiple workers, 
>>>>>>>> each thread compacts into its own set of regions, which can 
>>>>>>>> leave old regions scattered near the top).
>>>>>>>>
>>>>>>>> Because there's no API for specifying the number of GC threads 
>>>>>>>> dynamically, and CDS dump time doesn't allocate lots of 
>>>>>>>> objects, I have statically forced the number of threads to 1 
>>>>>>>> in AdaptiveSizePolicy::calc_active_workers during CDS dump time.
>>>>>>>>
>>>>>>>> (This seems like a more direct way than assigning 
>>>>>>>> ParallelGCThreads ...)
>>>>>>>>
>>>>>>>>
>>>>>>>> Notes:
>>>>>>>> ======
>>>>>>>>
>>>>>>>> 1. Humongous regions cannot move. However, currently we don't do
>>>>>>>>     humongous allocations during CDS dump, so we should be fine. I have
>>>>>>>>     added diagnostic warnings so that if fragmentation does happen in
>>>>>>>>     the future, the user can find out why.
>>>>>>>>
>>>>>>>> 2. Fixed a minor bug in
>>>>>>>>     HeapShared::check_closed_archive_heap_region_object_class.
>>>>>>>>
>>>>>>>> 3. Fixed a bug in MetaspaceShared::read_extra_data, where the
>>>>>>>>     symbols/strings would be lost due to GC.
>>>>>>>>
>>>>>>>> 4. Added a stress test that successfully archives about 18MB of
>>>>>>>>     objects with -Xmx64m. This used to fail even with -Xmx512m on a
>>>>>>>>     Solaris box.
>>>>>>>>
>>>>>>>> 5. With default CDS archive generation during the JDK build, -Xmx128m
>>>>>>>>     is used. Before this fix, the EDEN region lived at the top of the
>>>>>>>>     heap during CDS dump time, and we ended up with a 2MB gap between
>>>>>>>>     the archive regions and the top of the heap. Because the archive
>>>>>>>>     regions cannot move, at run time using CDS would reduce the max
>>>>>>>>     humongous allocation by 2MB.
>>>>>>>>
>>>>>>>>     With this fix, the archive regions are now placed at the very top
>>>>>>>>     of the heap, so the gap no longer exists.
>>>>>>>>
>>>>>>>>
>>>>>>>> Tests:
>>>>>>>> ======
>>>>>>>>
>>>>>>>> Running hs-tiers{1-6} for sanity.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> - Ioi
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>


