RFR(S) 8214388 CDS dumping fails with java heap fragmentation

Sat Dec 1 07:06:48 UTC 2018

On 11/30/18 11:01 PM, Ioi Lam wrote:
>
> On 11/30/18 9:18 PM, Jiangli Zhou wrote:
>>
>> Hi Ioi,
>>
>> On 11/30/18 2:41 PM, Ioi Lam wrote:
>>
>>> (Resending since my last mail seems to have been mangled)
>>>
>>> Hi Jiangli,
>>>
>>> Thanks for the review!
>>>
>>> Updated webrev: 
>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v03-delta/
>>>
>>>
>>> On 11/30/18 1:04 PM, Jiangli Zhou wrote:
>>>> Hi Ioi,
>>>>
>>>> - src/hotspot/share/gc/g1/g1Arguments.cpp
>>>>
>>>> You missed FLAG_SET_ERGO(bool, ExplicitGCInvokesConcurrent, false);
>>>
>>>
>>> As Stefan mentioned below, doing this is not necessary because I 
>>> have added a new GC cause.
>>>
>>
>> Ok. I see G1CollectedHeap::is_user_requested_concurrent_full_gc 
>> returns false for default.
>>>
>>>
>>>>
>>>> - src/hotspot/share/gc/g1/g1CollectedHeap.cpp
>>>>
>>>> - src/hotspot/share/gc/shared/adaptiveSizePolicy.cpp
>>>>
>>>> Above files have no change.
>>>>
>>>
>>> I had multiple MQ patches that cancel each other out. Webrev still 
>>> generates an empty diff for those files. Please ignore them.
>>>
>>>
>>>> - src/hotspot/share/gc/g1/g1HeapVerifier.cpp
>>>>
>>>>  320     log_warning(gc, verify)("All free regions should be at the 
>>>> top end of the heap, but "
>>>>  321                             " we found holes. This is probably 
>>>> caused by (unmovable) humongous"
>>>>  322                             " allocations, and may lead to 
>>>> fragmentation while"
>>>>  323                             " writing archive heap memory 
>>>> regions.");
>>>>
>>>> This warning is not very clear and can be false alarming. The hole 
>>>> in the middle of the heap causes heap fragmentation, but not 
>>>> necessarily in the archived regions. I'd leave the warning out and 
>>>> let the existing code in FileMapInfo::write_archive_heap_regions to 
>>>> detect archive fragmentation and report error.
>>>>
>>>
>>> This warning is about a *possibility* of fragmentation. It gives the 
>>> user a chance to check what's going on, without having to wait for 
>>> an actual failure to happen.
>>>
>>> In practice, this means if the JDK start-up code is changed in the 
>>> future to allocate a humongous array, someone will discover this 
>>> while building the JDK and dumping the default archive. That way, we 
>>> can fix the JDK to avoid the use of humongous arrays.
>>>
>>> In fact, I just modified the code to warn against humongous regions 
>>> even if no holes are found.
>>>
>>
>> A warning implies something wrong, which is not always the case here.
>
>
>> That's why it can be false alarming. To avoid that, you can change 
>> log_warning to be log_info or log_debug, also add info in 
>> FileMapInfo::write_archive_heap_regions error output to suggest 
>> enabling the logging.
>
>
> If I change to log_info no one will see it, because no one dumps with 
> -Xlog:cds.
>
> Our current implementation doesn't handle humongous allocations 
> properly, and may lead to fragmentation. NO ONE will see this warning 
> unless the JDK is modified to allocate and store humongous arrays 
> during bootstraping. If that happens we want to know it immediately.
>
> I fail to see how this can be considered falsely alarming. I'll gladly 
> provide personal 24-hour customer support if someone comes and claim 
> that they have been falsely alarmed :-)
>
>
>
>>
>>>
>>>
>>>> - src/hotspot/share/gc/shared/gcCause.cpp
>>>>
>>>> I'd suggest renaming _cds_dump to _archive_time_gc. The word 
>>>> 'archive' has been adapted to be used in GC code.
>>>>
>>>> - src/hotspot/share/memory/filemap.cpp
>>>>
>>>
>>> Done
>>>
>>>
>>>> I see no change for line 686?
>>>>
>>>> - src/hotspot/share/memory/heapShared.cpp
>>>>
>>>
>>> A space character is missing after the word "fragmentation."
>>>
>>>
>>>> Please revert the renaming for 
>>>> HeapShared::check_closed_archive_heap_region_object. 
>>>> HeapShared::check_closed_archive_heap_region_object_class is not 
>>>> correct. We check the objects in the graph, not the classes. It's 
>>>> better to use a separate bug and changeset for the bug fix in this 
>>>> function. So it's not mixed with the change for 8214388 for better 
>>>> back-tracking.
>>>>
>>>
>>> I'll revert the whole change and create a new bug.
>>>
>>> I found the name check_closed_archive_heap_region_object confusing 
>>> because it says "object" but passes a class. I think it's better to 
>>> pass the object, and then have this function read the class from the 
>>> object. I'll send a separate RFR.
>>>
>>>
>>>> - src/hotspot/share/memory/metaspaceShared.cpp
>>>>
>>>> The G1 specific code needs to be wrapped with #if INCLUDE_G1GC.
>>>>
>>>> 391           log_warning(cds)("[line %d] extra interned string 
>>>> ignored; size too large: %d",
>>>> 392                            reader.last_line_no(), utf8_length);
>>>>
>>>> You can use cds+heap for object archiving related logging.
>>>>
>>>
>>> Done
>>>
>>>
>>>> - 1718     if (UseG1GC && 
>>>> HeapShared::is_heap_object_archiving_allowed()) {
>>>>
>>>> As Stefan also pointed out, you don't need UseG1GC check here. 
>>>> HeapShared::is_heap_object_archiving_allowed already includes it.
>>>>
>>>
>>> Fixed
>>>
>>>
>>>> G1HeapVerifier::verify_ready_for_archiving() is called into two 
>>>> places. One is in HeapShared::archive_java_heap_objects and the 
>>>> other one is in HeapShared::archive_java_heap_objects. The 
>>>> duplication should be eliminated.
>>>
>>>
>>> I removed the one inside metaspaceShared.cpp
>>>
>>>
>>>> I also have the question for the need for 
>>>> G1HeapVerifier::verify_ready_for_archiving(). As there is already 
>>>> more efficient check in FileMapInfo::write_archive_heap_regions to 
>>>> detect fragmented archive regions, verify_ready_for_archiving does 
>>>> not add additional values. It's also a more expensive check.
>>>>
>>>
>>> It's suppose to provide better logging for diagnosing future 
>>> fragmentations (much easier to read than tracing all the gc+region 
>>> logs.). I don't know about you, but for me trying to find out why 
>>> fragmentation happened in JDK-8214217 was pretty damn tedious!
>>>
>>> It's not expensive at all since it just scans the list of regions once.
>>>
>>
>> Please see above suggestion if more info for fragmentation is needed.
>>
>> After JDK 12, the approach #2 described in the RFE can be 
>> implemented. It has no issue if there is humongous region.
>>
>
> We can remove these logging AFTER fixing the problems with humongous 
> regions.
>
>
>> I'm still pounding on the question that Stefan raised in Slack 
>> channel about how much GC work is going on during dump time. Although 
>> only Strings, mirrors and other few types of objects being allocated 
>> currently, if the number of allocated objects is large and the java 
>> heap is inadequately small, what are the side-effects of this change 
>> (how much slow down would be introduced)?
>>
>
> I ran LotsOfClasses test with and without the change. over 30000 
> classes are loaded. There are over 32MB of archived objects. The whole 
> execution time slowed from 25 seconds to 26 seconds. It's barely 
> noticeable.
>

Just to add one thing:

What IS noticeable is -- before the fix, LotsOfClasses would fail with 
fragmentation with 1308MB of heap when 24 GC threads are used. Now 
LotsOfClasses can run successfully with only about 70MB of heap.

> Thanks
>
> - Ioi
>
>
>
>>
>> Thanks,
>> Jiangli
>>>
>>>
>>> Thanks
>>>
>>> - Ioi
>>>
>>>
>>>> Thanks,
>>>> Jiangli
>>>>
>>>> On 11/30/18 11:02 AM, Ioi Lam wrote:
>>>>> Hi Stefan,
>>>>>
>>>>> Thanks for the review. Here's an updated patch:
>>>>>
>>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02/ 
>>>>>
>>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02-delta/ 
>>>>>
>>>>>
>>>>> Please see responses to your comments below.
>>>>>
>>>>> I also added code to handle humongous strings that can be
>>>>> allocated during dump time, and make sure they are GC'ed (and thus
>>>>> no humongous regions) before we starting dumping the heap.
>>>>>
>>>>> I'll add one more test to test the handling of humongous regions
>>>>> that exist while the heap is being dumped, but that will require
>>>>> a little hacking with JVMTI. I'll post that later :-)
>>>>>
>>>>>
>>>>> On 11/30/18 3:17 AM, Stefan Johansson wrote:
>>>>>> Hi Ioi,
>>>>>>
>>>>>> I mostly looked at the GC interaction, some comments below.
>>>>>>
>>>>>> On 2018-11-30 02:07, Ioi Lam wrote:
>>>>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v01/ 
>>>>>>>
>>>>>> src/hotspot/share/gc/g1/g1CollectedHeap.cpp
>>>>>> -------------------------------------------
>>>>>>  325 #if INCLUDE_CDS
>>>>>>  326   if (DumpSharedSpaces && 
>>>>>> HeapShared::is_heap_object_archiving_allowed()) {
>>>>>>  327     // See G1HeapVerifier::verify_ready_for_archiving()
>>>>>>  328     // This should not happen during normal operation of 
>>>>>> -Xshare:dump, so let's give a warning.
>>>>>>  329     log_warning(gc)("Avoid allocating humongous objects 
>>>>>> during -Xshare:dump (" SIZE_FORMAT
>>>>>>  330                     " bytes) - may cause fragmentation",
>>>>>>  331                     word_size * HeapWordSize);
>>>>>>  332   }
>>>>>>  333 #endif
>>>>>>
>>>>>> I think this warning should be left out, you still have output in 
>>>>>> the verification that tells the user what is problematic.
>>>>>>
>>>>>
>>>>> Removed.
>>>>>
>>>>>> /src/hotspot/share/gc/g1/g1HeapVerifier.cpp
>>>>>> -------------------------------------------
>>>>>> What do you think about extending the verification with one more 
>>>>>> state, maybe _fatal_hole. When iterating the regions you can then 
>>>>>> look for holes not caused by humongous. If such holes are found 
>>>>>> log an error and fail.
>>>>>>
>>>>>> You can then turn the current log_error into a log_warning 
>>>>>> letting the user know there has been humongous allocations which 
>>>>>> can cause problems during dump time.
>>>>>>
>>>>>>  314   // turn on the following assert only if we disallow 
>>>>>> humongous allocations during
>>>>>>  315   // dump time.
>>>>>>  316   //assert(!cl.has_holes(), "must not have holes");
>>>>>>
>>>>>> I think you should remove these lines.
>>>>>>
>>>>>
>>>>> I modified the verification loop as you suggested, and changed the 
>>>>> assert to check that all holes must be caused by humongous regions.
>>>>>
>>>>>> src/hotspot/share/gc/shared/adaptiveSizePolicy.cpp
>>>>>> --------------------------------------------------
>>>>>>  184 #if INCLUDE_CDS
>>>>>>  185   if (DumpSharedSpaces && 
>>>>>> HeapShared::is_heap_object_archiving_allowed()) {
>>>>>>  186     // See G1HeapVerifier::verify_ready_for_archiving()
>>>>>>  187     return 1;
>>>>>>  188   }
>>>>>>  189 #endif
>>>>>>  190
>>>>>>
>>>>>> I think this decision should be moved to the g1 argument parsing. 
>>>>>> In G1Arguments::initialize() you can add this after the initial 
>>>>>> setting of ParallelGCThreads.
>>>>>>
>>>>>>   // When dumping the CDS archive we want to reduce fragmentation by
>>>>>>   // triggering a full collection. To get as low fragmentation as
>>>>>>   // possible we only use one worker thread.
>>>>>>   if (DumpSharedSpaces) {
>>>>>>     FLAG_SET_ERGO(uint, ParallelGCThreads, 1);
>>>>>>     FLAG_SET_ERGO(bool, ExplicitGCInvokesConcurrent, false);
>>>>>>   }
>>>>>>
>>>>>
>>>>> Done.
>>>>>
>>>>>> As you see I also include setting ExplicitGCInvokesConcurrent to 
>>>>>> false, because if that flag is set a full gc will be turned into 
>>>>>> a concurrent one.
>>>>>>
>>>>>> src/hotspot/share/memory/metaspaceShared.cpp
>>>>>> --------------------------------------------
>>>>>> 1706 #if INCLUDE_CDS_JAVA_HEAP
>>>>>> 1707     if (UseG1GC && 
>>>>>> HeapShared::is_heap_object_archiving_allowed()) {
>>>>>> 1708       // Avoid fragmentation while archiving heap objects.
>>>>>> 1709 
>>>>>> Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(true);
>>>>>> 1710 Universe::heap()->collect(GCCause::_java_lang_system_gc);
>>>>>> 1711 
>>>>>> Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(false);
>>>>>> 1712       G1HeapVerifier::verify_ready_for_archiving();
>>>>>> 1713     }
>>>>>> 1714 #endif
>>>>>>
>>>>>> Do you need the #if here, is_heap_object_archiving_allowed() only 
>>>>>> returns true when INCLUDE_CDS_JAVA_HEAP is true. Also the check 
>>>>>> for UseG1GC is also done within is_heap_object_archiving_allowed().
>>>>>>
>>>>>> Regarding triggering the GC, the calls to the soft_ref_policy is 
>>>>>> not needed unless you care about clearing soft references.
>>>>>
>>>>> I removed the #if. I kept the clearing of soft refs -- soft refs 
>>>>> are not archived, so clearing them would give more free space for 
>>>>> archiving.
>>>>>
>>>>>> An even nicer solution would be to add a new GCCause that is only 
>>>>>> used for CDS dumping, that way you can also get a better message 
>>>>>> in the GC log than "System.gc()" and you can also skip the 
>>>>>> setting of ExplicitGCInvokesConcurrent that I mentioned above. 
>>>>>> Since the new GCCause won't be affected by that flag.
>>>>>>
>>>>>
>>>>> I added GCCause::_cds_dump and skipped the setting of 
>>>>> ExplicitGCInvokesConcurrent
>>>>>
>>>>> Thanks
>>>>> - Ioi
>>>>>
>>>>>> Thanks,
>>>>>> Stefan
>>>>>>
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8214388
>>>>>>>
>>>>>>>
>>>>>>> Symptom:
>>>>>>> ========
>>>>>>>
>>>>>>> "java -Xshare:dump" would intermittently fail with
>>>>>>>
>>>>>>>     Unable to write archive heap ... due to fragmentation.
>>>>>>>
>>>>>>> This usually happens when you try to dump many classes (e.g. 
>>>>>>> 10000) with
>>>>>>> a relatively small heap (e.g., 1g) with a lot of GC threads 
>>>>>>> (e.g., 24).
>>>>>>>
>>>>>>> (Example use case -- Eclipse IDE loads 15,000 classes with 512MB 
>>>>>>> heap.)
>>>>>>>
>>>>>>> When GC happens during class loading, some old G1 regions  may 
>>>>>>> be placed
>>>>>>> at the top end of the heap (due to large number of GC threads).
>>>>>>>
>>>>>>> Later, when writing the archived heap, G1 tries to allocate 
>>>>>>> contiguous
>>>>>>> regions from the top end of the heap. This would fail due to the 
>>>>>>> presence
>>>>>>> of those old regions.
>>>>>>>
>>>>>>>
>>>>>>> Fix:
>>>>>>> ====
>>>>>>>
>>>>>>> As suggested by Stefan Johansson, we run a full GC with a single GC
>>>>>>> thread. This guarantees that all old blocks will be moved to the 
>>>>>>> bottom
>>>>>>> end of the heap.
>>>>>>>
>>>>>>> Because there's no API for specifying the number of GC threads 
>>>>>>> dynamically,
>>>>>>> and CDS dump time doesn't allocates lots of objects, I have 
>>>>>>> statically forced
>>>>>>> the number of threads to 1 in 
>>>>>>> AdaptiveSizePolicy::calc_active_workers during
>>>>>>> CDS dump time.
>>>>>>>
>>>>>>> (This seems like a more direct way than assigning 
>>>>>>> ParallelGCThreads ...)
>>>>>>>
>>>>>>>
>>>>>>> Notes:
>>>>>>> ======
>>>>>>>
>>>>>>> 1. Humongous regions cannot move. However, currently we don't do 
>>>>>>> humongous
>>>>>>>     allocations during CDS dump, so we should be fine. I have 
>>>>>>> added diagnostics
>>>>>>>     warnings so if fragmentation does happen in the future, the 
>>>>>>> user can
>>>>>>>     find out why.
>>>>>>>
>>>>>>> 2. Fixed a minor bug in 
>>>>>>> HeapShared::check_closed_archive_heap_region_object_class
>>>>>>>
>>>>>>> 3. Fixed a bug in MetaspaceShared::read_extra_data, where the 
>>>>>>> symbol/strings
>>>>>>>     would be lost due to GC.
>>>>>>>
>>>>>>> 4. Added stress test to successfully archive about 18MB of 
>>>>>>> objects with -Xmx64m.
>>>>>>>     This used to fail even with -Xmx512m on a Solaris box.
>>>>>>>
>>>>>>> 5. With default CDS archive generation during JDK build time, 
>>>>>>> -Xmx128m is used.
>>>>>>>     Before this fix, the EDEN region lives at the top of the 
>>>>>>> heap during CDS dump
>>>>>>>     time, and we end up with a 2MB gap between the archive 
>>>>>>> regions and the top
>>>>>>>     of the heap. Because the archive regions cannot move, at run 
>>>>>>> time, using CDS
>>>>>>>     would reduce the max humongous allocation by 2MB.
>>>>>>>
>>>>>>>     With this fix, the archive regions are now placed at the 
>>>>>>> very top of the heap,
>>>>>>>     so the gap no longer exists.
>>>>>>>
>>>>>>>
>>>>>>> Tests:
>>>>>>> ======
>>>>>>>
>>>>>>> Running hs-tiers{1-6} for sanity.
>>>>>>>
>>>>>>> Thanks
>>>>>>> - Ioi
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>