RFR(S) 8214388 CDS dumping fails with java heap fragmentation

Mon Dec 3 18:11:18 UTC 2018

Yes, that's a bug. I was going to fix it as part of this patch, but 
Jiangli suggested to use a different bug ID instead. So I filed

https://bugs.openjdk.java.net/browse/JDK-8214726

Thanks

- Ioi

On 12/3/18 3:38 AM, coleen.phillimore at oracle.com wrote:
>
> I didn't review but looked at this because I was curious.
>
> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v03-delta/src/hotspot/share/memory/heapShared.cpp.udiff.html 
>
>
> + if (!fs.access_flags().is_final() && (ft == T_ARRAY || T_OBJECT)) {
>
>
> This doesn't look right.
> thanks,
> Coleen
>
> On 11/30/18 5:41 PM, Ioi Lam wrote:
>> (Resending since my last mail seems to have been mangled)
>>
>> Hi Jiangli,
>>
>> Thanks for the review!
>>
>> Updated webrev: 
>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v03-delta/
>>
>>
>> On 11/30/18 1:04 PM, Jiangli Zhou wrote:
>>> Hi Ioi,
>>>
>>> - src/hotspot/share/gc/g1/g1Arguments.cpp
>>>
>>> You missed FLAG_SET_ERGO(bool, ExplicitGCInvokesConcurrent, false);
>>
>>
>> As Stefan mentioned below, doing this is not necessary because I have 
>> added a new GC cause.
>>
>>>
>>> - src/hotspot/share/gc/g1/g1CollectedHeap.cpp
>>>
>>> - src/hotspot/share/gc/shared/adaptiveSizePolicy.cpp
>>>
>>> Above files have no change.
>>>
>>
>> I had multiple MQ patches that cancel each other out. Webrev still 
>> generates an empty diff for those files. Please ignore them.
>>
>>
>>> - src/hotspot/share/gc/g1/g1HeapVerifier.cpp
>>>
>>>  320     log_warning(gc, verify)("All free regions should be at the 
>>> top end of the heap, but "
>>>  321                             " we found holes. This is probably 
>>> caused by (unmovable) humongous"
>>>  322                             " allocations, and may lead to 
>>> fragmentation while"
>>>  323                             " writing archive heap memory 
>>> regions.");
>>>
>>> This warning is not very clear and can be false alarming. The hole 
>>> in the middle of the heap causes heap fragmentation, but not 
>>> necessarily in the archived regions. I'd leave the warning out and 
>>> let the existing code in FileMapInfo::write_archive_heap_regions to 
>>> detect archive fragmentation and report error.
>>>
>>
>> This warning is about a *possibility* of fragmentation. It gives the 
>> user a chance to check what's going on, without having to wait for an 
>> actual failure to happen.
>>
>> In practice, this means if the JDK start-up code is changed in the 
>> future to allocate a humongous array, someone will discover this 
>> while building the JDK and dumping the default archive. That way, we 
>> can fix the JDK to avoid the use of humongous arrays.
>>
>> In fact, I just modified the code to warn against humongous regions 
>> even if no holes are found.
>>
>>
>>> - src/hotspot/share/gc/shared/gcCause.cpp
>>>
>>> I'd suggest renaming _cds_dump to _archive_time_gc. The word 
>>> 'archive' has been adapted to be used in GC code.
>>>
>>> - src/hotspot/share/memory/filemap.cpp
>>>
>>
>> Done
>>
>>
>>> I see no change for line 686?
>>>
>>> - src/hotspot/share/memory/heapShared.cpp
>>>
>>
>> A space character is missing after the word "fragmentation."
>>
>>
>>> Please revert the renaming for 
>>> HeapShared::check_closed_archive_heap_region_object. 
>>> HeapShared::check_closed_archive_heap_region_object_class is not 
>>> correct. We check the objects in the graph, not the classes. It's 
>>> better to use a separate bug and changeset for the bug fix in this 
>>> function. So it's not mixed with the change for 8214388 for better 
>>> back-tracking.
>>>
>>
>> I'll revert the whole change and create a new bug.
>>
>> I found the name check_closed_archive_heap_region_object confusing 
>> because it says "object" but passes a class. I think it's better to 
>> pass the object, and then have this function read the class from the 
>> object. I'll send a separate RFR.
>>
>>
>>> - src/hotspot/share/memory/metaspaceShared.cpp
>>>
>>> The G1 specific code needs to be wrapped with #if INCLUDE_G1GC.
>>>
>>> 391           log_warning(cds)("[line %d] extra interned string 
>>> ignored; size too large: %d",
>>> 392                            reader.last_line_no(), utf8_length);
>>>
>>> You can use cds+heap for object archiving related logging.
>>>
>>
>> Done
>>
>>
>>> - 1718     if (UseG1GC && 
>>> HeapShared::is_heap_object_archiving_allowed()) {
>>>
>>> As Stefan also pointed out, you don't need UseG1GC check here. 
>>> HeapShared::is_heap_object_archiving_allowed already includes it.
>>>
>>
>> Fixed
>>
>>
>>> G1HeapVerifier::verify_ready_for_archiving() is called into two 
>>> places. One is in HeapShared::archive_java_heap_objects and the 
>>> other one is in HeapShared::archive_java_heap_objects. The 
>>> duplication should be eliminated.
>>
>>
>> I removed the one inside metaspaceShared.cpp
>>
>>
>>> I also have the question for the need for 
>>> G1HeapVerifier::verify_ready_for_archiving(). As there is already 
>>> more efficient check in FileMapInfo::write_archive_heap_regions to 
>>> detect fragmented archive regions, verify_ready_for_archiving does 
>>> not add additional values. It's also a more expensive check.
>>>
>>
>> It's suppose to provide better logging for diagnosing future 
>> fragmentations (much easier to read than tracing all the gc+region 
>> logs.). I don't know about you, but for me trying to find out why 
>> fragmentation happened in JDK-8214217 was pretty damn tedious!
>>
>> It's not expensive at all since it just scans the list of regions once.
>>
>> Thanks
>>
>> - Ioi
>>
>>
>>> Thanks,
>>> Jiangli
>>>
>>> On 11/30/18 11:02 AM, Ioi Lam wrote:
>>>> Hi Stefan,
>>>>
>>>> Thanks for the review. Here's an updated patch:
>>>>
>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02/ 
>>>>
>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02-delta/ 
>>>>
>>>>
>>>> Please see responses to your comments below.
>>>>
>>>> I also added code to handle humongous strings that can be
>>>> allocated during dump time, and make sure they are GC'ed (and thus
>>>> no humongous regions) before we starting dumping the heap.
>>>>
>>>> I'll add one more test to test the handling of humongous regions
>>>> that exist while the heap is being dumped, but that will require
>>>> a little hacking with JVMTI. I'll post that later :-)
>>>>
>>>>
>>>> On 11/30/18 3:17 AM, Stefan Johansson wrote:
>>>>> Hi Ioi,
>>>>>
>>>>> I mostly looked at the GC interaction, some comments below.
>>>>>
>>>>> On 2018-11-30 02:07, Ioi Lam wrote:
>>>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v01/ 
>>>>>>
>>>>> src/hotspot/share/gc/g1/g1CollectedHeap.cpp
>>>>> -------------------------------------------
>>>>>  325 #if INCLUDE_CDS
>>>>>  326   if (DumpSharedSpaces && 
>>>>> HeapShared::is_heap_object_archiving_allowed()) {
>>>>>  327     // See G1HeapVerifier::verify_ready_for_archiving()
>>>>>  328     // This should not happen during normal operation of 
>>>>> -Xshare:dump, so let's give a warning.
>>>>>  329     log_warning(gc)("Avoid allocating humongous objects 
>>>>> during -Xshare:dump (" SIZE_FORMAT
>>>>>  330                     " bytes) - may cause fragmentation",
>>>>>  331                     word_size * HeapWordSize);
>>>>>  332   }
>>>>>  333 #endif
>>>>>
>>>>> I think this warning should be left out, you still have output in 
>>>>> the verification that tells the user what is problematic.
>>>>>
>>>>
>>>> Removed.
>>>>
>>>>> /src/hotspot/share/gc/g1/g1HeapVerifier.cpp
>>>>> -------------------------------------------
>>>>> What do you think about extending the verification with one more 
>>>>> state, maybe _fatal_hole. When iterating the regions you can then 
>>>>> look for holes not caused by humongous. If such holes are found 
>>>>> log an error and fail.
>>>>>
>>>>> You can then turn the current log_error into a log_warning letting 
>>>>> the user know there has been humongous allocations which can cause 
>>>>> problems during dump time.
>>>>>
>>>>>  314   // turn on the following assert only if we disallow 
>>>>> humongous allocations during
>>>>>  315   // dump time.
>>>>>  316   //assert(!cl.has_holes(), "must not have holes");
>>>>>
>>>>> I think you should remove these lines.
>>>>>
>>>>
>>>> I modified the verification loop as you suggested, and changed the 
>>>> assert to check that all holes must be caused by humongous regions.
>>>>
>>>>> src/hotspot/share/gc/shared/adaptiveSizePolicy.cpp
>>>>> --------------------------------------------------
>>>>>  184 #if INCLUDE_CDS
>>>>>  185   if (DumpSharedSpaces && 
>>>>> HeapShared::is_heap_object_archiving_allowed()) {
>>>>>  186     // See G1HeapVerifier::verify_ready_for_archiving()
>>>>>  187     return 1;
>>>>>  188   }
>>>>>  189 #endif
>>>>>  190
>>>>>
>>>>> I think this decision should be moved to the g1 argument parsing. 
>>>>> In G1Arguments::initialize() you can add this after the initial 
>>>>> setting of ParallelGCThreads.
>>>>>
>>>>>   // When dumping the CDS archive we want to reduce fragmentation by
>>>>>   // triggering a full collection. To get as low fragmentation as
>>>>>   // possible we only use one worker thread.
>>>>>   if (DumpSharedSpaces) {
>>>>>     FLAG_SET_ERGO(uint, ParallelGCThreads, 1);
>>>>>     FLAG_SET_ERGO(bool, ExplicitGCInvokesConcurrent, false);
>>>>>   }
>>>>>
>>>>
>>>> Done.
>>>>
>>>>> As you see I also include setting ExplicitGCInvokesConcurrent to 
>>>>> false, because if that flag is set a full gc will be turned into a 
>>>>> concurrent one.
>>>>>
>>>>> src/hotspot/share/memory/metaspaceShared.cpp
>>>>> --------------------------------------------
>>>>> 1706 #if INCLUDE_CDS_JAVA_HEAP
>>>>> 1707     if (UseG1GC && 
>>>>> HeapShared::is_heap_object_archiving_allowed()) {
>>>>> 1708       // Avoid fragmentation while archiving heap objects.
>>>>> 1709 
>>>>> Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(true);
>>>>> 1710 Universe::heap()->collect(GCCause::_java_lang_system_gc);
>>>>> 1711 
>>>>> Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(false);
>>>>> 1712       G1HeapVerifier::verify_ready_for_archiving();
>>>>> 1713     }
>>>>> 1714 #endif
>>>>>
>>>>> Do you need the #if here, is_heap_object_archiving_allowed() only 
>>>>> returns true when INCLUDE_CDS_JAVA_HEAP is true. Also the check 
>>>>> for UseG1GC is also done within is_heap_object_archiving_allowed().
>>>>>
>>>>> Regarding triggering the GC, the calls to the soft_ref_policy is 
>>>>> not needed unless you care about clearing soft references.
>>>>
>>>> I removed the #if. I kept the clearing of soft refs -- soft refs 
>>>> are not archived, so clearing them would give more free space for 
>>>> archiving.
>>>>
>>>>> An even nicer solution would be to add a new GCCause that is only 
>>>>> used for CDS dumping, that way you can also get a better message 
>>>>> in the GC log than "System.gc()" and you can also skip the setting 
>>>>> of ExplicitGCInvokesConcurrent that I mentioned above. Since the 
>>>>> new GCCause won't be affected by that flag.
>>>>>
>>>>
>>>> I added GCCause::_cds_dump and skipped the setting of 
>>>> ExplicitGCInvokesConcurrent
>>>>
>>>> Thanks
>>>> - Ioi
>>>>
>>>>> Thanks,
>>>>> Stefan
>>>>>
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8214388
>>>>>>
>>>>>>
>>>>>> Symptom:
>>>>>> ========
>>>>>>
>>>>>> "java -Xshare:dump" would intermittently fail with
>>>>>>
>>>>>>     Unable to write archive heap ... due to fragmentation.
>>>>>>
>>>>>> This usually happens when you try to dump many classes (e.g. 
>>>>>> 10000) with
>>>>>> a relatively small heap (e.g., 1g) with a lot of GC threads 
>>>>>> (e.g., 24).
>>>>>>
>>>>>> (Example use case -- Eclipse IDE loads 15,000 classes with 512MB 
>>>>>> heap.)
>>>>>>
>>>>>> When GC happens during class loading, some old G1 regions may be 
>>>>>> placed
>>>>>> at the top end of the heap (due to large number of GC threads).
>>>>>>
>>>>>> Later, when writing the archived heap, G1 tries to allocate 
>>>>>> contiguous
>>>>>> regions from the top end of the heap. This would fail due to the 
>>>>>> presence
>>>>>> of those old regions.
>>>>>>
>>>>>>
>>>>>> Fix:
>>>>>> ====
>>>>>>
>>>>>> As suggested by Stefan Johansson, we run a full GC with a single GC
>>>>>> thread. This guarantees that all old blocks will be moved to the 
>>>>>> bottom
>>>>>> end of the heap.
>>>>>>
>>>>>> Because there's no API for specifying the number of GC threads 
>>>>>> dynamically,
>>>>>> and CDS dump time doesn't allocates lots of objects, I have 
>>>>>> statically forced
>>>>>> the number of threads to 1 in 
>>>>>> AdaptiveSizePolicy::calc_active_workers during
>>>>>> CDS dump time.
>>>>>>
>>>>>> (This seems like a more direct way than assigning 
>>>>>> ParallelGCThreads ...)
>>>>>>
>>>>>>
>>>>>> Notes:
>>>>>> ======
>>>>>>
>>>>>> 1. Humongous regions cannot move. However, currently we don't do 
>>>>>> humongous
>>>>>>     allocations during CDS dump, so we should be fine. I have 
>>>>>> added diagnostics
>>>>>>     warnings so if fragmentation does happen in the future, the 
>>>>>> user can
>>>>>>     find out why.
>>>>>>
>>>>>> 2. Fixed a minor bug in 
>>>>>> HeapShared::check_closed_archive_heap_region_object_class
>>>>>>
>>>>>> 3. Fixed a bug in MetaspaceShared::read_extra_data, where the 
>>>>>> symbol/strings
>>>>>>     would be lost due to GC.
>>>>>>
>>>>>> 4. Added stress test to successfully archive about 18MB of 
>>>>>> objects with -Xmx64m.
>>>>>>     This used to fail even with -Xmx512m on a Solaris box.
>>>>>>
>>>>>> 5. With default CDS archive generation during JDK build time, 
>>>>>> -Xmx128m is used.
>>>>>>     Before this fix, the EDEN region lives at the top of the heap 
>>>>>> during CDS dump
>>>>>>     time, and we end up with a 2MB gap between the archive 
>>>>>> regions and the top
>>>>>>     of the heap. Because the archive regions cannot move, at run 
>>>>>> time, using CDS
>>>>>>     would reduce the max humongous allocation by 2MB.
>>>>>>
>>>>>>     With this fix, the archive regions are now placed at the very 
>>>>>> top of the heap,
>>>>>>     so the gap no longer exists.
>>>>>>
>>>>>>
>>>>>> Tests:
>>>>>> ======
>>>>>>
>>>>>> Running hs-tiers{1-6} for sanity.
>>>>>>
>>>>>> Thanks
>>>>>> - Ioi
>>>>>>
>>>>>>
>>>>
>>>
>