RFR(S) 8214388 CDS dumping fails with java heap fragmentation

coleen.phillimore at oracle.com
Mon Dec 3 11:38:53 UTC 2018


I didn't review but looked at this because I was curious.

http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v03-delta/src/hotspot/share/memory/heapShared.cpp.udiff.html

+ if (!fs.access_flags().is_final() && (ft == T_ARRAY || T_OBJECT)) {


This doesn't look right.
thanks,
Coleen

On 11/30/18 5:41 PM, Ioi Lam wrote:
> (Resending since my last mail seems to have been mangled)
>
> Hi Jiangli,
>
> Thanks for the review!
>
> Updated webrev: 
> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v03-delta/
>
>
> On 11/30/18 1:04 PM, Jiangli Zhou wrote:
>> Hi Ioi,
>>
>> - src/hotspot/share/gc/g1/g1Arguments.cpp
>>
>> You missed FLAG_SET_ERGO(bool, ExplicitGCInvokesConcurrent, false);
>
>
> As Stefan mentioned below, doing this is not necessary because I have 
> added a new GC cause.
>
>>
>> - src/hotspot/share/gc/g1/g1CollectedHeap.cpp
>>
>> - src/hotspot/share/gc/shared/adaptiveSizePolicy.cpp
>>
>> Above files have no change.
>>
>
> I had multiple MQ patches that cancel each other out. Webrev still 
> generates an empty diff for those files. Please ignore them.
>
>
>> - src/hotspot/share/gc/g1/g1HeapVerifier.cpp
>>
>>  320     log_warning(gc, verify)("All free regions should be at the top end of the heap, but "
>>  321                             " we found holes. This is probably caused by (unmovable) humongous"
>>  322                             " allocations, and may lead to fragmentation while"
>>  323                             " writing archive heap memory regions.");
>>
>> This warning is not very clear and can raise false alarms. A hole in 
>> the middle of the heap causes heap fragmentation, but not necessarily 
>> in the archived regions. I'd leave the warning out and let the 
>> existing code in FileMapInfo::write_archive_heap_regions detect 
>> archive fragmentation and report the error.
>>
>
> This warning is about a *possibility* of fragmentation. It gives the 
> user a chance to check what's going on, without having to wait for an 
> actual failure to happen.
>
> In practice, this means if the JDK start-up code is changed in the 
> future to allocate a humongous array, someone will discover this while 
> building the JDK and dumping the default archive. That way, we can fix 
> the JDK to avoid the use of humongous arrays.
>
> In fact, I just modified the code to warn against humongous regions 
> even if no holes are found.
>
>
>> - src/hotspot/share/gc/shared/gcCause.cpp
>>
>> I'd suggest renaming _cds_dump to _archive_time_gc. The word 
>> 'archive' has been adopted for use in GC code.
>>
>> - src/hotspot/share/memory/filemap.cpp
>>
>
> Done
>
>
>> I see no change for line 686?
>>
>> - src/hotspot/share/memory/heapShared.cpp
>>
>
> A space character is missing after the word "fragmentation."
>
>
>> Please revert the renaming for 
>> HeapShared::check_closed_archive_heap_region_object. 
>> HeapShared::check_closed_archive_heap_region_object_class is not 
>> correct. We check the objects in the graph, not the classes. It's 
>> better to use a separate bug and changeset for the bug fix in this 
>> function, so it's not mixed with the change for 8214388, for easier 
>> tracking.
>>
>
> I'll revert the whole change and create a new bug.
>
> I found the name check_closed_archive_heap_region_object confusing 
> because it says "object" but passes a class. I think it's better to 
> pass the object, and then have this function read the class from the 
> object. I'll send a separate RFR.
>
>
>> - src/hotspot/share/memory/metaspaceShared.cpp
>>
>> The G1 specific code needs to be wrapped with #if INCLUDE_G1GC.
>>
>> 391           log_warning(cds)("[line %d] extra interned string ignored; size too large: %d",
>> 392                            reader.last_line_no(), utf8_length);
>>
>> You can use cds+heap for object archiving related logging.
>>
>
> Done
>
>
>> - 1718     if (UseG1GC && HeapShared::is_heap_object_archiving_allowed()) {
>>
>> As Stefan also pointed out, you don't need UseG1GC check here. 
>> HeapShared::is_heap_object_archiving_allowed already includes it.
>>
>
> Fixed
>
>
>> G1HeapVerifier::verify_ready_for_archiving() is called in two 
>> places. One is in HeapShared::archive_java_heap_objects and the other 
>> one is in metaspaceShared.cpp. The duplication should be eliminated.
>
>
> I removed the one inside metaspaceShared.cpp
>
>
>> I also have a question about the need for 
>> G1HeapVerifier::verify_ready_for_archiving(). As there is already a 
>> more efficient check in FileMapInfo::write_archive_heap_regions to 
>> detect fragmented archive regions, verify_ready_for_archiving does 
>> not add additional value. It's also a more expensive check.
>>
>
> It's supposed to provide better logging for diagnosing future 
> fragmentation (much easier to read than tracing all the gc+region 
> logs). I don't know about you, but for me, trying to find out why 
> fragmentation happened in JDK-8214217 was pretty damn tedious!
>
> It's not expensive at all, since it just scans the list of regions once.
>
> Thanks
>
> - Ioi
>
>
>> Thanks,
>> Jiangli
>>
>> On 11/30/18 11:02 AM, Ioi Lam wrote:
>>> Hi Stefan,
>>>
>>> Thanks for the review. Here's an updated patch:
>>>
>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02/ 
>>>
>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02-delta/ 
>>>
>>>
>>> Please see responses to your comments below.
>>>
>>> I also added code to handle humongous strings that can be
>>> allocated during dump time, and make sure they are GC'ed (and thus
>>> leave no humongous regions) before we start dumping the heap.
>>>
>>> I'll add one more test to test the handling of humongous regions
>>> that exist while the heap is being dumped, but that will require
>>> a little hacking with JVMTI. I'll post that later :-)
>>>
>>>
>>> On 11/30/18 3:17 AM, Stefan Johansson wrote:
>>>> Hi Ioi,
>>>>
>>>> I mostly looked at the GC interaction, some comments below.
>>>>
>>>> On 2018-11-30 02:07, Ioi Lam wrote:
>>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v01/ 
>>>>>
>>>> src/hotspot/share/gc/g1/g1CollectedHeap.cpp
>>>> -------------------------------------------
>>>>  325 #if INCLUDE_CDS
>>>>  326   if (DumpSharedSpaces && HeapShared::is_heap_object_archiving_allowed()) {
>>>>  327     // See G1HeapVerifier::verify_ready_for_archiving()
>>>>  328     // This should not happen during normal operation of -Xshare:dump, so let's give a warning.
>>>>  329     log_warning(gc)("Avoid allocating humongous objects during -Xshare:dump (" SIZE_FORMAT
>>>>  330                     " bytes) - may cause fragmentation",
>>>>  331                     word_size * HeapWordSize);
>>>>  332   }
>>>>  333 #endif
>>>>
>>>> I think this warning should be left out, you still have output in 
>>>> the verification that tells the user what is problematic.
>>>>
>>>
>>> Removed.
>>>
>>>> /src/hotspot/share/gc/g1/g1HeapVerifier.cpp
>>>> -------------------------------------------
>>>> What do you think about extending the verification with one more 
>>>> state, maybe _fatal_hole? When iterating the regions you can then 
>>>> look for holes not caused by humongous allocations. If such holes 
>>>> are found, log an error and fail.
>>>>
>>>> You can then turn the current log_error into a log_warning, letting 
>>>> the user know there have been humongous allocations which can cause 
>>>> problems during dump time.
>>>>
>>>>  314   // turn on the following assert only if we disallow humongous allocations during
>>>>  315   // dump time.
>>>>  316   //assert(!cl.has_holes(), "must not have holes");
>>>>
>>>> I think you should remove these lines.
>>>>
>>>
>>> I modified the verification loop as you suggested, and changed the 
>>> assert to check that all holes must be caused by humongous regions.
>>>
>>>> src/hotspot/share/gc/shared/adaptiveSizePolicy.cpp
>>>> --------------------------------------------------
>>>>  184 #if INCLUDE_CDS
>>>>  185   if (DumpSharedSpaces && HeapShared::is_heap_object_archiving_allowed()) {
>>>>  186     // See G1HeapVerifier::verify_ready_for_archiving()
>>>>  187     return 1;
>>>>  188   }
>>>>  189 #endif
>>>>  190
>>>>
>>>> I think this decision should be moved to the g1 argument parsing. 
>>>> In G1Arguments::initialize() you can add this after the initial 
>>>> setting of ParallelGCThreads.
>>>>
>>>>   // When dumping the CDS archive we want to reduce fragmentation by
>>>>   // triggering a full collection. To get as low fragmentation as
>>>>   // possible we only use one worker thread.
>>>>   if (DumpSharedSpaces) {
>>>>     FLAG_SET_ERGO(uint, ParallelGCThreads, 1);
>>>>     FLAG_SET_ERGO(bool, ExplicitGCInvokesConcurrent, false);
>>>>   }
>>>>
>>>
>>> Done.
>>>
>>>> As you see I also include setting ExplicitGCInvokesConcurrent to 
>>>> false, because if that flag is set a full gc will be turned into a 
>>>> concurrent one.
>>>>
>>>> src/hotspot/share/memory/metaspaceShared.cpp
>>>> --------------------------------------------
>>>> 1706 #if INCLUDE_CDS_JAVA_HEAP
>>>> 1707     if (UseG1GC && HeapShared::is_heap_object_archiving_allowed()) {
>>>> 1708       // Avoid fragmentation while archiving heap objects.
>>>> 1709       Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(true);
>>>> 1710       Universe::heap()->collect(GCCause::_java_lang_system_gc);
>>>> 1711       Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(false);
>>>> 1712       G1HeapVerifier::verify_ready_for_archiving();
>>>> 1713     }
>>>> 1714 #endif
>>>>
>>>> Do you need the #if here? is_heap_object_archiving_allowed() only 
>>>> returns true when INCLUDE_CDS_JAVA_HEAP is true. Also, the check for 
>>>> UseG1GC is already done within is_heap_object_archiving_allowed().
>>>>
>>>> Regarding triggering the GC, the calls to the soft_ref_policy are 
>>>> not needed unless you care about clearing soft references.
>>>
>>> I removed the #if. I kept the clearing of soft refs -- soft refs are 
>>> not archived, so clearing them would give more free space for 
>>> archiving.
>>>
>>>> An even nicer solution would be to add a new GCCause that is only 
>>>> used for CDS dumping. That way you can also get a better message in 
>>>> the GC log than "System.gc()", and you can skip the setting of 
>>>> ExplicitGCInvokesConcurrent that I mentioned above, since the new 
>>>> GCCause won't be affected by that flag.
>>>>
>>>
>>> I added GCCause::_cds_dump and skipped the setting of 
>>> ExplicitGCInvokesConcurrent
>>>
>>> Thanks
>>> - Ioi
>>>
>>>> Thanks,
>>>> Stefan
>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8214388
>>>>>
>>>>>
>>>>> Symptom:
>>>>> ========
>>>>>
>>>>> "java -Xshare:dump" would intermittently fail with
>>>>>
>>>>>     Unable to write archive heap ... due to fragmentation.
>>>>>
>>>>> This usually happens when you try to dump many classes (e.g. 
>>>>> 10000) with
>>>>> a relatively small heap (e.g., 1g) with a lot of GC threads (e.g., 
>>>>> 24).
>>>>>
>>>>> (Example use case -- Eclipse IDE loads 15,000 classes with 512MB 
>>>>> heap.)
>>>>>
>>>>> When GC happens during class loading, some old G1 regions may be 
>>>>> placed
>>>>> at the top end of the heap (due to the large number of GC threads).
>>>>>
>>>>> Later, when writing the archived heap, G1 tries to allocate 
>>>>> contiguous
>>>>> regions from the top end of the heap. This would fail due to the 
>>>>> presence
>>>>> of those old regions.
>>>>>
>>>>>
>>>>> Fix:
>>>>> ====
>>>>>
>>>>> As suggested by Stefan Johansson, we run a full GC with a single GC
>>>>> thread. This guarantees that all old blocks will be moved to the 
>>>>> bottom
>>>>> end of the heap.
>>>>>
>>>>> Because there's no API for specifying the number of GC threads 
>>>>> dynamically,
>>>>> and CDS dump time doesn't allocate lots of objects, I have 
>>>>> statically forced
>>>>> the number of threads to 1 in 
>>>>> AdaptiveSizePolicy::calc_active_workers during
>>>>> CDS dump time.
>>>>>
>>>>> (This seems like a more direct way than assigning 
>>>>> ParallelGCThreads ...)
>>>>>
>>>>>
>>>>> Notes:
>>>>> ======
>>>>>
>>>>> 1. Humongous regions cannot move. However, currently we don't do 
>>>>> humongous
>>>>>     allocations during CDS dump, so we should be fine. I have 
>>>>> added diagnostic
>>>>>     warnings so that if fragmentation does happen in the future, the 
>>>>> user can
>>>>>     find out why.
>>>>>
>>>>> 2. Fixed a minor bug in 
>>>>> HeapShared::check_closed_archive_heap_region_object_class
>>>>>
>>>>> 3. Fixed a bug in MetaspaceShared::read_extra_data, where the 
>>>>> symbol/strings
>>>>>     would be lost due to GC.
>>>>>
>>>>> 4. Added stress test to successfully archive about 18MB of objects 
>>>>> with -Xmx64m.
>>>>>     This used to fail even with -Xmx512m on a Solaris box.
>>>>>
>>>>> 5. With default CDS archive generation during JDK build time, 
>>>>> -Xmx128m is used.
>>>>>     Before this fix, the EDEN region lives at the top of the heap 
>>>>> during CDS dump
>>>>>     time, and we end up with a 2MB gap between the archive regions 
>>>>> and the top
>>>>>     of the heap. Because the archive regions cannot move, at run 
>>>>> time, using CDS
>>>>>     would reduce the max humongous allocation by 2MB.
>>>>>
>>>>>     With this fix, the archive regions are now placed at the very 
>>>>> top of the heap,
>>>>>     so the gap no longer exists.
>>>>>
>>>>>
>>>>> Tests:
>>>>> ======
>>>>>
>>>>> Running hs-tiers{1-6} for sanity.
>>>>>
>>>>> Thanks
>>>>> - Ioi
>>>>>
>>>>>
>>>
>>


