RFR(S) 8214388 CDS dumping fails with java heap fragmentation
Ioi Lam
ioi.lam at oracle.com
Fri Nov 30 20:11:22 UTC 2018
On 11/30/18 11:02 AM, Ioi Lam wrote:
> Hi Stefan,
>
> Thanks for the review. Here's an updated patch:
>
> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02/
>
> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02-delta/
>
>
> Please see responses to your comments below.
>
> I also added code to handle humongous strings that can be
> allocated during dump time, and to make sure they are GC'ed (so
> no humongous regions remain) before we start dumping the heap.
>
> I'll add one more test to exercise the handling of humongous regions
> that exist while the heap is being dumped, but that will require
> a little hacking with JVMTI. I'll post that later :-)
>
^^^^^ I added a new test "HumongousDuringDump.java" for the above
purpose and updated the webrev in-place.
Thanks
- Ioi
>
> On 11/30/18 3:17 AM, Stefan Johansson wrote:
>> Hi Ioi,
>>
>> I mostly looked at the GC interaction, some comments below.
>>
>> On 2018-11-30 02:07, Ioi Lam wrote:
>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v01/
>>>
>> src/hotspot/share/gc/g1/g1CollectedHeap.cpp
>> -------------------------------------------
>>  325 #if INCLUDE_CDS
>>  326   if (DumpSharedSpaces && HeapShared::is_heap_object_archiving_allowed()) {
>>  327     // See G1HeapVerifier::verify_ready_for_archiving()
>>  328     // This should not happen during normal operation of -Xshare:dump, so let's give a warning.
>>  329     log_warning(gc)("Avoid allocating humongous objects during -Xshare:dump (" SIZE_FORMAT
>>  330                     " bytes) - may cause fragmentation",
>>  331                     word_size * HeapWordSize);
>>  332   }
>>  333 #endif
>>
>> I think this warning should be left out; you still have output in the
>> verification that tells the user what is problematic.
>>
>
> Removed.
>
>> src/hotspot/share/gc/g1/g1HeapVerifier.cpp
>> -------------------------------------------
>> What do you think about extending the verification with one more
>> state, maybe _fatal_hole? When iterating the regions you can then
>> look for holes not caused by humongous regions. If such holes are
>> found, log an error and fail.
>>
>> You can then turn the current log_error into a log_warning, letting
>> the user know there have been humongous allocations, which can cause
>> problems during dump time.
>>
>>  314 // turn on the following assert only if we disallow humongous allocations during
>>  315 // dump time.
>>  316 //assert(!cl.has_holes(), "must not have holes");
>>
>> I think you should remove these lines.
>>
>
> I modified the verification loop as you suggested, and changed the
> assert to check that all holes must be caused by humongous regions.
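>
> Roughly, the region iteration now works like this (a paraphrased
> sketch, not the exact code in the webrev):
>
>   class VerifyReadyForArchivingRegionClosure : public HeapRegionClosure {
>     bool _seen_free;            // have we passed a free region yet?
>     bool _has_holes;            // non-free region found above a free one
>     bool _has_unexpected_holes; // holes not caused by humongous regions
>    public:
>     VerifyReadyForArchivingRegionClosure() :
>       _seen_free(false), _has_holes(false), _has_unexpected_holes(false) {}
>     bool has_holes() const            { return _has_holes; }
>     bool has_unexpected_holes() const { return _has_unexpected_holes; }
>     virtual bool do_heap_region(HeapRegion* hr) {
>       if (hr->is_free()) {
>         _seen_free = true;
>       } else if (_seen_free) {
>         // Free space below a used region is a hole. Only humongous
>         // regions, which cannot move, are allowed to cause one.
>         _has_holes = true;
>         if (!hr->is_humongous()) {
>           _has_unexpected_holes = true;
>         }
>       }
>       return false; // false == keep iterating over all regions
>     }
>   };
>
> has_holes() drives the log_warning about humongous allocations, and the
> assert checks !cl.has_unexpected_holes() instead of !cl.has_holes().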
>
>> src/hotspot/share/gc/shared/adaptiveSizePolicy.cpp
>> --------------------------------------------------
>>  184 #if INCLUDE_CDS
>>  185   if (DumpSharedSpaces && HeapShared::is_heap_object_archiving_allowed()) {
>>  186     // See G1HeapVerifier::verify_ready_for_archiving()
>>  187     return 1;
>>  188   }
>>  189 #endif
>>  190
>>
>> I think this decision should be moved to the G1 argument parsing. In
>> G1Arguments::initialize() you can add this after the initial setting
>> of ParallelGCThreads.
>>
>> // When dumping the CDS archive we want to reduce fragmentation by
>> // triggering a full collection. To get as low fragmentation as
>> // possible we only use one worker thread.
>> if (DumpSharedSpaces) {
>>   FLAG_SET_ERGO(uint, ParallelGCThreads, 1);
>>   FLAG_SET_ERGO(bool, ExplicitGCInvokesConcurrent, false);
>> }
>>
>
> Done.
>
>> As you can see, I also included setting ExplicitGCInvokesConcurrent to
>> false, because if that flag is set, a full GC will be turned into a
>> concurrent one.
>>
>> src/hotspot/share/memory/metaspaceShared.cpp
>> --------------------------------------------
>>  1706 #if INCLUDE_CDS_JAVA_HEAP
>>  1707   if (UseG1GC && HeapShared::is_heap_object_archiving_allowed()) {
>>  1708     // Avoid fragmentation while archiving heap objects.
>>  1709     Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(true);
>>  1710     Universe::heap()->collect(GCCause::_java_lang_system_gc);
>>  1711     Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(false);
>>  1712     G1HeapVerifier::verify_ready_for_archiving();
>>  1713   }
>>  1714 #endif
>>
>> Do you need the #if here? is_heap_object_archiving_allowed() only
>> returns true when INCLUDE_CDS_JAVA_HEAP is true, and the check for
>> UseG1GC is also done within is_heap_object_archiving_allowed().
>>
>> Regarding triggering the GC, the calls to the soft_ref_policy are not
>> needed unless you care about clearing soft references.
>
> I removed the #if. I kept the clearing of soft refs -- soft refs are
> not archived, so clearing them would give more free space for archiving.
>
>> An even nicer solution would be to add a new GCCause that is only
>> used for CDS dumping. That way you also get a better message in
>> the GC log than "System.gc()", and you can skip the setting of
>> ExplicitGCInvokesConcurrent that I mentioned above, since the new
>> GCCause won't be affected by that flag.
>>
>
> I added GCCause::_cds_dump and skipped the setting of
> ExplicitGCInvokesConcurrent.
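>
> For reference, the trigger in metaspaceShared.cpp now reads roughly
> as follows (paraphrased; the exact comments may differ):
>
>   if (HeapShared::is_heap_object_archiving_allowed()) {
>     // Avoid fragmentation while archiving heap objects: run a full GC
>     // to move all live objects to the bottom of the heap. Soft refs
>     // are not archived, so clear them to free up more space. The new
>     // GCCause is not affected by ExplicitGCInvokesConcurrent.
>     Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(true);
>     Universe::heap()->collect(GCCause::_cds_dump);
>     Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(false);
>     G1HeapVerifier::verify_ready_for_archiving();
>   }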
>
> Thanks
> - Ioi
>
>> Thanks,
>> Stefan
>>
>>> https://bugs.openjdk.java.net/browse/JDK-8214388
>>>
>>>
>>> Symptom:
>>> ========
>>>
>>> "java -Xshare:dump" would intermittently fail with
>>>
>>> Unable to write archive heap ... due to fragmentation.
>>>
>>> This usually happens when you try to dump many classes (e.g., 10000)
>>> with a relatively small heap (e.g., 1g) and a lot of GC threads
>>> (e.g., 24).
>>>
>>> (Example use case -- Eclipse IDE loads 15,000 classes with 512MB heap.)
>>>
>>> When GC happens during class loading, some old G1 regions may be
>>> placed at the top end of the heap (due to the large number of GC
>>> threads).
>>>
>>> Later, when writing the archived heap, G1 tries to allocate contiguous
>>> regions from the top end of the heap. This would fail due to the
>>> presence of those old regions.
>>>
>>>
>>> Fix:
>>> ====
>>>
>>> As suggested by Stefan Johansson, we run a full GC with a single GC
>>> thread. This guarantees that all old blocks will be moved to the bottom
>>> end of the heap.
>>>
>>> Because there's no API for specifying the number of GC threads
>>> dynamically, and CDS dump time doesn't allocate lots of objects,
>>> I have statically forced the number of threads to 1 in
>>> AdaptiveSizePolicy::calc_active_workers during CDS dump time.
>>>
>>> (This seems like a more direct way than assigning ParallelGCThreads
>>> ...)
>>>
>>>
>>> Notes:
>>> ======
>>>
>>> 1. Humongous regions cannot move. However, we currently don't do
>>>    humongous allocations during CDS dump, so we should be fine. I have
>>>    added diagnostic warnings so that if fragmentation does happen in
>>>    the future, the user can find out why.
>>>
>>> 2. Fixed a minor bug in
>>>    HeapShared::check_closed_archive_heap_region_object_class.
>>>
>>> 3. Fixed a bug in MetaspaceShared::read_extra_data, where the
>>>    symbols/strings would be lost due to GC.
>>>
>>> 4. Added a stress test that successfully archives about 18MB of
>>>    objects with -Xmx64m. This used to fail even with -Xmx512m on a
>>>    Solaris box.
>>>
>>> 5. With default CDS archive generation during JDK build time, -Xmx128m
>>>    is used. Before this fix, the EDEN region lived at the top of the
>>>    heap during CDS dump time, and we ended up with a 2MB gap between
>>>    the archive regions and the top of the heap. Because the archive
>>>    regions cannot move, using CDS at run time would reduce the max
>>>    humongous allocation by 2MB.
>>>
>>>    With this fix, the archive regions are now placed at the very top
>>>    of the heap, so the gap no longer exists.
>>>
>>>
>>> Tests:
>>> ======
>>>
>>> Running hs-tiers{1-6} for sanity.
>>>
>>> Thanks
>>> - Ioi
>>>
>>>
>