RFR(S) 8214388 CDS dumping fails with java heap fragmentation

Stefan Johansson stefan.johansson at oracle.com
Mon Dec 3 09:28:51 UTC 2018


Hi Ioi,

Thanks for addressing my comments. I think this version looks good from a
G1 point of view.

Thanks,
Stefan

On 2018-11-30 23:36, Ioi Lam wrote:
> Hi Jiangli,
> 
> Thanks for the review!
> 
> 
> Updated webrev: 
> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v03-delta/
> 
> 
> On 11/30/18 1:04 PM, Jiangli Zhou wrote:
>> Hi Ioi,
>>
>> - src/hotspot/share/gc/g1/g1Arguments.cpp
>>
>> You missed FLAG_SET_ERGO(bool, ExplicitGCInvokesConcurrent, false);
> 
> As Stefan mentioned below, doing this is not necessary because I have 
> added a new GC cause.
> 
>> - src/hotspot/share/gc/g1/g1CollectedHeap.cpp
>> - src/hotspot/share/gc/shared/adaptiveSizePolicy.cpp
>>
>> Above files have no change.
> 
> I had multiple MQ patches that cancel each other out. Webrev still 
> generates an empty diff for those files. Please ignore them.
> 
>> - src/hotspot/share/gc/g1/g1HeapVerifier.cpp
>>
>>  320     log_warning(gc, verify)("All free regions should be at the top end of the heap, but "
>>  321                             " we found holes. This is probably caused by (unmovable) humongous"
>>  322                             " allocations, and may lead to fragmentation while"
>>  323                             " writing archive heap memory regions.");
>>
>> This warning is not very clear and can raise false alarms. A hole in
>> the middle of the heap causes heap fragmentation, but not necessarily
>> in the archived regions. I'd leave the warning out and let the
>> existing code in FileMapInfo::write_archive_heap_regions detect
>> archive fragmentation and report the error.
> This warning is about a *possibility* of fragmentation. It gives the 
> user a chance to check what's going on, without having to wait for an 
> actual failure to happen.
> 
> In practice, this means if the JDK start-up code is changed in the 
> future to allocate a humongous array, someone will discover this while 
> building the JDK and dumping the default archive. That way, we can fix 
> the JDK to avoid the use of humongous arrays.
> 
> In fact, I just modified the code to warn against humongous regions even 
> if no holes are found.
> 
> 
>> - src/hotspot/share/gc/shared/gcCause.cpp
>>
>> I'd suggest renaming _cds_dump to _archive_time_gc. The word
>> 'archive' has been adopted for use in GC code.
> 
> Done
> 
> 
>> - src/hotspot/share/memory/filemap.cpp
>>
>> I see no change for line 686?
> A space character is missing after the word "fragmentation."
> 
> 
>> - src/hotspot/share/memory/heapShared.cpp
>>
>> Please revert the renaming for
>> HeapShared::check_closed_archive_heap_region_object.
>> HeapShared::check_closed_archive_heap_region_object_class is not
>> correct. We check the objects in the graph, not the classes. It's
>> better to use a separate bug and changeset for the bug fix in this
>> function, so it's not mixed with the change for 8214388, for better
>> back-tracking.
> I'll revert the whole change and create a new bug.
> 
> I found the name check_closed_archive_heap_region_object confusing 
> because it says "object" but passes a class. I think it's better to pass 
> the object, and then have this function read the class from the object. 
> I'll send a separate RFR.
> 
>> - src/hotspot/share/memory/metaspaceShared.cpp
>>
>> The G1-specific code needs to be wrapped with #if INCLUDE_G1GC.
> Done
> 
> 
>>  391           log_warning(cds)("[line %d] extra interned string ignored; size too large: %d",
>>  392                            reader.last_line_no(), utf8_length);
>>
>> You can use cds+heap for object-archiving-related logging.
> Done
> 
> 
>>  1718     if (UseG1GC && HeapShared::is_heap_object_archiving_allowed()) {
>>
>> As Stefan also pointed out, you don't need the UseG1GC check here.
>> HeapShared::is_heap_object_archiving_allowed already includes it.
> Fixed
> 
> 
>> G1HeapVerifier::verify_ready_for_archiving() is called in two
>> places. One is in HeapShared::archive_java_heap_objects and the
>> other one is in metaspaceShared.cpp. The duplication should be
>> eliminated.
> I removed the one inside metaspaceShared.cpp
> 
> 
>> I also question the need for
>> G1HeapVerifier::verify_ready_for_archiving(). As there is already a
>> more efficient check in FileMapInfo::write_archive_heap_regions to
>> detect fragmented archive regions, verify_ready_for_archiving does
>> not add additional value. It's also a more expensive check.
> It's supposed to provide better logging for diagnosing future
> fragmentation (much easier to read than tracing all the gc+region
> logs). I don't know about you, but for me, trying to find out why
> fragmentation happened in JDK-8214217 was pretty damn tedious!
> 
> It's not expensive at all since it just scans the list of regions once.
> 
> Thanks
> 
> - Ioi
> 
> 
>> Thanks, Jiangli
>>
>> On 11/30/18 11:02 AM, Ioi Lam wrote:
>>> Hi Stefan,
>>>
>>> Thanks for the review. Here's an updated patch:
>>>
>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02/
>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02-delta/
>>>
>>> Please see responses to your comments below.
>>>
>>> I also added code to handle humongous strings that can be allocated
>>> during dump time, and to make sure they are GC'ed (so that no
>>> humongous regions remain) before we start dumping the heap.
>>>
>>> I'll add one more test for the handling of humongous regions that
>>> exist while the heap is being dumped, but that will require a
>>> little hacking with JVMTI. I'll post that later :-)
>>>
>>> On 11/30/18 3:17 AM, Stefan Johansson wrote:
>>>> Hi Ioi,
>>>>
>>>> I mostly looked at the GC interaction, some comments below.
>>>>
>>>> On 2018-11-30 02:07, Ioi Lam wrote:
>>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v01/
>>>>
>>>> src/hotspot/share/gc/g1/g1CollectedHeap.cpp
>>>> -------------------------------------------
>>>>  325 #if INCLUDE_CDS
>>>>  326   if (DumpSharedSpaces && HeapShared::is_heap_object_archiving_allowed()) {
>>>>  327     // See G1HeapVerifier::verify_ready_for_archiving()
>>>>  328     // This should not happen during normal operation of -Xshare:dump, so let's give a warning.
>>>>  329     log_warning(gc)("Avoid allocating humongous objects during -Xshare:dump (" SIZE_FORMAT
>>>>  330                     " bytes) - may cause fragmentation",
>>>>  331                     word_size * HeapWordSize);
>>>>  332   }
>>>>  333 #endif
>>>>
>>>> I think this warning should be left out; you still have output in
>>>> the verification that tells the user what is problematic.
>>>
>>> Removed.
>>>
>>>> src/hotspot/share/gc/g1/g1HeapVerifier.cpp
>>>> -------------------------------------------
>>>> What do you think about extending the verification with one more
>>>> state, maybe _fatal_hole? When iterating the regions you can then
>>>> look for holes not caused by humongous regions. If such holes are
>>>> found, log an error and fail.
>>>>
>>>> You can then turn the current log_error into a log_warning letting
>>>> the user know there have been humongous allocations which can
>>>> cause problems during dump time.
>>>>
>>>>  314   // turn on the following assert only if we disallow humongous allocations during
>>>>  315   // dump time.
>>>>  316   //assert(!cl.has_holes(), "must not have holes");
>>>>
>>>> I think you should remove these lines.
>>>
>>> I modified the verification loop as you suggested, and changed the
>>> assert to check that all holes must be caused by humongous regions.
>>>
>>>> src/hotspot/share/gc/shared/adaptiveSizePolicy.cpp
>>>> --------------------------------------------------
>>>>  184 #if INCLUDE_CDS
>>>>  185   if (DumpSharedSpaces && HeapShared::is_heap_object_archiving_allowed()) {
>>>>  186     // See G1HeapVerifier::verify_ready_for_archiving()
>>>>  187     return 1;
>>>>  188   }
>>>>  189 #endif
>>>>  190
>>>>
>>>> I think this decision should be moved to the G1 argument parsing.
>>>> In G1Arguments::initialize() you can add this after the initial
>>>> setting of ParallelGCThreads:
>>>>
>>>>   // When dumping the CDS archive we want to reduce fragmentation by
>>>>   // triggering a full collection. To get as low fragmentation as
>>>>   // possible we only use one worker thread.
>>>>   if (DumpSharedSpaces) {
>>>>     FLAG_SET_ERGO(uint, ParallelGCThreads, 1);
>>>>     FLAG_SET_ERGO(bool, ExplicitGCInvokesConcurrent, false);
>>>>   }
>>>
>>> Done.
>>>
>>>> As you see, I also include setting ExplicitGCInvokesConcurrent to
>>>> false, because if that flag is set a full gc will be turned into a
>>>> concurrent one.
>>>>
>>>> src/hotspot/share/memory/metaspaceShared.cpp
>>>> --------------------------------------------
>>>>  1706 #if INCLUDE_CDS_JAVA_HEAP
>>>>  1707   if (UseG1GC && HeapShared::is_heap_object_archiving_allowed()) {
>>>>  1708     // Avoid fragmentation while archiving heap objects.
>>>>  1709     Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(true);
>>>>  1710     Universe::heap()->collect(GCCause::_java_lang_system_gc);
>>>>  1711     Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(false);
>>>>  1712     G1HeapVerifier::verify_ready_for_archiving();
>>>>  1713   }
>>>>  1714 #endif
>>>>
>>>> Do you need the #if here? is_heap_object_archiving_allowed() only
>>>> returns true when INCLUDE_CDS_JAVA_HEAP is true. The check for
>>>> UseG1GC is also done within is_heap_object_archiving_allowed().
>>>>
>>>> Regarding triggering the GC, the calls to the soft_ref_policy are
>>>> not needed unless you care about clearing soft references.
>>>
>>> I removed the #if. I kept the clearing of soft refs -- soft refs
>>> are not archived, so clearing them would give more free space for
>>> archiving.
>>>
>>>> An even nicer solution would be to add a new GCCause that is only
>>>> used for CDS dumping. That way you can also get a better message
>>>> in the GC log than "System.gc()", and you can also skip the
>>>> setting of ExplicitGCInvokesConcurrent that I mentioned above,
>>>> since the new GCCause won't be affected by that flag.
>>>
>>> I added GCCause::_cds_dump and skipped the setting of
>>> ExplicitGCInvokesConcurrent.
>>>
>>> Thanks - Ioi
>>>
>>>> Thanks, Stefan
>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8214388
>>>>>
>>>>> Symptom:
>>>>> ========
>>>>>
>>>>> "java -Xshare:dump" would intermittently fail with
>>>>>
>>>>>   Unable to write archive heap ... due to fragmentation.
>>>>>
>>>>> This usually happens when you try to dump many classes (e.g.,
>>>>> 10000) with a relatively small heap (e.g., 1g) and a lot of GC
>>>>> threads (e.g., 24).
>>>>>
>>>>> (Example use case -- Eclipse IDE loads 15,000 classes with a
>>>>> 512MB heap.)
>>>>>
>>>>> When GC happens during class loading, some old G1 regions may
>>>>> be placed at the top end of the heap (due to the large number of
>>>>> GC threads).
>>>>>
>>>>> Later, when writing the archived heap, G1 tries to allocate
>>>>> contiguous regions from the top end of the heap. This would
>>>>> fail due to the presence of those old regions.
>>>>>
>>>>> Fix:
>>>>> ====
>>>>>
>>>>> As suggested by Stefan Johansson, we run a full GC with a
>>>>> single GC thread. This guarantees that all old blocks will be
>>>>> moved to the bottom end of the heap.
>>>>>
>>>>> Because there's no API for specifying the number of GC threads
>>>>> dynamically, and CDS dump time doesn't allocate lots of
>>>>> objects, I have statically forced the number of threads to 1 in
>>>>> AdaptiveSizePolicy::calc_active_workers during CDS dump time.
>>>>>
>>>>> (This seems like a more direct way than assigning
>>>>> ParallelGCThreads ...)
>>>>>
>>>>> Notes:
>>>>> ======
>>>>>
>>>>> 1. Humongous regions cannot move. However, currently we don't
>>>>> do humongous allocations during CDS dump, so we should be fine.
>>>>> I have added diagnostic warnings so that if fragmentation does
>>>>> happen in the future, the user can find out why.
>>>>>
>>>>> 2. Fixed a minor bug in
>>>>> HeapShared::check_closed_archive_heap_region_object_class.
>>>>>
>>>>> 3. Fixed a bug in MetaspaceShared::read_extra_data, where the
>>>>> symbols/strings would be lost due to GC.
>>>>>
>>>>> 4. Added a stress test to successfully archive about 18MB of
>>>>> objects with -Xmx64m. This used to fail even with -Xmx512m on a
>>>>> Solaris box.
>>>>>
>>>>> 5. With default CDS archive generation during the JDK build,
>>>>> -Xmx128m is used. Before this fix, the EDEN region lived at the
>>>>> top of the heap during CDS dump time, and we ended up with a 2MB
>>>>> gap between the archive regions and the top of the heap.
>>>>> Because the archive regions cannot move, at run time using CDS
>>>>> would reduce the max humongous allocation by 2MB.
>>>>>
>>>>> With this fix, the archive regions are now placed at the very
>>>>> top of the heap, so the gap no longer exists.
>>>>>
>>>>> Tests:
>>>>> ======
>>>>>
>>>>> Running hs-tiers{1-6} for sanity.
>>>>>
>>>>> Thanks - Ioi

