RFR(S) 8214388 CDS dumping fails with java heap fragmentation

Ioi Lam ioi.lam at oracle.com
Mon Dec 3 23:58:50 UTC 2018


Hi Stefan, thanks for the review.

- Ioi


On 12/3/2018 1:28 AM, Stefan Johansson wrote:
> Hi Ioi,
>
> Thanks for addressing my comments I think this version looks good from 
> a G1 point of view.
>
> Thanks,
> Stefan
>
> On 2018-11-30 23:36, Ioi Lam wrote:
>> Hi Jiangli,
>>
>> Thanks for the review!
>>
>>
>> Updated webrev: 
>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v03-delta/
>>
>>
>> On 11/30/18 1:04 PM, Jiangli Zhou wrote:
>>> Hi Ioi,
>>>
>>> - src/hotspot/share/gc/g1/g1Arguments.cpp
>>>
>>> You missed FLAG_SET_ERGO(bool, ExplicitGCInvokesConcurrent, false);
>>
>> As Stefan mentioned below, doing this is not necessary because I have 
>> added a new GC cause.
>>
>>> - src/hotspot/share/gc/g1/g1CollectedHeap.cpp
>>> - src/hotspot/share/gc/shared/adaptiveSizePolicy.cpp
>>>
>>> Above files have no change.
>>
>> I had multiple MQ patches that cancel each other out, so webrev still
>> generates an empty diff for those files. Please ignore them.
>>
>>> - src/hotspot/share/gc/g1/g1HeapVerifier.cpp
>>>
>>> 320     log_warning(gc, verify)("All free regions should be at the top end of the heap, but"
>>> 321                             " we found holes. This is probably caused by (unmovable) humongous"
>>> 322                             " allocations, and may lead to fragmentation while"
>>> 323                             " writing archive heap memory regions.");
>>>
>>> This warning is not very clear and can be falsely alarming. A hole in
>>> the middle of the heap causes heap fragmentation, but not necessarily
>>> in the archived regions. I'd leave the warning out and let the
>>> existing code in FileMapInfo::write_archive_heap_regions detect
>>> archive fragmentation and report the error.
>> This warning is about a *possibility* of fragmentation. It gives the 
>> user a chance to check what's going on, without having to wait for an 
>> actual failure to happen.
>>
>> In practice, this means if the JDK start-up code is changed in the 
>> future to allocate a humongous array, someone will discover this 
>> while building the JDK and dumping the default archive. That way, we 
>> can fix the JDK to avoid the use of humongous arrays.
>>
>> In fact, I just modified the code to warn against humongous regions 
>> even if no holes are found.
>>
>>
>>> - src/hotspot/share/gc/shared/gcCause.cpp
>>>
>>> I'd suggest renaming _cds_dump to _archive_time_gc. The word
>>> 'archive' has been adopted for use in GC code.
>>
>> Done
>>
>>
>>> - src/hotspot/share/memory/filemap.cpp
>>>
>>> I see no change for line 686?
>>
>> A space character was missing after the word "fragmentation"; the
>> change adds it.
>>
>>
>>> - src/hotspot/share/memory/heapShared.cpp
>>>
>>> Please revert the renaming of
>>> HeapShared::check_closed_archive_heap_region_object.
>>> HeapShared::check_closed_archive_heap_region_object_class is not
>>> correct. We check the objects in the graph, not the classes. It's
>>> better to use a separate bug and changeset for the bug fix in this
>>> function, so it's not mixed with the change for 8214388, for better
>>> back-tracking.
>> I'll revert the whole change and create a new bug.
>>
>> I found the name check_closed_archive_heap_region_object confusing
>> because it says "object" but is passed a class. I think it's better to
>> pass the object, and then have this function read the class from the
>> object. I'll send a separate RFR.
>>
>>> - src/hotspot/share/memory/metaspaceShared.cpp
>>>
>>> The G1-specific code needs to be wrapped with #if INCLUDE_G1GC.
>> Done
>>
>>
>>> 391           log_warning(cds)("[line %d] extra interned string ignored; size too large: %d",
>>> 392                            reader.last_line_no(), utf8_length);
>>>
>>> You can use cds+heap for object-archiving-related logging.
>> Done
>>
>>
>>> 1718     if (UseG1GC && HeapShared::is_heap_object_archiving_allowed()) {
>>>
>>> As Stefan also pointed out, you don't need the UseG1GC check here.
>>> HeapShared::is_heap_object_archiving_allowed already includes it.
>> Fixed
>>
>>
>>> G1HeapVerifier::verify_ready_for_archiving() is called from two
>>> places. One is in HeapShared::archive_java_heap_objects and the
>>> other one is in metaspaceShared.cpp. The duplication should be
>>> eliminated.
>> I removed the one inside metaspaceShared.cpp.
>>
>>
>>> I also question the need for
>>> G1HeapVerifier::verify_ready_for_archiving(). As there is already a
>>> more efficient check in FileMapInfo::write_archive_heap_regions to
>>> detect fragmented archive regions, verify_ready_for_archiving does
>>> not add additional value. It's also a more expensive check.
>> It's supposed to provide better logging for diagnosing future
>> fragmentation (much easier to read than tracing all the gc+region
>> logs). I don't know about you, but for me, trying to find out why
>> fragmentation happened in JDK-8214217 was pretty damn tedious!
>>
>> It's not expensive at all since it just scans the list of regions once.
>>
>> Thanks
>>
>> - Ioi
>>
>>
>>> Thanks,
>>> Jiangli
>>>
>>> On 11/30/18 11:02 AM, Ioi Lam wrote:
>>>> Hi Stefan,
>>>>
>>>> Thanks for the review. Here's an updated patch:
>>>>
>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02/
>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02-delta/
>>>>
>>>> Please see responses to your comments below.
>>>>
>>>> I also added code to handle humongous strings that can be allocated
>>>> during dump time, and to make sure they are GC'ed (and thus no
>>>> humongous regions remain) before we start dumping the heap.
>>>>
>>>> I'll add one more test for the handling of humongous regions that
>>>> exist while the heap is being dumped, but that will require a little
>>>> hacking with JVMTI. I'll post that later :-)
>>>>
>>>> On 11/30/18 3:17 AM, Stefan Johansson wrote:
>>>>> Hi Ioi,
>>>>>
>>>>> I mostly looked at the GC interaction, some comments below.
>>>>>
>>>>> On 2018-11-30 02:07, Ioi Lam wrote:
>>>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v01/
>>>>>
>>>>> src/hotspot/share/gc/g1/g1CollectedHeap.cpp
>>>>> -------------------------------------------
>>>>> 325 #if INCLUDE_CDS
>>>>> 326   if (DumpSharedSpaces && HeapShared::is_heap_object_archiving_allowed()) {
>>>>> 327     // See G1HeapVerifier::verify_ready_for_archiving()
>>>>> 328     // This should not happen during normal operation of -Xshare:dump, so let's give a warning.
>>>>> 329     log_warning(gc)("Avoid allocating humongous objects during -Xshare:dump (" SIZE_FORMAT
>>>>> 330                     " bytes) - may cause fragmentation",
>>>>> 331                     word_size * HeapWordSize);
>>>>> 332   }
>>>>> 333 #endif
>>>>>
>>>>> I think this warning should be left out; you still have output in
>>>>> the verification that tells the user what is problematic.
>>>>
>>>> Removed.
>>>>
>>>>> src/hotspot/share/gc/g1/g1HeapVerifier.cpp
>>>>> -------------------------------------------
>>>>> What do you think about extending the verification with one more
>>>>> state, maybe _fatal_hole? When iterating the regions you can then
>>>>> look for holes not caused by humongous allocations. If such holes
>>>>> are found, log an error and fail.
>>>>>
>>>>> You can then turn the current log_error into a log_warning letting
>>>>> the user know there have been humongous allocations, which can
>>>>> cause problems during dump time.
>>>>>
>>>>> 314   // turn on the following assert only if we disallow humongous
>>>>> 315   // allocations during dump time.
>>>>> 316   //assert(!cl.has_holes(), "must not have holes");
>>>>>
>>>>> I think you should remove these lines.
>>>>
>>>> I modified the verification loop as you suggested, and changed the
>>>> assert to check that all holes must be caused by humongous regions.
>>>>
>>>>> src/hotspot/share/gc/shared/adaptiveSizePolicy.cpp
>>>>> --------------------------------------------------
>>>>> 184 #if INCLUDE_CDS
>>>>> 185   if (DumpSharedSpaces && HeapShared::is_heap_object_archiving_allowed()) {
>>>>> 186     // See G1HeapVerifier::verify_ready_for_archiving()
>>>>> 187     return 1;
>>>>> 188   }
>>>>> 189 #endif
>>>>>
>>>>> I think this decision should be moved to the G1 argument parsing.
>>>>> In G1Arguments::initialize() you can add this after the initial
>>>>> setting of ParallelGCThreads:
>>>>>
>>>>> // When dumping the CDS archive we want to reduce fragmentation by
>>>>> // triggering a full collection. To get as low fragmentation as
>>>>> // possible we only use one worker thread.
>>>>> if (DumpSharedSpaces) {
>>>>>   FLAG_SET_ERGO(uint, ParallelGCThreads, 1);
>>>>>   FLAG_SET_ERGO(bool, ExplicitGCInvokesConcurrent, false);
>>>>> }
>>>>
>>>> Done.
>>>>
>>>>> As you see, I also include setting ExplicitGCInvokesConcurrent to
>>>>> false, because if that flag is set a full GC will be turned into a
>>>>> concurrent one.
>>>>>
>>>>> src/hotspot/share/memory/metaspaceShared.cpp
>>>>> --------------------------------------------
>>>>> 1706 #if INCLUDE_CDS_JAVA_HEAP
>>>>> 1707   if (UseG1GC && HeapShared::is_heap_object_archiving_allowed()) {
>>>>> 1708     // Avoid fragmentation while archiving heap objects.
>>>>> 1709     Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(true);
>>>>> 1710     Universe::heap()->collect(GCCause::_java_lang_system_gc);
>>>>> 1711     Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(false);
>>>>> 1712     G1HeapVerifier::verify_ready_for_archiving();
>>>>> 1713   }
>>>>> 1714 #endif
>>>>>
>>>>> Do you need the #if here? is_heap_object_archiving_allowed() only
>>>>> returns true when INCLUDE_CDS_JAVA_HEAP is true. Also, the check
>>>>> for UseG1GC is already done within
>>>>> is_heap_object_archiving_allowed().
>>>>>
>>>>> Regarding triggering the GC, the calls to the soft_ref_policy are
>>>>> not needed unless you care about clearing soft references.
>>>>
>>>> I removed the #if. I kept the clearing of soft refs -- soft refs
>>>> are not archived, so clearing them would give more free space for
>>>> archiving.
>>>>
>>>>> An even nicer solution would be to add a new GCCause that is only
>>>>> used for CDS dumping; that way you can also get a better message
>>>>> in the GC log than "System.gc()", and you can also skip the
>>>>> setting of ExplicitGCInvokesConcurrent that I mentioned above,
>>>>> since the new GCCause won't be affected by that flag.
>>>>
>>>> I added GCCause::_cds_dump and skipped the setting of
>>>> ExplicitGCInvokesConcurrent.
>>>>
>>>> Thanks - Ioi
>>>>
>>>>> Thanks,
>>>>> Stefan
>>>>>
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8214388
>>>>>>
>>>>>> Symptom:
>>>>>> ========
>>>>>>
>>>>>> "java -Xshare:dump" would intermittently fail with:
>>>>>>
>>>>>>   Unable to write archive heap ... due to fragmentation.
>>>>>>
>>>>>> This usually happens when you try to dump many classes (e.g.,
>>>>>> 10000) with a relatively small heap (e.g., 1g) and a lot of GC
>>>>>> threads (e.g., 24).
>>>>>>
>>>>>> (Example use case -- Eclipse IDE loads 15,000 classes with a
>>>>>> 512MB heap.)
>>>>>>
>>>>>> When GC happens during class loading, some old G1 regions may be
>>>>>> placed at the top end of the heap (due to the large number of GC
>>>>>> threads).
>>>>>>
>>>>>> Later, when writing the archived heap, G1 tries to allocate
>>>>>> contiguous regions from the top end of the heap. This would fail
>>>>>> due to the presence of those old regions.
>>>>>>
>>>>>> Fix:
>>>>>> ====
>>>>>>
>>>>>> As suggested by Stefan Johansson, we run a full GC with a single
>>>>>> GC thread. This guarantees that all old blocks will be moved to
>>>>>> the bottom end of the heap.
>>>>>>
>>>>>> Because there's no API for specifying the number of GC threads
>>>>>> dynamically, and CDS dump time doesn't allocate lots of objects,
>>>>>> I have statically forced the number of threads to 1 in
>>>>>> AdaptiveSizePolicy::calc_active_workers during CDS dump time.
>>>>>>
>>>>>> (This seems like a more direct way than assigning
>>>>>> ParallelGCThreads ...)
>>>>>>
>>>>>> Notes:
>>>>>> ======
>>>>>>
>>>>>> 1. Humongous regions cannot move. However, currently we don't do
>>>>>> humongous allocations during CDS dump, so we should be fine. I
>>>>>> have added diagnostic warnings so that if fragmentation does
>>>>>> happen in the future, the user can find out why.
>>>>>>
>>>>>> 2. Fixed a minor bug in
>>>>>> HeapShared::check_closed_archive_heap_region_object_class
>>>>>>
>>>>>> 3. Fixed a bug in MetaspaceShared::read_extra_data, where the
>>>>>> symbols/strings would be lost due to GC.
>>>>>>
>>>>>> 4. Added a stress test to successfully archive about 18MB of
>>>>>> objects with -Xmx64m. This used to fail even with -Xmx512m on a
>>>>>> Solaris box.
>>>>>>
>>>>>> 5. With default CDS archive generation during JDK build time,
>>>>>> -Xmx128m is used. Before this fix, the EDEN region lives at the
>>>>>> top of the heap during CDS dump time, and we end up with a 2MB
>>>>>> gap between the archive regions and the top of the heap. Because
>>>>>> the archive regions cannot move, at run time, using CDS would
>>>>>> reduce the max humongous allocation by 2MB.
>>>>>>
>>>>>> With this fix, the archive regions are now placed at the very
>>>>>> top of the heap, so the gap no longer exists.
>>>>>>
>>>>>> Tests:
>>>>>> ======
>>>>>>
>>>>>> Running hs-tiers{1-6} for sanity.
>>>>>>
>>>>>> Thanks - Ioi



More information about the hotspot-runtime-dev mailing list