RFR(S) 8214388 CDS dumping fails with java heap fragmentation

Ioi Lam ioi.lam at oracle.com
Fri Nov 30 22:36:16 UTC 2018


Hi Jiangli,

Thanks for the review!


Updated webrev: 
http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v03-delta/


On 11/30/18 1:04 PM, Jiangli Zhou wrote:
> Hi Ioi,
>
> - src/hotspot/share/gc/g1/g1Arguments.cpp
>
> You missed FLAG_SET_ERGO(bool, ExplicitGCInvokesConcurrent, false);

As Stefan mentioned below, doing this is not necessary because I have 
added a new GC cause.

>
> - src/hotspot/share/gc/g1/g1CollectedHeap.cpp
>
> - src/hotspot/share/gc/shared/adaptiveSizePolicy.cpp
>
> Above files have no change.

I had multiple MQ patches that cancel each other out, but webrev still 
generates an empty diff for those files. Please ignore them.

>
> - src/hotspot/share/gc/g1/g1HeapVerifier.cpp
>
> 320     log_warning(gc, verify)("All free regions should be at the top end of the heap, but "
> 321                             " we found holes. This is probably caused by (unmovable) humongous"
> 322                             " allocations, and may lead to fragmentation while"
> 323                             " writing archive heap memory regions.");
>
> This warning is not very clear and can be false alarming. The hole in
> the middle of the heap causes heap fragmentation, but not necessarily
> in the archived regions. I'd leave the warning out and let the
> existing code in FileMapInfo::write_archive_heap_regions detect
> archive fragmentation and report an error.
>
This warning is about a *possibility* of fragmentation. It gives the 
user a chance to check what's going on, without having to wait for an 
actual failure to happen.

In practice, this means if the JDK start-up code is changed in the 
future to allocate a humongous array, someone will discover this while 
building the JDK and dumping the default archive. That way, we can fix 
the JDK to avoid the use of humongous arrays.

In fact, I just modified the code to warn against humongous regions even 
if no holes are found.
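(For background, and not part of this patch: G1 treats an allocation as 
humongous when it is larger than half a heap region, and humongous 
regions are never moved by the collector, which is why they can leave 
holes. Roughly, with simplified names -- the real check lives in 
G1CollectedHeap:)

    // Simplified sketch of G1's humongous classification; illustrative only.
    static bool is_humongous(size_t word_size, size_t region_size_in_words) {
      // Anything larger than half a region gets its own contiguous
      // humongous region(s), which are never moved afterwards.
      return word_size > region_size_in_words / 2;
    }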


> - src/hotspot/share/gc/shared/gcCause.cpp
>
> I'd suggest renaming _cds_dump to _archive_time_gc. The word
> 'archive' has been adapted to be used in GC code.
>

Done
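After the rename it looks roughly like this (sketch only; the exact 
descriptive string may differ in the webrev):

    // gcCause.hpp (sketch)
      _archive_time_gc,

    // gcCause.cpp (sketch)
      case _archive_time_gc:
        return "Full GC for -Xshare:dump";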


> - src/hotspot/share/memory/filemap.cpp
>
> I see no change for line 686?
>
The only change there is adding a space character that was missing 
after the word "fragmentation", so the diff is easy to miss.
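(To illustrate why it's easy to miss -- hypothetical message text, not 
the actual filemap.cpp line:)

    // Adjacent C++ string literals are concatenated, so without the
    // trailing space the two pieces run together in the output:
    log_error(cds)("Unable to write archive heap regions due to fragmentation."  // <-- missing trailing space
                   "Try increasing the heap size.");
    // prints "...fragmentation.Try increasing..." instead of
    //        "...fragmentation. Try increasing..."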


> - src/hotspot/share/memory/heapShared.cpp
>
> Please revert the renaming for
> HeapShared::check_closed_archive_heap_region_object.
> HeapShared::check_closed_archive_heap_region_object_class is not
> correct. We check the objects in the graph, not the classes. It's
> better to use a separate bug and changeset for the bug fix in this
> function, so it's not mixed with the change for 8214388, for better
> back-tracking.
I'll revert the whole change and create a new bug.

I found the name check_closed_archive_heap_region_object confusing 
because it says "object" but passes a class. I think it's better to pass 
the object, and then have this function read the class from the object. 
I'll send a separate RFR.
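Something along these lines (just a sketch of the direction, not the 
final code for that RFR):

    // Sketch: pass the archived object itself and derive its class inside.
    void HeapShared::check_closed_archive_heap_region_object(oop obj) {
      Klass* k = obj->klass();
      // ... verify that an object of this class is safe to place in the
      //     closed archive heap region ...
    }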

>
> - src/hotspot/share/memory/metaspaceShared.cpp
>
> The G1 specific code needs to be wrapped with #if INCLUDE_G1GC.
>
Done
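I.e., the G1-only call is now guarded roughly like this (sketch):

    #if INCLUDE_G1GC
      G1HeapVerifier::verify_ready_for_archiving();
    #endif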


> 391           log_warning(cds)("[line %d] extra interned string ignored; size too large: %d",
> 392                            reader.last_line_no(), utf8_length);
>
> You can use cds+heap for object archiving related logging.
>
Done
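The change is roughly (sketch):

    // Before:
    log_warning(cds)("[line %d] extra interned string ignored; size too large: %d",
                     reader.last_line_no(), utf8_length);

    // After -- tagged cds+heap since this is object-archiving related:
    log_warning(cds, heap)("[line %d] extra interned string ignored; size too large: %d",
                           reader.last_line_no(), utf8_length);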


> - 1718     if (UseG1GC && HeapShared::is_heap_object_archiving_allowed()) {
>
> As Stefan also pointed out, you don't need the UseG1GC check here.
> HeapShared::is_heap_object_archiving_allowed already includes it.
>
Fixed


> G1HeapVerifier::verify_ready_for_archiving() is called in two
> places. One is in HeapShared::archive_java_heap_objects and the
> other one is in metaspaceShared.cpp. The duplication should be
> eliminated.
I removed the one inside metaspaceShared.cpp.


> I also have a question about the need for
> G1HeapVerifier::verify_ready_for_archiving(). As there is already a
> more efficient check in FileMapInfo::write_archive_heap_regions to
> detect fragmented archive regions, verify_ready_for_archiving does
> not add additional value. It's also a more expensive check.
>
It's supposed to provide better logging for diagnosing future 
fragmentation (much easier to read than tracing through all the 
gc+region logs). I don't know about you, but for me trying to find out 
why fragmentation happened in JDK-8214217 was pretty damn tedious!

It's not expensive at all since it just scans the list of regions once.
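(For the record, the whole check is a single pass over the region table 
with a HeapRegionClosure, roughly like this -- simplified sketch, not 
the exact code in the webrev:)

    // Walk the regions bottom-to-top; once a free region has been seen,
    // any occupied region after it means there's a hole below the top
    // of the heap.
    class CheckForHolesClosure : public HeapRegionClosure {
      bool _seen_free;
      bool _has_holes;
    public:
      CheckForHolesClosure() : _seen_free(false), _has_holes(false) {}
      virtual bool do_heap_region(HeapRegion* hr) {
        if (hr->is_free()) {
          _seen_free = true;
        } else if (_seen_free) {
          _has_holes = true;   // occupied region above a free one
        }
        return false;          // false == keep iterating
      }
      bool has_holes() const { return _has_holes; }
    };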

Thanks

- Ioi


> Thanks,
> Jiangli
>
> On 11/30/18 11:02 AM, Ioi Lam wrote:
>> Hi Stefan,
>>
>> Thanks for the review. Here's an updated patch:
>>
>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02/
>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v02-delta/
>>
>> Please see responses to your comments below.
>>
>> I also added code to handle humongous strings that can be allocated
>> during dump time, and make sure they are GC'ed (and thus no
>> humongous regions remain) before we start dumping the heap.
>>
>> I'll add one more test to exercise the handling of humongous regions
>> that exist while the heap is being dumped, but that will require a
>> little hacking with JVMTI. I'll post that later :-)
>>
>> On 11/30/18 3:17 AM, Stefan Johansson wrote:
>>> Hi Ioi,
>>>
>>> I mostly looked at the GC interaction, some comments below.
>>>
>>> On 2018-11-30 02:07, Ioi Lam wrote:
>>>> http://cr.openjdk.java.net/~iklam/jdk12/8214388-dumptime-fragmentation.v01/
>>>
>>> src/hotspot/share/gc/g1/g1CollectedHeap.cpp
>>> -------------------------------------------
>>> 325 #if INCLUDE_CDS
>>> 326   if (DumpSharedSpaces && HeapShared::is_heap_object_archiving_allowed()) {
>>> 327     // See G1HeapVerifier::verify_ready_for_archiving()
>>> 328     // This should not happen during normal operation of -Xshare:dump, so let's give a warning.
>>> 329     log_warning(gc)("Avoid allocating humongous objects during -Xshare:dump (" SIZE_FORMAT
>>> 330                     " bytes) - may cause fragmentation",
>>> 331                     word_size * HeapWordSize);
>>> 332   }
>>> 333 #endif
>>>
>>> I think this warning should be left out, you still have output in
>>> the verification that tells the user what is problematic.
>>>
>>
>> Removed.
>>
>>> src/hotspot/share/gc/g1/g1HeapVerifier.cpp
>>> -------------------------------------------
>>> What do you think about extending the verification with one more
>>> state, maybe _fatal_hole. When iterating the regions you can then
>>> look for holes not caused by humongous. If such holes are found log
>>> an error and fail.
>>>
>>> You can then turn the current log_error into a log_warning letting
>>> the user know there have been humongous allocations which can cause
>>> problems during dump time.
>>>
>>> 314   // turn on the following assert only if we disallow humongous allocations during
>>> 315   // dump time.
>>> 316   //assert(!cl.has_holes(), "must not have holes");
>>>
>>> I think you should remove these lines.
>>>
>>
>> I modified the verification loop as you suggested, and changed the
>> assert to check that all holes must be caused by humongous regions.
>>
>>> src/hotspot/share/gc/shared/adaptiveSizePolicy.cpp
>>> --------------------------------------------------
>>> 184 #if INCLUDE_CDS
>>> 185   if (DumpSharedSpaces && HeapShared::is_heap_object_archiving_allowed()) {
>>> 186     // See G1HeapVerifier::verify_ready_for_archiving()
>>> 187     return 1;
>>> 188   }
>>> 189 #endif
>>> 190
>>>
>>> I think this decision should be moved to the g1 argument parsing.
>>> In G1Arguments::initialize() you can add this after the initial
>>> setting of ParallelGCThreads.
>>>
>>> // When dumping the CDS archive we want to reduce fragmentation by
>>> // triggering a full collection. To get as low fragmentation as
>>> // possible we only use one worker thread.
>>> if (DumpSharedSpaces) {
>>>   FLAG_SET_ERGO(uint, ParallelGCThreads, 1);
>>>   FLAG_SET_ERGO(bool, ExplicitGCInvokesConcurrent, false);
>>> }
>>>
>>
>> Done.
>>
>>> As you see I also include setting ExplicitGCInvokesConcurrent to
>>> false, because if that flag is set a full gc will be turned into a
>>> concurrent one.
>>>
>>> src/hotspot/share/memory/metaspaceShared.cpp
>>> --------------------------------------------
>>> 1706 #if INCLUDE_CDS_JAVA_HEAP
>>> 1707   if (UseG1GC && HeapShared::is_heap_object_archiving_allowed()) {
>>> 1708     // Avoid fragmentation while archiving heap objects.
>>> 1709     Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(true);
>>> 1710     Universe::heap()->collect(GCCause::_java_lang_system_gc);
>>> 1711     Universe::heap()->soft_ref_policy()->set_should_clear_all_soft_refs(false);
>>> 1712     G1HeapVerifier::verify_ready_for_archiving();
>>> 1713   }
>>> 1714 #endif
>>>
>>> Do you need the #if here, is_heap_object_archiving_allowed() only
>>> returns true when INCLUDE_CDS_JAVA_HEAP is true. Also the check
>>> for UseG1GC is also done within is_heap_object_archiving_allowed().
>>>
>>> Regarding triggering the GC, the calls to the soft_ref_policy are
>>> not needed unless you care about clearing soft references.
>>
>> I removed the #if. I kept the clearing of soft refs -- soft refs
>> are not archived, so clearing them would give more free space for
>> archiving.
>>
>>> An even nicer solution would be to add a new GCCause that is only
>>> used for CDS dumping, that way you can also get a better message
>>> in the GC log than "System.gc()" and you can also skip the setting
>>> of ExplicitGCInvokesConcurrent that I mentioned above, since the
>>> new GCCause won't be affected by that flag.
>>>
>>
>> I added GCCause::_cds_dump and skipped the setting of
>> ExplicitGCInvokesConcurrent.
>>
>> Thanks
>> - Ioi
>>
>>> Thanks,
>>> Stefan
>>>
>>>> https://bugs.openjdk.java.net/browse/JDK-8214388
>>>>
>>>> Symptom:
>>>> ========
>>>>
>>>> "java -Xshare:dump" would intermittently fail with
>>>>
>>>>     Unable to write archive heap ... due to fragmentation.
>>>>
>>>> This usually happens when you try to dump many classes (e.g.,
>>>> 10000) with a relatively small heap (e.g., 1g) with a lot of GC
>>>> threads (e.g., 24).
>>>>
>>>> (Example use case -- Eclipse IDE loads 15,000 classes with a
>>>> 512MB heap.)
>>>>
>>>> When GC happens during class loading, some old G1 regions may be
>>>> placed at the top end of the heap (due to the large number of GC
>>>> threads).
>>>>
>>>> Later, when writing the archived heap, G1 tries to allocate
>>>> contiguous regions from the top end of the heap. This would fail
>>>> due to the presence of those old regions.
>>>>
>>>> Fix:
>>>> ====
>>>>
>>>> As suggested by Stefan Johansson, we run a full GC with a single
>>>> GC thread. This guarantees that all old blocks will be moved to
>>>> the bottom end of the heap.
>>>>
>>>> Because there's no API for specifying the number of GC threads
>>>> dynamically, and CDS dump time doesn't allocate lots of objects,
>>>> I have statically forced the number of threads to 1 in
>>>> AdaptiveSizePolicy::calc_active_workers during CDS dump time.
>>>>
>>>> (This seems like a more direct way than assigning
>>>> ParallelGCThreads ...)
>>>>
>>>> Notes:
>>>> ======
>>>>
>>>> 1. Humongous regions cannot move. However, currently we don't do
>>>>    humongous allocations during CDS dump, so we should be fine.
>>>>    I have added diagnostic warnings so if fragmentation does
>>>>    happen in the future, the user can find out why.
>>>>
>>>> 2. Fixed a minor bug in
>>>>    HeapShared::check_closed_archive_heap_region_object_class.
>>>>
>>>> 3. Fixed a bug in MetaspaceShared::read_extra_data, where the
>>>>    symbols/strings would be lost due to GC.
>>>>
>>>> 4. Added a stress test to successfully archive about 18MB of
>>>>    objects with -Xmx64m. This used to fail even with -Xmx512m on
>>>>    a Solaris box.
>>>>
>>>> 5. With default CDS archive generation during the JDK build,
>>>>    -Xmx128m is used. Before this fix, the EDEN region lives at
>>>>    the top of the heap during CDS dump time, and we end up with a
>>>>    2MB gap between the archive regions and the top of the heap.
>>>>    Because the archive regions cannot move, at run time, using
>>>>    CDS would reduce the max humongous allocation by 2MB.
>>>>
>>>>    With this fix, the archive regions are now placed at the very
>>>>    top of the heap, so the gap no longer exists.
>>>>
>>>> Tests:
>>>> ======
>>>>
>>>> Running hs-tiers{1-6} for sanity.
>>>>
>>>> Thanks - Ioi
>>>>

