RFR(S) 8245925 G1 allocates EDEN region after CDS has executed GC
Jiangli Zhou
jianglizhou at google.com
Tue Jun 2 23:01:48 UTC 2020
On Mon, Jun 1, 2020 at 7:28 PM Ioi Lam <ioi.lam at oracle.com> wrote:
>
>
>
> On 5/31/20 8:33 PM, Jiangli Zhou wrote:
> > On Sun, May 31, 2020 at 8:27 PM Jiangli Zhou <jianglizhou at google.com> wrote:
> >> On Fri, May 29, 2020 at 10:44 PM Ioi Lam <ioi.lam at oracle.com> wrote:
> >>>
> >>>
> >>> On 5/29/20 8:40 PM, Jiangli Zhou wrote:
> >>>> On Fri, May 29, 2020 at 7:30 PM Ioi Lam <ioi.lam at oracle.com> wrote:
> >>>>> https://bugs.openjdk.java.net/browse/JDK-8245925
> >>>>> http://cr.openjdk.java.net/~iklam/jdk15/8245925-g1-eden-after-cds-gc.v01/
> >>>>>
> >>>>>
> >>>>> Summary:
> >>>>>
> >>>>> CDS supports archived heap objects only for G1. During -Xshare:dump,
> >>>>> CDS executes a full GC so that G1 will compact the heap regions, leaving
> >>>>> maximum contiguous free space at the top of the heap. Then, the archived
> >>>>> heap regions are allocated from the top of the heap.
> >>>>>
> >>>>> Under some circumstances, java.lang.ref.Cleaners will execute
> >>>>> after the GC has completed. The cleaners may allocate or synchronize, which
> >>>>> will cause G1 to allocate an EDEN region at the top of the heap.
> >>>> This is an interesting one. Please give more details on the
> >>>> circumstances under which java.lang.ref.Cleaners cause the issue. It's
> >>>> unclear to me why it hasn't shown up before.
> >>> Hi Jiangli,
> >>>
> >>> Thanks for the review. It's very helpful.
> >>>
> >>> The assert (see my comment in JDK-8245925) happened in my prototype for
> >>> JDK-8244778
> >>>
> >>> http://cr.openjdk.java.net/~iklam/jdk15/8244778-archive-full-module-graph.v00.8/
> >>>
> >>> I have to archive AppClassLoader and PlatformClassLoader, but need to
> >>> clear their URLClassPath field (the "ucp" field). See
> >>> clear_loader_states in metaspaceShared.cpp. Because of this, some
> >>> java.util.zip.ZipFiles referenced by the URLClassPath become garbage,
> >>> and their Cleaners are executed after full GC has finished.
> >>>
> >> I haven't looked at your 8244778-archive-full-module-graph change yet.
> >> If you are going to archive class loader objects, you probably want to
> >> go with a solution that scrubs fields that are 'not archivable' and
> >> then restores them at runtime. It sounds like you are going with that.
> >> When I worked on the initial implementation for system module object
> >> archiving, I implemented a static field scrubber with the goal of
> >> archiving class loaders. I didn't complete it as it was not yet
> >> needed, but the code is probably helpful for you now. I might have
> >> sent you a pointer to one of the versions at the time, but try
> >> looking under my old /home directory if it's still around. It might be
> >> good to trigger runtime field restoration from Java code; that's the
> >> part I haven't fully explored yet. Hopefully these inputs are
> >> useful for your current work.
>
> Hi Jiangli,
>
> I can't access your old home directory. I already implemented the field
> scrubbing in the above patch. It's in clear_loader_states in
> metaspaceShared.cpp.
>
I'll look at it after it's ready for code review.
> >>> I think the bug has always existed, but was just never triggered because
> >>> we had not activated the Cleaners.
> >>>
> >>>>> The fix is simple -- after CDS has entered a safepoint, if EDEN regions
> >>>>> exist,
> >>>>> exit the safepoint, run GC, and try again. Eventually all the cleaners will
> >>>>> be executed and no more allocation can happen.
> >>>>>
> >>>>> For safety, I limit the retry count to 30 (or about 9 seconds in total).
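
For illustration, a minimal standalone sketch of the retry control flow
described above. The helpers below are hypothetical stand-ins, not HotSpot
APIs; the actual patch checks G1's eden_regions_count() from inside the CDS
safepoint:

#include <cstdio>

// Stub state and helpers standing in for the real HotSpot calls (hypothetical).
static int  g_eden_regions = 1;
static int  eden_region_count()  { return g_eden_regions; }
static void run_full_gc()        { g_eden_regions = 0; }   // Cleaners could re-allocate afterwards
static void dump_in_safepoint()  { std::puts("archiving heap regions at the top of the heap"); }

// Control flow of the fix: after each full GC, check whether a Cleaner has
// allocated an EDEN region; if so, leave the safepoint, GC again, and retry,
// bounded at 30 attempts (about 9 seconds in total).
static bool dump_with_retry() {
  const int max_retries = 30;
  for (int i = 0; i < max_retries; i++) {
    run_full_gc();
    if (eden_region_count() == 0) {
      dump_in_safepoint();
      return true;
    }
  }
  return false;
}

int main() { return dump_with_retry() ? 0 : 1; }
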
> >>>>>
> >>>> I think it's better to skip the top allocated region(s) in such cases
> >>>> and avoid retrying. Dump time performance is important, as we are
> >>>> moving the cost from runtime to CDS dump time. It's desirable to keep
> >>>> the dump time cost as low as possible, so that using CDS delivers a
> >>>> better net gain overall.
> >>>>
> >>>> Here are some comments for your current webrev itself.
> >>>>
> >>>> 1611 static bool has_unwanted_g1_eden_regions() {
> >>>> 1612 #if INCLUDE_G1GC
> >>>> 1613   return HeapShared::is_heap_object_archiving_allowed() && UseG1GC &&
> >>>> 1614          G1CollectedHeap::heap()->eden_regions_count() > 0;
> >>>> 1615 #else
> >>>> 1616   return false;
> >>>> 1617 #endif
> >>>> 1618 }
> >>>>
> >>>> You can remove 'UseG1GC' from line 1613, as the
> >>>> is_heap_object_archiving_allowed() check already covers it:
> >>>>
> >>>>   static bool is_heap_object_archiving_allowed() {
> >>>>     CDS_JAVA_HEAP_ONLY(return (UseG1GC && UseCompressedOops &&
> >>>>                                UseCompressedClassPointers);)
> >>>>     NOT_CDS_JAVA_HEAP(return false;)
> >>>>   }
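
The simplified check would then read roughly as follows (a sketch of the
suggested change based on the snippet quoted above, not the final patch):

// Sketch: is_heap_object_archiving_allowed() already implies UseG1GC, so the
// explicit flag check can be dropped.
static bool has_unwanted_g1_eden_regions() {
#if INCLUDE_G1GC
  return HeapShared::is_heap_object_archiving_allowed() &&
         G1CollectedHeap::heap()->eden_regions_count() > 0;
#else
  return false;
#endif
}
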
> >>>>
> >>>> Please include heap archiving code under #if INCLUDE_CDS_JAVA_HEAP.
> >>>> It's better to extract the GC handling code in
> >>>> VM_PopulateDumpSharedSpace::doit() into a separate API in
> >>>> heapShared.*.
> >>>>
> >>>> It's time to enhance heap archiving to use a separate buffer when
> >>>> copying the objects at dump time (discussed before), as a longer-term
> >>>> fix. I'll file an RFE.
> >>> Thanks for reminding me. I think that is a better way to fix this
> >>> problem. It should be fairly easy to do, as we can already relocate the
> >>> heap regions using HeapShared::patch_archived_heap_embedded_pointers().
> >>> Let me try to implement it.
> >>>
> >> Sounds good. Thanks for doing that.
> >>
> >>> BTW, the GC speed is very fast, because the heap is not used very much
> >>> during -Xshare:dump. -Xlog:gc shows:
> >>>
> >>> [0.259s][info][gc ] GC(0) Pause Full (Full GC for -Xshare:dump)
> >>> 4M->1M(32M) 8.220ms
> >>>
> >>> So we have allocated only 4MB of objects, and only 1MB of those are
> >>> reachable.
> >>>
> >>> Anyway, I think we can even avoid running the GC altogether. We can scan
> >>> for contiguous free space from the top of the heap (below the EDEN
> >>> region). If there's more contiguous free space than the currently
> >>> allocated heap regions, we know for sure that we can archive all the
> >>> heap objects that we need without doing the GC. That can be done as an
> >>> RFE after copying the objects. It won't save much though (8ms out of
> >>> about 700ms of total -Xshare:dump time).
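
A standalone illustration of that free-space scan, assuming a simple boolean
region map (this is not G1 code; G1 would consult its own region table, and
the helper names are made up):

#include <cstddef>
#include <vector>

// Start at the top of the heap, skip any used regions sitting there (e.g. an
// EDEN region allocated by a Cleaner), then count the run of contiguous free
// regions below them. If the run is at least as large as the number of
// regions the archive needs, the dump-time GC could be skipped.
static bool can_skip_dump_gc(const std::vector<bool>& region_is_free,  // indexed bottom -> top
                             size_t regions_needed) {
  auto it = region_is_free.rbegin();
  while (it != region_is_free.rend() && !*it) ++it;   // skip used regions at the very top
  size_t contiguous_free = 0;
  while (it != region_is_free.rend() && *it) { ++contiguous_free; ++it; }
  return contiguous_free >= regions_needed;
}

int main() {
  // 8-region heap: bottom two used, top region used (EDEN), five free in between.
  std::vector<bool> map = {false, false, true, true, true, true, true, false};
  return can_skip_dump_gc(map, 2) ? 0 : 1;   // 5 free regions >= 2 needed
}
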
> >> Looks like we had similar thoughts about finding free heap regions for
> >> copying. Here are the details:
> >>
> >> Solution 1):
> >> Allocate a buffer (no specific memory location requirement) and copy
> >> the heap objects to the buffer. Additional buffers can be allocated
> >> when needed, and they don't need to form a consecutive block of
> >> memory. The pointers within the copied Java objects need to be
> >> computed from the Java heap top as the 'real heap address' (so their
> >> runtime positions are at the heap top), instead of the buffer address.
> >> Region verification code also needs to be updated to reflect how the
> >> pointers are computed now.
> >>
> >> Solution 2):
> >> Find a range (consecutive heap regions) within the Java heap that is
> >> free. Copy the archive objects to that range. The pointer update and
> >> verification are similar to solution 1).
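
A minimal sketch of the pointer arithmetic implied by Solution 1, assuming
the archive is laid out as one logical block ending at the heap top (all
names and values here are made up for illustration, not HotSpot code):

#include <cstdint>
#include <cstdio>

// Objects are copied into an ordinary dump-time buffer, but every reference
// stored in the copies is rewritten to the address the object will occupy at
// runtime, i.e. as if the whole archive sat at the top of the Java heap.
struct ArchiveLayout {
  uintptr_t buffer_base;     // where the copied objects actually live at dump time
  uintptr_t requested_base;  // heap_end - total_archive_size: intended runtime placement
};

// Translate a dump-time buffer address into the 'real heap address' written
// into the archived object fields (and checked by the verification code).
static uintptr_t to_requested_address(const ArchiveLayout& layout, uintptr_t buffer_addr) {
  return buffer_addr - layout.buffer_base + layout.requested_base;
}

int main() {
  ArchiveLayout layout = { 0x10000000u, 0xffc00000u };  // arbitrary example values
  std::printf("%#llx\n",
              (unsigned long long)to_requested_address(layout, 0x10000100u));  // 0xffc00100
  return 0;
}
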
> >>
> >>
>
> I am thinking of doing something simpler. During heap archiving, G1 just
> allocates the highest free region:
>
>   bool G1ArchiveAllocator::alloc_new_region() {
>     // Allocate the highest free region in the reserved heap,
>     // and add it to our list of allocated regions. It is marked
>     // archive and added to the old set.
>     HeapRegion* hr = _g1h->alloc_highest_free_region();
>
> If there are used regions scattered around in the heap, we will end up
> with a few archive regions that are not contiguous, and the highest
> archive region may not be flush with the top of the heap.
>
> Inside VM_PopulateDumpSharedSpace::dump_archive_heap_oopmaps, I will
> rewrite the pointers inside each of the archived regions, such that the
> contents would be the same as if all the archive regions were
> consecutively allocated at the top of the heap.
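
A rough standalone sketch of what that rewrite amounts to, assuming the
archive regions are tracked sorted by address (hypothetical names, not the
actual HotSpot change; the real adjustment would happen while walking the
oopmaps and adding the containing region's delta to each embedded pointer):

#include <cstdint>
#include <vector>

// Each archive region gets its own delta mapping its dump-time address range
// onto the position it would have had if all archive regions were stacked
// consecutively at the top of the heap.
struct ArchiveRegion {
  uintptr_t start;   // actual start address at dump time
  size_t    size;    // region size in bytes
  intptr_t  delta;   // requested_start - start, filled in below
};

// 'regions' is assumed sorted by start address, lowest first; pack them
// downward from heap_end as if they were contiguous.
static void assign_requested_positions(std::vector<ArchiveRegion>& regions, uintptr_t heap_end) {
  uintptr_t requested_top = heap_end;
  for (auto it = regions.rbegin(); it != regions.rend(); ++it) {
    uintptr_t requested_start = requested_top - it->size;
    it->delta = (intptr_t)(requested_start - it->start);
    requested_top = requested_start;
  }
}

int main() {
  std::vector<ArchiveRegion> regions = {
    { 0x20000000u, 0x400000u, 0 },   // a lower, non-adjacent archive region
    { 0x30000000u, 0x400000u, 0 },   // the highest archive region
  };
  assign_requested_positions(regions, /*heap_end=*/0x40000000u);
  // regions[1] now ends at 0x40000000; regions[0] is placed directly below it.
  return 0;
}
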
One complication is the archive region verification. You would need to
adjust that as well.
I've created JDK-8246297.
>
> This will require no more memory allocation at dump time than what it
> does today, and can be done with very little overhead.
>
> > I think you can go with solution 1 now. Solution 2) has the benefit of
> > not requiring additional memory for copying archived objects. That's
> > important as I did run into insufficient memory at dump time in real
> > use cases, so any memory saving at dump time is desirable.
> >
> > Some clarifications to avoid confusion: The insufficient memory is due
> > to memory restrictions for builds in a cloud environment.
> Could you elaborate? The Java heap space is reserved but initially not
> committed. Physical memory is allocated when we write into the archive
> heap regions.
When -Xms and -Xmx are set to the same value, all heap memory is committed
upfront, as far as I can tell. Setting -Xms and -Xmx to the same value is a
very common use case.
>
> Were you getting OOM during virtual space reservation, or during
> os::commit_memory()?
>
> How many heap objects are you dumping? The current jdk repo needs only 2
> G1 regions, so it's 2 * 4M of memory for small heaps like -Xmx128m, and
> 2 * 8M of memory for larger heaps.
It is the total available memory, which has a limit. When archiving a
very large number of classes, the total required memory (including
Java heap, metaspace, archive space, etc) during the dumping process
can exceed the memory limit.
Best,
Jiangli
>
> Thanks
> - Ioi
>
> > Thanks,
> > Jiangli
> >
> >> It's better
> >> to go with 2) when the size of the archive range is known before copying.
> >> With the planned work for class pre-initialization and enhanced object
> >> archiving support, it will be possible to obtain (or have a good estimate
> >> of) the total size before copying. Solution 1 can be enhanced to use
> >> heap memory when that happens.
> >>
> >> I'll log these details as a RFE on Monday.
> >>
> >> Best,
> >> Jiangli
> >>
> >>> I'll withdraw this patch for now, and will try to implement the object
> >>> copying.
> >>>
> >>> Thanks
> >>> - Ioi
> >>>
> >>>> Best,
> >>>> Jiangli
> >>>>
> >>>>> Thanks
> >>>>> - Ioi
> >>>>>
> >>>>>
> >>>>> <https://bugs.openjdk.java.net/browse/JDK-8245925>
>