Thoughts about SubstrateVM GC

Fri Mar 1 10:16:16 UTC 2019

Hi Christian,

thanks for your replies. This is very interesting. Some additional
comments below inline:

>> The old generation is 2 semispaces (actually, 4 with the 2 pinned
>> spaces, which I'll ask about later). When collected, live objects get
>> scavenged back-and-forth between the two spaces.
> 
> yes, in theory. In "reality" there is no contiguous memory for the
> spaces, so after a GC all the aligned chunks of the from-space are
> either returned to the OS or cached and immediately reused for young
> generation allocation.

Aha, ok. This definitely affects future design decisions.

>> - The policy when to start collecting seems a bit unclear to me. In my
>> understanding, there is (almost) no use (for STW GCs) in starting a
>> collection before allocation space is exhausted. Which means, it seems
>> an obvious trigger to start collection on allocation failure. Yet, the
>> policies I'm looking at are time-based or time-and-space-based. I said
>> 'almost' because the single use for time-based collection would be
>> periodic GC that is able to knock out lingering garbage during
>> no/little-allocation phase of an application, and then only when the GC
>> is also uncommitting the resulting unused pages (which, afaics,
>> Substrate GC would do: bravo!). But that doesn't seem to be the point of
>> the time-based policies: it looks like the goal of those policies is to
>> balance time spent in young-vs-old-gen collections.?!
> 
> The young generation size is fixed, and a collection is started when
> this space is filled. So from that point of view, the starting trigger
> of a GC is always space based.
> 
> The policy whether to do an incremental (young) collection or a full
> collection is time based (but you can easily plug in any policy you
> want). The goal is to balance time between incremental and full
> collection. We certainly don't want to fill up the old generation
> because that is the maximum heap size, and it is by default provisioned
> to be very big.

Ok, I see.
Also, the way it's currently done, the old-gen needs to be able to
absorb (worst-case) all of young-gen in next cycle, and therefore needs
*plenty* of headroom. I.e. we need to collect old-gen much earlier than
when it's full (like when remaining free space in old-gen is smaller
than young-gen size). Alternatively, we could exhaust old-gen, and
change young-gen-collection-policy to skip collection if old-gen doesn't
have enough space left, and dive into full-GC right away. Or, even
better, add an intermediate tenuring generation. :-)

>> With a little bit of distance and squinting of eyes, one can see that
>> Substrate GC's young generation is really what is called 'nursery space'
>> elsewhere, which aims to reduce the rate at which objects get introduced
>> into young generation. And the old generation is really what is usually
>> called young generation elsewhere. What's missing is a true old
>> generation space.
> 
> Not really, because the young generation can be collected independently,
> i.e., there are the generational write barriers, remembered sets, ...
> 
> So the young generation is reduced to the nursery space, but I argue the
> old generation is really an old generation.

Ok.

>> Considering all this, I would like to propose some improvements:
>> - Introduce a notion of tenuring objects. I guess we need like 2 age
>> bits in the header or elsewhere for this. Do we have that room?
> 
> You don't need the age bits in the header. You can easily go from the
> object to the aligned chunk that the object is in (we do that all the
> time, for example for the write barrier to do the card marking), and
> store the age in the chunk header. Requiring all objects in one chunk to
> have the same age is not much of a limitation.
Right.

> Adding tenuring is definitely necessary to achieve reasonable GC
> performance.

+1

>> - Implement a true old-space (and rename the existing young to nursery
>> and old to young?). In my experience, sliding/mark-compact collection
>> using a mark bitmap works best for this: it tends to create a 'sediment'
>> of permanent/very-long-lived objects at the bottom which would never get
>> copied again. Using a bitmap, walking of live objects (e.g. during
>> copying, updating etc) would be very fast: much faster than walking
>> objects by their size.
> 
> A mark-and-compact old generation algorithm definitely makes sense.
> Again, the only reason why we don't have it yet is that no one had time
> to implement it.
> 
> Mark-and-compact is also great to reduce memory footprint. Right now,
> during GC the memory footprint can double because of the temporary space
> for copying.

Yeah. However, as Peter noted, having no contiguous memory block
complicates this. I'd need to see how to deal with it (per-chunk-bitmap
probably, or maybe mark bit in object header, with some clever tricks to
make scanning the heap fast like serial GC does).

>> - I am not totally sure about the policies. My current thinking is that
>> this needs some cleanup/straightening-out, or maybe I am
>> misunderstanding something there. I believe (fairly strongly) that
>> allocation failure is the single useful trigger for STW GC, and on top
>> of that an (optional) periodic GC trigger that would kick in after X
>> (milli)seconds no GC.
> 
> As I mentioned above, the GC trigger is allocation failure for the young
> generation.

Ok, good.

>> - Low-hanging-fruit improvement that could be done right now: allocate
>> large objects(arrays) straight into old-gen instead of copying them
>> around. Those are usually long-lived anyway, and copying them
>> back-and-forth just costs CPU time for no benefit. This will become even
>> more pronounced with a true old-gen.
> 
> Large arrays are allocated separately in unaligned chunks. Such arrays
> are never copied, but only logically moved from the young generation
> into the old generation. An unaligned chunk contains exactly one large
> array.

Ok, good.

>> Oh and a question: what's this pinned object/chunks/spaces all about?
> 
> There are two mechanisms right now to get objects that are never moved
> by the GC:
> 1) A "normal" object can be temporarily pinned using
> org.graalvm.nativeimage.PinnedObject. The current implementation then
> keeps the whole aligned chunk that contains the object alive, i.e., it
> is designed for pinnings that are released quickly so that no objects
> are actually ever pinned when the GC runs, unless the GC runs in an
> unlucky moments. We use such pinning for example to pass pointers into
> byte[] arrays directly to C functions without copying.
> 
> 2) A PinnedAllocator can be used to get objects that are non-moving for
> a long period of time. This is currently used for the metadata of
> runtime compiled code. We are actively working to make PinnedAllocator
> unnecessary by putting the metadata into C memory, and then hopefully we
> can remove PinnedAllocator and all code that is necessary for it in the
> GC, i.e., the notion of pinned spaces you mentioned before.

Ok, I guessed so. I mostly wondered about it because it's got from-space
and to-space:

    [pinnedFromSpace:
      aligned: 0/0 unaligned: 0/0]
    [pinnedToSpace:
      aligned: 0/0 unaligned: 0/0]]

And would it ever copy between them? I guess not.

>> What do you think about all this? Somebody else might have thought about
>> all this already, and have some insights that I don't have in my naive
>> understanding? Maybe some of it is already worked on or planned? Maybe
>> there are some big obstactles that I don't see yet, that make it less
>> feasible?
> 
> We certainly have ideas and plans, and they match your observations. If
> you are interested in contributing, we can definitely give you some
> guidance so that you immediately work into the right direction.

Yes, I am. :-)

Roman