Thoughts about SubstrateVM GC

Thu Feb 28 22:09:34 UTC 2019

Hi Roman,

Thanks for your interest in Substrate VM!

On 2/28/19 12:59, Roman Kennke wrote:
> Hello all,
> 
> I hope this is the right mailing list to discuss SubstrateVM? If not,
> please redirect me.

There is no better list, at least at this time.

> During the last couple of days, I did have a closer look at
> SubstrateVM's GC, and also did some experiments. I would like to
> summarize what I found (so that you can correct me if I'm wrong), and
> make a case for some improvements that I would like to work on.
> 
> Here's my findings so far:
> Substrate GC is a 2-generation, STW, single-threaded GC.

yes

> The young generation is a single space. When collected, all live objects
> get scavenged out, directly into the old generation.

yes

> The old generation is 2 semispaces (actually, 4 with the 2 pinned
> spaces, which I'll ask about later). When collected, live objects get
> scavenged back-and-forth between the two spaces.

yes, in theory. In "reality" there is no contiguous memory for the 
spaces, so after a GC all the aligned chunks of the from-space are 
either returned to the OS or cached and immediately reused for young 
generation allocation.

> Is that correct so far?
> 
> It seemed a bit weird at first to write a Java GC in Java language. :-)

It makes some things easier, e.g., the object layout code used by the GC 
can immediately be used in other parts of the VM and the compiler. But 
in the end there is C-style memory access of course to actually process 
the objects, and that code is more verbose in Java.

> I analyzed a bit of generated assembly code, comparing it side-by-side
> with corresponding Java code, and was actually quite impressed by it.
> It's also got room for improvements, but that was not the major
> bottleneck. The single major bottleneck in my experiments was waiting
> for loads of the mark word during scavenging, in other words, it's doing
> way too much of it ;-)
> 
> I have noticed a bunch of problems so far:
> - The promotion rate between young-gen and old-gen seems fairly hot.
> This is because there is no notion of tenuring objects or so.
> - This implies that there are relatively many old-gen collections
> happening, which seriously affect application throughput (once they happen)
> - Because of the above, the usual wisdoms from other GCs don't apply: I
> could get significant improvements (i.e. fewer diving into full-GCs) by
> configuring a small young-gen (like 10%) and large old-gen (like 90%).
> But that's not really great either.

Our "design goal" was to start out with the simplest possible GC 
implementation that is viable (i.e., generational). That was the 
starting point, and we have not had time yet to do something better. But 
we know something better is certainly needed.

> - The policy when to start collecting seems a bit unclear to me. In my
> understanding, there is (almost) no use (for STW GCs) in starting a
> collection before allocation space is exhausted. Which means, it seems
> an obvious trigger to start collection on allocation failure. Yet, the
> policies I'm looking at are time-based or time-and-space-based. I said
> 'almost' because the single use for time-based collection would be
> periodic GC that is able to knock out lingering garbage during
> no/little-allocation phase of an application, and then only when the GC
> is also uncommitting the resulting unused pages (which, afaics,
> Substrate GC would do: bravo!). But that doesn't seem to be the point of
> the time-based policies: it looks like the goal of those policies is to
> balance time spent in young-vs-old-gen collections.?!

The young generation size is fixed, and a collection is started when 
this space is filled. So from that point of view, the starting trigger 
of a GC is always space based.

The policy whether to do an incremental (young) collection or a full 
collection is time based (but you can easily plug in any policy you 
want). The goal is to balance time between incremental and full 
collection. We certainly don't want to fill up the old generation 
because that is the maximum heap size, and it is by default provisioned 
to be very big.

> With a little bit of distance and squinting of eyes, one can see that
> Substrate GC's young generation is really what is called 'nursery space'
> elsewhere, which aims to reduce the rate at which objects get introduced
> into young generation. And the old generation is really what is usually
> called young generation elsewhere. What's missing is a true old
> generation space.

Not really, because the young generation can be collected independently, 
i.e., there are the generational write barriers, remembered sets, ...

So the young generation is reduced to the nursery space, but I argue the 
old generation is really an old generation.

> Considering all this, I would like to propose some improvements:
> - Introduce a notion of tenuring objects. I guess we need like 2 age
> bits in the header or elsewhere for this. Do we have that room?

You don't need the age bits in the header. You can easily go from the 
object to the aligned chunk that the object is in (we do that all the 
time, for example for the write barrier to do the card marking), and 
store the age in the chunk header. Requiring all objects in one chunk to 
have the same age is not much of a limitation.

Adding tenuring is definitely necessary to achieve reasonable GC 
performance.

> - Implement a true old-space (and rename the existing young to nursery
> and old to young?). In my experience, sliding/mark-compact collection
> using a mark bitmap works best for this: it tends to create a 'sediment'
> of permanent/very-long-lived objects at the bottom which would never get
> copied again. Using a bitmap, walking of live objects (e.g. during
> copying, updating etc) would be very fast: much faster than walking
> objects by their size.

A mark-and-compact old generation algorithm definitely makes sense. 
Again, the only reason why we don't have it yet is that no one had time 
to implement it.

Mark-and-compact is also great to reduce memory footprint. Right now, 
during GC the memory footprint can double because of the temporary space 
for copying.

> - I am not totally sure about the policies. My current thinking is that
> this needs some cleanup/straightening-out, or maybe I am
> misunderstanding something there. I believe (fairly strongly) that
> allocation failure is the single useful trigger for STW GC, and on top
> of that an (optional) periodic GC trigger that would kick in after X
> (milli)seconds no GC.

As I mentioned above, the GC trigger is allocation failure for the young 
generation.

> - Low-hanging-fruit improvement that could be done right now: allocate
> large objects(arrays) straight into old-gen instead of copying them
> around. Those are usually long-lived anyway, and copying them
> back-and-forth just costs CPU time for no benefit. This will become even
> more pronounced with a true old-gen.

Large arrays are allocated separately in unaligned chunks. Such arrays 
are never copied, but only logically moved from the young generation 
into the old generation. An unaligned chunk contains exactly one large 
array.

> Oh and a question: what's this pinned object/chunks/spaces all about?

There are two mechanisms right now to get objects that are never moved 
by the GC:
1) A "normal" object can be temporarily pinned using 
org.graalvm.nativeimage.PinnedObject. The current implementation then 
keeps the whole aligned chunk that contains the object alive, i.e., it 
is designed for pinnings that are released quickly so that no objects 
are actually ever pinned when the GC runs, unless the GC runs in an 
unlucky moments. We use such pinning for example to pass pointers into 
byte[] arrays directly to C functions without copying.

2) A PinnedAllocator can be used to get objects that are non-moving for 
a long period of time. This is currently used for the metadata of 
runtime compiled code. We are actively working to make PinnedAllocator 
unnecessary by putting the metadata into C memory, and then hopefully we 
can remove PinnedAllocator and all code that is necessary for it in the 
GC, i.e., the notion of pinned spaces you mentioned before.

> What do you think about all this? Somebody else might have thought about
> all this already, and have some insights that I don't have in my naive
> understanding? Maybe some of it is already worked on or planned? Maybe
> there are some big obstactles that I don't see yet, that make it less
> feasible?

We certainly have ideas and plans, and they match your observations. If 
you are interested in contributing, we can definitely give you some 
guidance so that you immediately work into the right direction.

-Christian