Thoughts about SubstrateVM GC
Roman Kennke
rkennke at redhat.com
Fri Mar 1 20:33:24 UTC 2019
Hi Peter,
picking up on our comments here:
https://github.com/oracle/graal/pull/1015
It seems like there is a significant class of applications for which
SubstrateVM is attractive and that would be happy to do no GC at all.
Short-lived microservices come to mind. For those applications we would
realistically still want a GC, but only as a last resort. In other words,
a single-space heap that triggers a collection only when it is exhausted,
and then does sliding compaction, would do the job, and arguably better
than the current GC: it would avoid the barrier overhead (i.e. paying for
something that we bet will never or very rarely happen), it could use the
full space rather than dividing it up into generations and semispaces,
and, as you say, it could even eliminate the safepoint check overhead
(does SubstrateVM only do safepoints for GC? that would be cool!).
Even if such a heap *does* GC, it might still be better off overall:
objects would have more time to die, and collections would be rarer. With
the current SubstrateVM I see relatively frequent full GCs anyway, so
rarer GCs with a higher proportion of garbage, combined with increased
throughput, might actually work well for certain classes of applications,
especially those that would want to run on SubstrateVM anyway?
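To make the single-space idea concrete, here is a minimal sketch of what
I mean by a collect-only-on-exhaustion heap. All names are hypothetical
and nothing here is existing SubstrateVM code; it only illustrates the
shape of the idea:

    // Hypothetical sketch: bump-pointer allocation in a single space,
    // collection only as a last resort, no read/write barriers needed.
    final class SingleSpaceHeap {
        private final long base;   // start of the reserved heap range
        private final long limit;  // end of the reserved heap range
        private long top;          // bump pointer

        SingleSpaceHeap(long base, long size) {
            this.base = base;
            this.limit = base + size;
            this.top = base;
        }

        long allocate(long size) {
            if (top + size > limit) {
                // Last resort: stop the world and slide all live objects
                // towards 'base' (mark, compute forwarding, fix references,
                // move). Survivors end up as a dense prefix of the space.
                slidingCompaction();
                if (top + size > limit) {
                    throw new OutOfMemoryError("single-space heap exhausted");
                }
            }
            long result = top;
            top += size;
            return result;
        }

        private void slidingCompaction() {
            // sketch only: after compaction, 'top' points right behind the
            // last live object
        }
    }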
You commented that the GraalVM runtime compiler allocates a lot and
leaves behind lots of garbage: would that be a concern under
SubstrateVM's closed-world view?
WDYT?
Roman
> Two comments inline. And more encouragement to send along your ideas.
>
> ... peter
>
> On 03/ 1/19 02:16 AM, Roman Kennke wrote:
>> Hi Christian,
>>
>> thanks for your replies. This is very interesting. Some additional
>> comments below inline:
>>
>>>> The old generation is 2 semispaces (actually, 4 with the 2 pinned
>>>> spaces, which I'll ask about later). When collected, live objects get
>>>> scavenged back-and-forth between the two spaces.
>>>
>>> yes, in theory. In "reality" there is no contiguous memory for the
>>> spaces, so after a GC all the aligned chunks of the from-space are
>>> either returned to the OS or cached and immediately reused for young
>>> generation allocation.
>>
>> Aha, ok. This definitely affects future design decisions.
>>
>>
>>>> - The policy for when to start collecting seems a bit unclear to me. In my
>>>> understanding, there is (almost) no use (for STW GCs) in starting a
>>>> collection before the allocation space is exhausted. Which means allocation
>>>> failure seems like the obvious trigger to start a collection. Yet, the
>>>> policies I'm looking at are time-based or time-and-space-based. I said
>>>> 'almost' because the single use for time-based collection would be
>>>> periodic GC that is able to knock out lingering garbage during
>>>> a no/little-allocation phase of an application, and then only when the GC
>>>> is also uncommitting the resulting unused pages (which, afaics,
>>>> Substrate GC would do: bravo!). But that doesn't seem to be the
>>>> point of
>>>> the time-based policies: it looks like the goal of those policies is to
>>>> balance time spent in young-vs-old-gen collections.?!
>>>
>>> The young generation size is fixed, and a collection is started when
>>> this space is filled. So from that point of view, the starting trigger
>>> of a GC is always space based.
>>>
>>> The policy whether to do an incremental (young) collection or a full
>>> collection is time based (but you can easily plug in any policy you
>>> want). The goal is to balance time between incremental and full
>>> collection. We certainly don't want to fill up the old generation
>>> because that is the maximum heap size, and it is by default provisioned
>>> to be very big.
>>
>> Ok, I see.
>> Also, the way it's currently done, the old-gen needs to be able to
>> absorb (worst-case) all of young-gen in the next cycle, and therefore needs
>> *plenty* of headroom. I.e. we need to collect old-gen much earlier than
>> when it's full (like when remaining free space in old-gen is smaller
>> than young-gen size). Alternatively, we could exhaust old-gen, and
>> change young-gen-collection-policy to skip collection if old-gen doesn't
>> have enough space left, and dive into full-GC right away. Or, even
>> better, add an intermediate tenuring generation. :-)
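(In pseudo-policy form, the decision I have in mind is roughly the
following; the names are hypothetical and only illustrate the check:)

    // Hypothetical policy sketch: only attempt an incremental (young)
    // collection if the old generation can absorb a worst-case promotion
    // of the entire young generation; otherwise dive into a full
    // collection right away.
    final class HeadroomPolicy {
        static boolean collectIncrementally(long oldGenFreeBytes,
                                            long youngGenSizeBytes) {
            return oldGenFreeBytes >= youngGenSizeBytes;
        }
    }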
>
> There is no fixed heap size, or fixed generation sizes. As long as the
> collector can allocate memory from the OS it can keep adding chunks as
> needed to the old generation (or the young generation, for that matter.
> E.g., to delay collection until it is "convenient".) If you run out of
> address space, or physical memory, then you are in trouble.
>
>>
>>
>>>> With a little bit of distance and squinting of eyes, one can see that
>>>> Substrate GC's young generation is really what is called 'nursery
>>>> space'
>>>> elsewhere, which aims to reduce the rate at which objects get
>>>> introduced
>>>> into the young generation. And the old generation is really what is usually
>>>> called young generation elsewhere. What's missing is a true old
>>>> generation space.
>>>
>>> Not really, because the young generation can be collected independently,
>>> i.e., there are the generational write barriers, remembered sets, ...
>>>
>>> So the young generation is reduced to the nursery space, but I argue the
>>> old generation is really an old generation.
>>
>> Ok.
>>
>>>> Considering all this, I would like to propose some improvements:
>>>> - Introduce a notion of tenuring objects. I guess we need like 2 age
>>>> bits in the header or elsewhere for this. Do we have that room?
>>>
>>> You don't need the age bits in the header. You can easily go from the
>>> object to the aligned chunk that the object is in (we do that all the
>>> time, for example for the write barrier to do the card marking), and
>>> store the age in the chunk header. Requiring all objects in one chunk to
>>> have the same age is not much of a limitation.
>> Right.
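(To spell out how I picture the age-in-chunk-header idea, a rough sketch;
the chunk size, field layout and names are assumptions, not actual
SubstrateVM code:)

    // Hypothetical sketch: derive the aligned-chunk start from an object's
    // address (same trick as card marking) and keep the age in the chunk
    // header, so no per-object header bits are needed.
    final class AlignedChunkAges {
        static final long CHUNK_SIZE = 1L << 20;           // assumed 1 MB aligned chunks
        static final long CHUNK_MASK = ~(CHUNK_SIZE - 1);
        static final int TENURING_THRESHOLD = 4;           // assumed threshold

        static long chunkOf(long objectAddress) {
            return objectAddress & CHUNK_MASK;
        }

        static boolean shouldTenure(long objectAddress) {
            // all objects in a chunk share the same age
            return readChunkAge(chunkOf(objectAddress)) >= TENURING_THRESHOLD;
        }

        static int readChunkAge(long chunkStart) {
            // stub: would read an int at a fixed offset in the chunk header
            return 0;
        }
    }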
>>
>>> Adding tenuring is definitely necessary to achieve reasonable GC
>>> performance.
>>
>> +1
>>
>>>> - Implement a true old-space (and rename the existing young to nursery
>>>> and old to young?). In my experience, sliding/mark-compact collection
>>>> using a mark bitmap works best for this: it tends to create a
>>>> 'sediment'
>>>> of permanent/very-long-lived objects at the bottom which would never
>>>> get
>>>> copied again. Using a bitmap, walking live objects (e.g. during
>>>> copying, updating, etc.) would be very fast: much faster than walking
>>>> objects by their size.
>>>
>>> A mark-and-compact old generation algorithm definitely makes sense.
>>> Again, the only reason why we don't have it yet is that no one had time
>>> to implement it.
>>>
>>> Mark-and-compact is also great to reduce memory footprint. Right now,
>>> during GC the memory footprint can double because of the temporary space
>>> for copying.
>>
>> Yeah. However, as Peter noted, having no contiguous memory block
>> complicates this. I'd need to see how to deal with it (a per-chunk bitmap,
>> probably, or maybe a mark bit in the object header, with some clever tricks
>> to make scanning the heap fast, like the serial GC does).
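(For example, a per-chunk mark bitmap could look roughly like the
following hypothetical sketch, with one mark bit per 8-byte word; walking
live objects then means scanning for set bits instead of parsing every
object's size:)

    // Hypothetical sketch: mark bitmap covering one aligned chunk.
    final class ChunkMarkBitmap {
        private static final int WORD_SIZE = 8;
        private final long chunkStart;
        private final long[] bits;   // one bit per word in the chunk

        ChunkMarkBitmap(long chunkStart, long chunkSize) {
            this.chunkStart = chunkStart;
            this.bits = new long[(int) (chunkSize / WORD_SIZE / 64)];
        }

        void mark(long objectAddress) {
            long wordIndex = (objectAddress - chunkStart) / WORD_SIZE;
            bits[(int) (wordIndex >>> 6)] |= 1L << (wordIndex & 63);
        }

        boolean isMarked(long objectAddress) {
            long wordIndex = (objectAddress - chunkStart) / WORD_SIZE;
            return (bits[(int) (wordIndex >>> 6)] & (1L << (wordIndex & 63))) != 0;
        }
    }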
>>
>>>> - I am not totally sure about the policies. My current thinking is that
>>>> this needs some cleanup/straightening-out, or maybe I am
>>>> misunderstanding something there. I believe (fairly strongly) that
>>>> allocation failure is the single useful trigger for STW GC, and on top
>>>> of that an (optional) periodic GC trigger that would kick in after X
>>>> (milli)seconds without a GC.
>>>
>>> As I mentioned above, the GC trigger is allocation failure for the young
>>> generation.
>>
>> Ok, good.
>>
>>>> - Low-hanging-fruit improvement that could be done right now: allocate
>>>> large objects (arrays) straight into old-gen instead of copying them
>>>> around. Those are usually long-lived anyway, and copying them
>>>> back-and-forth just costs CPU time for no benefit. This will become
>>>> even
>>>> more pronounced with a true old-gen.
>>>
>>> Large arrays are allocated separately in unaligned chunks. Such arrays
>>> are never copied, but only logically moved from the young generation
>>> into the old generation. An unaligned chunk contains exactly one large
>>> array.
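(Just to note how I picture such a "logical move": since a space is a
list of chunks rather than an address range, moving a large array is only
a matter of relinking its unaligned chunk; no bytes are copied. A sketch
with hypothetical names:)

    import java.util.ArrayDeque;

    // Hypothetical sketch: relink an unaligned chunk from one space's list
    // to another's; the large array inside keeps its address.
    final class Space {
        private final ArrayDeque<UnalignedChunk> unalignedChunks = new ArrayDeque<>();

        void promoteTo(UnalignedChunk chunk, Space oldGen) {
            unalignedChunks.remove(chunk);       // unlink from this (young) space
            oldGen.unalignedChunks.add(chunk);   // link into the old space
        }
    }

    final class UnalignedChunk {
        // holds exactly one large array plus bookkeeping in its header
    }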
>>
>> Ok, good.
>>
>>>> Oh, and a question: what are these pinned objects/chunks/spaces all about?
>>>
>>> There are two mechanisms right now to get objects that are never moved
>>> by the GC:
>>> 1) A "normal" object can be temporarily pinned using
>>> org.graalvm.nativeimage.PinnedObject. The current implementation then
>>> keeps the whole aligned chunk that contains the object alive, i.e., it
>>> is designed for pinnings that are released quickly so that no objects
>>> are actually ever pinned when the GC runs, unless the GC runs in an
>>> unlucky moments. We use such pinning for example to pass pointers into
>>> byte[] arrays directly to C functions without copying.
>>>
>>> 2) A PinnedAllocator can be used to get objects that are non-moving for
>>> a long period of time. This is currently used for the metadata of
>>> runtime compiled code. We are actively working to make PinnedAllocator
>>> unnecessary by putting the metadata into C memory, and then hopefully we
>>> can remove PinnedAllocator and all code that is necessary for it in the
>>> GC, i.e., the notion of pinned spaces you mentioned before.
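(As an aside, my understanding of how mechanism 1 is used from the
application side, based on my reading of the org.graalvm.nativeimage.PinnedObject
API; treat the exact method names as approximate:)

    import org.graalvm.nativeimage.PinnedObject;
    import org.graalvm.nativeimage.c.type.CCharPointer;

    final class PinnedBufferExample {
        // Pin a byte[] just long enough to hand its raw address to native
        // code. While the PinnedObject is open, the GC keeps the aligned
        // chunk containing 'buffer' in place, so the pointer stays valid.
        static void passToC(byte[] buffer) {
            try (PinnedObject pin = PinnedObject.create(buffer)) {
                CCharPointer raw = pin.addressOfArrayElement(0);
                // ... call a C function with 'raw' and buffer.length ...
            }
        }
    }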
>>
>> Ok, I guessed so. I mostly wondered about it because it's got from-space
>> and to-space:
>>
>> [pinnedFromSpace:
>> aligned: 0/0 unaligned: 0/0]
>> [pinnedToSpace:
>> aligned: 0/0 unaligned: 0/0]]
>>
>> And would it ever copy between them? I guess not.
>
> The collector logically moves a pinned chunk from pinned from-space to
> pinned to-space by updating bookkeeping information in the chunk. The
> contents of the pinned chunk are not moved, and their addresses do not
> change. If a pinned chunk is unpinned by the application, it is moved
> to the unpinned from-space and at the next full collection the reachable
> objects in it are scavenged to the unpinned to-space, like any other
> objects in unpinned from-space. Between collections the pinned to-space
> is empty. In your example, the pinned from-space is also empty. Spaces
> do not represent address ranges, so an empty space is just a few null
> pointers in the space data structure. (Spaces not being address ranges
> also complicates answering questions like: Is this object in the young
> generation?)
>
> ... peter
>
>>
>>>> What do you think about all this? Somebody else might have thought
>>>> about
>>>> all this already, and have some insights that I don't have in my naive
>>>> understanding? Maybe some of it is already worked on or planned? Maybe
>>>> there are some big obstacles that I don't see yet that make it less
>>>> feasible?
>>>
>>> We certainly have ideas and plans, and they match your observations. If
>>> you are interested in contributing, we can definitely give you some
>>> guidance so that you immediately work in the right direction.
>>
>> Yes, I am. :-)
>>
>> Roman
>>
>