Thoughts about SubstrateVM GC

Fri Mar 1 22:13:09 UTC 2019

I think we need requirements from users, representative benchmarks or real applications, inspired ideas, measurements of alternatives, and time to write garbage collectors. :-)

So far, SubstrateVM has explored one point in the space, and we know that one size does not fit all when it comes to garbage collection.

			... peter

On 03/ 1/19 12:33 PM, Roman Kennke wrote:
> Hi Peter,
> 
> picking up on our comments here:
> 
> https://github.com/oracle/graal/pull/1015
> 
> It seems like there is a significant class of applications that would be
> attractive with SubstrateVM that would be happy to do no GC at all.
> Shortlived microservices come to mind. For those applications, we'd
> realistically do want a GC, but only as last resort. In other words, a
> single-space heap, that would trigger a collection only when exhausted,
> and then do sliding compaction, would do it, and arguably better than
> the current GC: it would not require barriers overhead (i.e. pay for
> something that we bet would never or very rarely happen), it would be
> able to use the full space, rather than dividing up in generations and
> semispaces, and as you say, even eliminate the safepoint checks overhead
> (does SubstrateVM only do safepoints for GC? that would be cool!).
> 
> Even if it *does* GC, it might still be better off overall: objects
> would have more time to die, GC frequency would be rarer. With current
> SubstrateVM I see relatively frequent full-GCs anyway, so rarer GCs with
> more relative garbage, combined with increased throughput: I guess that
> might actually work well for certain classes of applications, especially
> those that would want to run on SubstrateVM anyway?
> 
> You commented about GraalVM runtime compiler allocating a lot and
> leaving behind lots of garbage: would that be a concern in SustrateVM's
> closed world view?
> 
> WDYT?
> 
> Roman
> 
>> Two comments inline.  And more encouragement to send along your ideas.
>>
>>              ... peter
>>
>> On 03/ 1/19 02:16 AM, Roman Kennke wrote:
>>> Hi Christian,
>>>
>>> thanks for your replies. This is very interesting. Some additional
>>> comments below inline:
>>>
>>>>> The old generation is 2 semispaces (actually, 4 with the 2 pinned
>>>>> spaces, which I'll ask about later). When collected, live objects get
>>>>> scavenged back-and-forth between the two spaces.
>>>>
>>>> yes, in theory. In "reality" there is no contiguous memory for the
>>>> spaces, so after a GC all the aligned chunks of the from-space are
>>>> either returned to the OS or cached and immediately reused for young
>>>> generation allocation.
>>>
>>> Aha, ok. This definitely affects future design decisions.
>>>
>>>
>>>>> - The policy when to start collecting seems a bit unclear to me. In my
>>>>> understanding, there is (almost) no use (for STW GCs) in starting a
>>>>> collection before allocation space is exhausted. Which means, it seems
>>>>> an obvious trigger to start collection on allocation failure. Yet, the
>>>>> policies I'm looking at are time-based or time-and-space-based. I said
>>>>> 'almost' because the single use for time-based collection would be
>>>>> periodic GC that is able to knock out lingering garbage during
>>>>> no/little-allocation phase of an application, and then only when the GC
>>>>> is also uncommitting the resulting unused pages (which, afaics,
>>>>> Substrate GC would do: bravo!). But that doesn't seem to be the
>>>>> point of
>>>>> the time-based policies: it looks like the goal of those policies is to
>>>>> balance time spent in young-vs-old-gen collections.?!
>>>>
>>>> The young generation size is fixed, and a collection is started when
>>>> this space is filled. So from that point of view, the starting trigger
>>>> of a GC is always space based.
>>>>
>>>> The policy whether to do an incremental (young) collection or a full
>>>> collection is time based (but you can easily plug in any policy you
>>>> want). The goal is to balance time between incremental and full
>>>> collection. We certainly don't want to fill up the old generation
>>>> because that is the maximum heap size, and it is by default provisioned
>>>> to be very big.
>>>
>>> Ok, I see.
>>> Also, the way it's currently done, the old-gen needs to be able to
>>> absorb (worst-case) all of young-gen in next cycle, and therefore needs
>>> *plenty* of headroom. I.e. we need to collect old-gen much earlier than
>>> when it's full (like when remaining free space in old-gen is smaller
>>> than young-gen size). Alternatively, we could exhaust old-gen, and
>>> change young-gen-collection-policy to skip collection if old-gen doesn't
>>> have enough space left, and dive into full-GC right away. Or, even
>>> better, add an intermediate tenuring generation. :-)
>>
>> There is no fixed heap size, or fixed generation sizes.  As long as the
>> collector can allocate memory from the OS it can keep adding chunks as
>> needed to the old generation (or the young generation, for that matter.
>> E.g., to delay collection until it is "convenient".)  If you run out of
>> address space, or physical memory, then you are in trouble.
>>
>>>
>>>
>>>>> With a little bit of distance and squinting of eyes, one can see that
>>>>> Substrate GC's young generation is really what is called 'nursery
>>>>> space'
>>>>> elsewhere, which aims to reduce the rate at which objects get
>>>>> introduced
>>>>> into young generation. And the old generation is really what is usually
>>>>> called young generation elsewhere. What's missing is a true old
>>>>> generation space.
>>>>
>>>> Not really, because the young generation can be collected independently,
>>>> i.e., there are the generational write barriers, remembered sets, ...
>>>>
>>>> So the young generation is reduced to the nursery space, but I argue the
>>>> old generation is really an old generation.
>>>
>>> Ok.
>>>
>>>>> Considering all this, I would like to propose some improvements:
>>>>> - Introduce a notion of tenuring objects. I guess we need like 2 age
>>>>> bits in the header or elsewhere for this. Do we have that room?
>>>>
>>>> You don't need the age bits in the header. You can easily go from the
>>>> object to the aligned chunk that the object is in (we do that all the
>>>> time, for example for the write barrier to do the card marking), and
>>>> store the age in the chunk header. Requiring all objects in one chunk to
>>>> have the same age is not much of a limitation.
>>> Right.
>>>
>>>> Adding tenuring is definitely necessary to achieve reasonable GC
>>>> performance.
>>>
>>> +1
>>>
>>>>> - Implement a true old-space (and rename the existing young to nursery
>>>>> and old to young?). In my experience, sliding/mark-compact collection
>>>>> using a mark bitmap works best for this: it tends to create a
>>>>> 'sediment'
>>>>> of permanent/very-long-lived objects at the bottom which would never
>>>>> get
>>>>> copied again. Using a bitmap, walking of live objects (e.g. during
>>>>> copying, updating etc) would be very fast: much faster than walking
>>>>> objects by their size.
>>>>
>>>> A mark-and-compact old generation algorithm definitely makes sense.
>>>> Again, the only reason why we don't have it yet is that no one had time
>>>> to implement it.
>>>>
>>>> Mark-and-compact is also great to reduce memory footprint. Right now,
>>>> during GC the memory footprint can double because of the temporary space
>>>> for copying.
>>>
>>> Yeah. However, as Peter noted, having no contiguous memory block
>>> complicates this. I'd need to see how to deal with it (per-chunk-bitmap
>>> probably, or maybe mark bit in object header, with some clever tricks to
>>> make scanning the heap fast like serial GC does).
>>>
>>>>> - I am not totally sure about the policies. My current thinking is that
>>>>> this needs some cleanup/straightening-out, or maybe I am
>>>>> misunderstanding something there. I believe (fairly strongly) that
>>>>> allocation failure is the single useful trigger for STW GC, and on top
>>>>> of that an (optional) periodic GC trigger that would kick in after X
>>>>> (milli)seconds no GC.
>>>>
>>>> As I mentioned above, the GC trigger is allocation failure for the young
>>>> generation.
>>>
>>> Ok, good.
>>>
>>>>> - Low-hanging-fruit improvement that could be done right now: allocate
>>>>> large objects(arrays) straight into old-gen instead of copying them
>>>>> around. Those are usually long-lived anyway, and copying them
>>>>> back-and-forth just costs CPU time for no benefit. This will become
>>>>> even
>>>>> more pronounced with a true old-gen.
>>>>
>>>> Large arrays are allocated separately in unaligned chunks. Such arrays
>>>> are never copied, but only logically moved from the young generation
>>>> into the old generation. An unaligned chunk contains exactly one large
>>>> array.
>>>
>>> Ok, good.
>>>
>>>>> Oh and a question: what's this pinned object/chunks/spaces all about?
>>>>
>>>> There are two mechanisms right now to get objects that are never moved
>>>> by the GC:
>>>> 1) A "normal" object can be temporarily pinned using
>>>> org.graalvm.nativeimage.PinnedObject. The current implementation then
>>>> keeps the whole aligned chunk that contains the object alive, i.e., it
>>>> is designed for pinnings that are released quickly so that no objects
>>>> are actually ever pinned when the GC runs, unless the GC runs in an
>>>> unlucky moments. We use such pinning for example to pass pointers into
>>>> byte[] arrays directly to C functions without copying.
>>>>
>>>> 2) A PinnedAllocator can be used to get objects that are non-moving for
>>>> a long period of time. This is currently used for the metadata of
>>>> runtime compiled code. We are actively working to make PinnedAllocator
>>>> unnecessary by putting the metadata into C memory, and then hopefully we
>>>> can remove PinnedAllocator and all code that is necessary for it in the
>>>> GC, i.e., the notion of pinned spaces you mentioned before.
>>>
>>> Ok, I guessed so. I mostly wondered about it because it's got from-space
>>> and to-space:
>>>
>>>       [pinnedFromSpace:
>>>         aligned: 0/0 unaligned: 0/0]
>>>       [pinnedToSpace:
>>>         aligned: 0/0 unaligned: 0/0]]
>>>
>>> And would it ever copy between them? I guess not.
>>
>> The collector logically moves a pinned chunk from pinned from-space to
>> pinned to-space by updating bookkeeping information in the chunk.  The
>> contents of the pinned chunk are not moved, and their addresses do not
>> change.  If a pinned chunk is unpinned by the application, it is moved
>> to the unpinned from-space and at the next full collection the reachable
>> objects in it are scavenged to the unpinned to-space, like any other
>> objects in unpinned from-space.  Between collections the pinned to-space
>> is empty.  In your example, the pinned from-space is also empty.  Spaces
>> do not represent address ranges, so an empty space is just a few null
>> pointers in the space data structure.  (Spaces not being address ranges
>> also complicates answer questions like: Is this object in the young
>> generation.)
>>
>>              ... peter
>>
>>>
>>>>> What do you think about all this? Somebody else might have thought
>>>>> about
>>>>> all this already, and have some insights that I don't have in my naive
>>>>> understanding? Maybe some of it is already worked on or planned? Maybe
>>>>> there are some big obstactles that I don't see yet, that make it less
>>>>> feasible?
>>>>
>>>> We certainly have ideas and plans, and they match your observations. If
>>>> you are interested in contributing, we can definitely give you some
>>>> guidance so that you immediately work into the right direction.
>>>
>>> Yes, I am. :-)
>>>
>>> Roman
>>>
>>
>