JEP 189: Shenandoah: An Ultra-Low-Pause-Time Garbage Collector

Thomas Schatzl thomas.schatzl at oracle.com
Tue Jan 21 09:38:34 UTC 2014


Hi Christine,

  great to see somebody hacking GCs in HotSpot. Welcome :)

On Mon, 2014-01-20 at 16:17 -0500, Christine Flood wrote:
> 
> ----- Original Message -----
> > From: "Kirk Pepperdine" <kirk at kodewerk.com>
> > Which we all happily grant :-)
> > 
> > BTW, the history of G1 is that it started without a generational perturbation,
> > but it was quickly determined that the gains to be made from having a
> > separate nursery were too great to ignore. IBM also tried to keep a single
> > space for quite some time but eventually backed out to a generational
> > arrangement. My question is: what is in this collector that makes you
> > believe you can ignore the weak generational hypothesis?
> > 
> 
> The benefit of generational systems is as much about remembered set
> size as it is about object lifetimes.  Generational G1 is a win
> because you don't need to keep track of pointers from young regions to
> older regions.  The young regions are collected at every GC pause and
> will be traced then.

It's a simple, reasonably robust heuristic that tells G1 where it can
reclaim the most memory in an efficient way, without needing a full
liveness analysis or other means of determining object death. Those are
not free either.

The remembered set savings (and data structure and runtime savings in
general) are secondary imo. They are a nice additional benefit, but
since the main point of a GC is to reclaim memory, it seems beneficial
to look first where there is most likely a lot of unused space.

Also, the number of references from old to young regions can be large.

After introducing this heuristic and knowing that it is here to stay,
you will certainly try to exploit it as much as possible, won't you?

Some ideas off the top of my head for saving additional remembered set
space: you could arbitrarily group regions into more "young gens", or
lazily create the remembered sets for regions as needed (again, using
the marking thread). Or use a different remembered set implementation
with a different memory usage/performance tradeoff.

That would be something interesting to explore, as the remembered set
itself can be made as small or large as you desire, depending on the
cost/benefit tradeoff you want to make and what your other goals are.
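
For illustration, here is a toy sketch of what such an implementation
tradeoff could look like (plain Java, with invented names; this is not
HotSpot's actual data structure): a precise variant that remembers
individual dirty cards, and a coarse bitmap variant whose memory use is
fixed but which shifts more scanning work to evacuation time.

import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Toy remembered set for one heap region: tracks which cards may hold
// pointers into the region. Purely illustrative; not HotSpot code.
interface ToyRememberedSet {
    void recordCard(int cardIndex);    // a pointer into our region was written here
    boolean mustScan(int cardIndex);   // does evacuation have to scan this card?
}

// Precise variant: stores each card index. Evacuation only visits the
// recorded cards, but memory grows with the number of distinct cards
// that point into the region.
class SparseRememberedSet implements ToyRememberedSet {
    private final Set<Integer> cards = new HashSet<Integer>();
    public void recordCard(int cardIndex)  { cards.add(cardIndex); }
    public boolean mustScan(int cardIndex) { return cards.contains(cardIndex); }
}

// Coarse variant: one bit per card across the whole heap. Memory use is
// small and fixed, but a popular region pays with a large bitmap walk
// (a per-source-region bit would be coarser still, forcing whole
// regions to be scanned).
class BitmapRememberedSet implements ToyRememberedSet {
    private final BitSet cards;
    BitmapRememberedSet(int totalCards)    { cards = new BitSet(totalCards); }
    public void recordCard(int cardIndex)  { cards.set(cardIndex); }
    public boolean mustScan(int cardIndex) { return cards.get(cardIndex); }
}

Lazily materializing such a structure only for regions that are actual
collection candidates (e.g. from the marking thread, as suggested
above) is then mostly a policy question, not a structural one.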

Replacing the RS implementation should be reasonably simple for any
programmer with some introduction to the code, at least compared to
introducing a new GC :) It is just maybe not so simple to get the same
performance across a large set of applications. I am sure, though, that
there is much room for improvement.

The problem in that area, at least in G1, is imo more a matter of
getting the selection policy right for the regions you want to keep the
remembered set for at a particular moment. Shenandoah will need to make
similar selection decisions. Maybe this is somewhat easier there because
you have a complete overview of liveness after marking. (But you can
continuously run concurrent markings there too - at a cost in
throughput.)

> Our algorithm doesn't require remembered sets because reference
> updates happen lazily.  We don't need to be able to find every
> reference to an object and update it, we can wait for the concurrent
> marking thread to find those references and update them for us.  This
> gives us a different cost/benefit analysis.  I believe Shenandoah
> fulfills the promise of G1

Shenandoah trades away remembered sets and pauses by intentionally
slowing down the mutator with "complex" read/write barriers and the
additional indirection of the Brooks pointers. There is also the
increased memory footprint that a user needs to be aware of.
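
For readers not familiar with the scheme, here is a minimal sketch of
the Brooks indirection in plain Java (field and method names are
invented for the example; the real thing lives in the object header and
in JIT-compiled barrier code, not in Java):

// Every object carries a forwarding pointer that normally points to
// the object itself; during evacuation it is swung to the new copy.
class BrooksObject {
    volatile BrooksObject forwardee = this;
    int payload;                        // stand-in for the object's fields
}

class ToyBarriers {
    // Read barrier: every access first follows the indirection, so the
    // mutator transparently sees the to-space copy after evacuation.
    // This is the extra load on every object access mentioned above.
    static BrooksObject resolve(BrooksObject obj) {
        return obj.forwardee;
    }

    static int readPayload(BrooksObject obj) {
        return resolve(obj).payload;
    }

    // Concurrent evacuation: copy the object, then swing the forwarding
    // pointer so that all later (barriered) accesses use the copy. A
    // real collector needs an atomic CAS here; this sketch ignores races.
    static void evacuate(BrooksObject obj) {
        BrooksObject copy = new BrooksObject();
        copy.payload = obj.payload;
        obj.forwardee = copy;
    }
}

Because the mutator never needs every stale reference fixed up eagerly,
no remembered set is needed to find them - which is exactly the
lazy-update property described in the quoted paragraph.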

The Brooks pointer is an additional pointer per object (you mentioned
heaps of >60G, so I will assume 8 bytes per pointer) - at an average
object size of maybe 30-40 bytes (just a guess, which should be roughly
correct), this increases the footprint of the application by 15-20%.
That increase alone will, I assume, impact memory access performance
(caching, bandwidth).
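
Spelling out that back-of-the-envelope calculation (using only the
guessed figures above): one extra 8-byte word per object of average
size s adds 8/s relative to the original heap, or 8/(s+8) measured as a
share of the grown heap:

\[
\frac{8}{30} \approx 27\%, \quad \frac{8}{40} = 20\%
\qquad\text{vs.}\qquad
\frac{8}{38} \approx 21\%, \quad \frac{8}{48} \approx 17\%
\]

Either way of counting, it is a permanent overhead of roughly a fifth
of the heap, before any cache and bandwidth effects.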

So it's really just a different tradeoff that Shenandoah makes. I do
think everyone is already really interested in how this tradeoff looks
at the moment. :)

Not that the G1 remembered set is particularly small :), but for one it
is less than that, and as mentioned above there is a lot of potential
for memory savings. Shenandoah cannot do away with the Brooks pointers.

Otoh G1 has a few pauses (which we are working on greatly reducing at
the moment).

> by concentrating GC effort where the garbage is, whether or not that
> is in young regions. There is no distinction in Shenandoah between
> allocating in a separate eden and allocating in an empty region, you
> would perform exactly the same work.

I can imagine this was how the non-generational G1 initially worked too.
It did not end up having the expected performance; or at least, it has
been shown to be extremely beneficial, for a few reasons, to collect the
nursery regions by default. Probably mainly because G1 would have
selected the regions it recently allocated into anyway.

As you mention, Shenandoah will/should likely end up doing the same
automatically.

Again, note that young gen in G1 is really mostly about selecting the
"best" regions for reclamation. You may notice most code in G1 is only
interested in the collection set, not the young gen.
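
A toy sketch of that selection idea (plain Java; the cost model is
invented for the example, and G1's actual policy is considerably more
involved): rank regions by reclaimable space per unit of predicted
evacuation cost and take the best ones.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Toy "collect where the garbage is" policy. Not G1's real code.
class ToyCollectionSetChooser {
    static class Region {
        final long liveBytes;
        final long regionBytes;
        Region(long liveBytes, long regionBytes) {
            this.liveBytes = liveBytes;
            this.regionBytes = regionBytes;
        }
        long garbageBytes()    { return regionBytes - liveBytes; }
        // Copying cost is dominated by the live data to be evacuated.
        double predictedCost() { return 1.0 + liveBytes; }
        double efficiency()    { return garbageBytes() / predictedCost(); }
    }

    // Pick the n regions whose reclamation looks most efficient.
    static List<Region> chooseCollectionSet(List<Region> regions, int n) {
        List<Region> sorted = new ArrayList<Region>(regions);
        Collections.sort(sorted, new Comparator<Region>() {
            public int compare(Region a, Region b) {
                return Double.compare(b.efficiency(), a.efficiency());
            }
        });
        return sorted.subList(0, Math.min(n, sorted.size()));
    }
}

Under such a ranking, freshly allocated regions with little live data
win almost automatically - which is why a nursery falls out of the
policy even without being wired in explicitly.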

> > As an aside, I have no bias for generational collectors but I have seen
> > places where having a continuously running collector would have been very
> > advantageous. For example, I had one case where we had a hard 16ms time
> > budget on bursts of fixed units of work. In that case the work could be
> > completed in 5 or 6 ms, which left 10ms for GC (should it decide to run,
> > and it undoubtedly would have run in the middle of the 5-6ms burst of
> > work), but we could never get the collector to run consistently in under
> > 10ms without risking an OOME. I’m pretty convinced that a continuous
> > collection policy would have helped us meet that goal and we had plenty
> > of CPU to throw at the problem so….

I think another hardware-related problem is memory, not so much CPU. As
you mention, there are typically always enough CPUs to throw at the
problem. Increasing the amount of memory beyond a certain threshold gets
comparatively expensive, and managing to get good memory bandwidth (at
good latency) even more so.

Others probably have more insight into that.

Thanks, and welcome again,
Thomas




