Region set implementation

Aleksey Shipilev ashipile at redhat.com
Mon Apr 17 16:24:26 UTC 2017


Hi,

I think we are at the point where we need to reconsider region set
implementation. There are several triggers for that:

 a) update-refs recycles regions during the post-update-refs pause, but never
reconstructs the free set. Thus ShenandoahHeap and the heuristics believe the
heap has free memory, but the free set runs empty earlier! While this could be
solved by fixing post-update-refs alone, I think the current free set
implementation really invites this kind of mistake;

 b) With the matrix enabled, region recycling becomes the latency hog, because
it does sparse memory accesses on cleanup. We have alleviated this quite a bit
by optimizing the matrix code, but it is not enough. What's worse, this
recycling is done during a STW pause, at least in update-refs and partial.

 c) With active heap mutations, we start to mix the "old" regions that have
untouched data with the recently allocated regions. This is evident in the
Visualizer on some workloads. I think this happens because new allocations and
evacuations are not segregated. So, if there is a dense prefix of "old" regions,
and some region there is evacuated, it now becomes a candidate for *both* new
allocations and evacuations on the next cycle. Obviously, new allocations win
most of the time. Repeat this multiple times, and you will get more and more
newly allocated regions there, fragmenting the prefix further. While this is not
a problem for regular allocations -- the fragmentation is only in the eye of the
human observer here -- it may affect humongous allocations that require large
contiguous blocks of regions.

I was thinking about how to solve these in isolation, but it seems we can hit
all these birds with one stone. Consider the free set not as a _list_ of
regions, but as a coarse bitmap over the indexed heap regions, where each region
is in one of these states:

 (D)irty:     has no used data, but not yet cleaned up
 (F)ree:      has no used data, ready to be allocated from
 (U)sed:      has used data, some TLABs already allocated from it
 (H)umongous: handles humongous objects

In this scheme, there are obvious transitions:
  F->U: regular TLAB allocation
  U->U: (same)
  F->H: humongous allocation
  U->D: reclaiming the normal region
  H->D: reclaiming the humongous region
  D->F: recycling regions
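
Roughly, as a standalone C++ sketch (FreeSet, RegionState and transition() are
made-up names here, not the actual Shenandoah classes): the free set keeps a
dense per-region state array, and every transition funnels through one checked
update.

  // Hypothetical sketch only, not the real Shenandoah code.
  #include <cassert>
  #include <cstddef>
  #include <vector>

  enum class RegionState { Dirty, Free, Used, Humongous };

  class FreeSet {
    std::vector<RegionState> _states;   // dense, indexed by region number

  public:
    explicit FreeSet(size_t num_regions)
      : _states(num_regions, RegionState::Free) {}

    RegionState state(size_t idx) const { return _states[idx]; }

    // All transitions go through here, so they can be asserted
    // (and, in the real thing, protected by the HeapLock).
    void transition(size_t idx, RegionState to) {
      RegionState from = _states[idx];
      switch (to) {
        case RegionState::Used:       // F->U (TLAB alloc), U->U (same)
          assert(from == RegionState::Free || from == RegionState::Used);
          break;
        case RegionState::Humongous:  // F->H: humongous allocation
          assert(from == RegionState::Free);
          break;
        case RegionState::Dirty:      // U->D, H->D: reclaiming the region
          assert(from == RegionState::Used || from == RegionState::Humongous);
          break;
        case RegionState::Free:       // D->F: recycling
          assert(from == RegionState::Dirty);
          break;
      }
      _states[idx] = to;
    }
  };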

Note this probably means pulling out the per-region flags we have now into
ShenandoahHeap itself. That makes sense to me, as it lets us protect the
transitions with the HeapLock. The operations over this region set always (?)
scan the bitmap looking for regions with interesting bits.

The problem with update-refs is solved by only requiring users to announce that
a region is D, and letting the free set mechanics figure out the rest of the
transitions.

The problem with matrix cleanup is solved by handling D->F transitions
asynchronously. D->F transitions would require the HeapLock, but not a STW
pause! This means we can mark regions D under STW, like we do now in final mark
and post-update-refs, and then let a concurrent thread recycle regions to F. If
a mutator tries to allocate but the D->F cleanup lags behind, the mutator itself
can acquire the HeapLock, clean up _one_ candidate region, and then allocate
from it, thus simplifying races against recycling.
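
A minimal sketch of that allocation path, under the same assumptions as the
sketch above (std::mutex stands in for the HeapLock, recycle_region() is a
placeholder for the actual per-region cleanup):

  // Hypothetical sketch, not the actual Shenandoah allocation path.
  #include <cstddef>
  #include <mutex>
  #include <optional>
  #include <vector>

  enum class RegionState { Dirty, Free, Used, Humongous };

  struct FreeSet {
    std::mutex heap_lock;               // stands in for the HeapLock
    std::vector<RegionState> states;

    // Clears per-region metadata (marking bitmaps, matrix rows, ...).
    // Deliberately per-region, so a mutator only pays for the one
    // region it is about to use.
    void recycle_region(size_t idx) {
      states[idx] = RegionState::Free;  // D->F
    }

    std::optional<size_t> allocate_region() {
      std::lock_guard<std::mutex> guard(heap_lock);
      // Fast path: concurrent recycling already produced a Free region.
      for (size_t i = 0; i < states.size(); i++) {
        if (states[i] == RegionState::Free) {
          states[i] = RegionState::Used;   // F->U
          return i;
        }
      }
      // Slow path: D->F cleanup lags behind; recycle exactly one Dirty
      // region under the lock and allocate from it.
      for (size_t i = 0; i < states.size(); i++) {
        if (states[i] == RegionState::Dirty) {
          recycle_region(i);               // D->F
          states[i] = RegionState::Used;   // F->U
          return i;
        }
      }
      return std::nullopt;                 // genuinely out of regions
    }
  };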

The problem with fragmentation is solved by the bitmap being very dense, so
that we can afford to rescan it on allocations. This would allow us to allocate
consistently in the lower regions, or even _bias_ the allocations for TLABs and
GCLABs towards opposite ends of the heap, thus managing fragmentation even
better.
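
For example (again with made-up names, only a sketch), TLAB refills could scan
the state array from the bottom of the heap and GCLAB refills from the top,
keeping mutator and evacuation allocations at opposite ends:

  // Hypothetical sketch of biased region selection.
  #include <cstddef>
  #include <optional>
  #include <vector>

  enum class RegionState { Dirty, Free, Used, Humongous };

  // Pick a Free region: scan from the low end for mutator TLABs, and
  // from the high end for GCLABs, so the two populations stay apart.
  std::optional<size_t> pick_free_region(const std::vector<RegionState>& states,
                                         bool for_gclab) {
    if (!for_gclab) {
      for (size_t i = 0; i < states.size(); i++) {
        if (states[i] == RegionState::Free) return i;
      }
    } else {
      for (size_t i = states.size(); i-- > 0; ) {
        if (states[i] == RegionState::Free) return i;
      }
    }
    return std::nullopt;
  }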

Maybe this even plays well with NUMA rebiasing, if we introduce additional
intermediate states, like Fx, where "x" is the NUMA node index to which we have
rebiased the region during the D->F transition.

Thoughts? Any takers?

Thanks,
-Aleksey


