RFC: TLAB allocation and garbage-first policy

Wed Sep 20 12:18:11 UTC 2017

> Now to the fun part about our collection policy. Our collector policy selects the regions by
> garbage, where garbage = used - live. So, if you have the fragmented 16M region with used=128K, and
> live=128K, it is exactly 0K garbage -- the least probable candidate. So the region that becomes
> fragmented due to the race in TLAB machinery is also never considered for collection, because it is
> below the ShenandoahGarbageThreshold!

Should this region be added back to free set after GC and be reused?

-Zhengyu

> 
> This race further widens when we bias the TLAB and GCLAB allocations to different sides of the heap,
> and GCLABs take the biggest hit. You can clearly see the anomaly in Visualizer after 10+ minutes of
> LRUFragger run with 50 GB LDS on 100 GB heap (...and it drives into Full GC shortly afterwards,
> because free set got depleted due to fragmentation!):
>    http://cr.openjdk.java.net/~shade/shenandoah/wip-tlab-race/baseline-1.png
> 
> Therefore, I propose we choose the regions by *live size*, not by *garbage*, so that we can recover
> by collecting (and evacuating) the regions with low live data, not only those with high garbage. This should
> help to recuperate from TLAB losses better. For full regions, both metrics yield the same result.
> For half-full regions, we would have a chance to compact them into mostly-full, leaving more
> fully-empty regions around.
> 
> I mused about this on IRC yesterday, and today I see G1 does the same, see
> CollectionSetChooser::should_add:
> 
>    bool should_add(HeapRegion* hr) {
>      assert(hr->is_marked(), "pre-condition");
>      assert(!hr->is_young(), "should never consider young regions");
>      return !hr->is_pinned() &&
>              hr->live_bytes() < _region_live_threshold_bytes;  // <----- here
>    }
> 
> ...and probably with the same rationale? Found these bugs:
>    https://bugs.openjdk.java.net/browse/JDK-7132029
>    https://bugs.openjdk.java.net/browse/JDK-7146242
> 
> Prototype fix:
> 
>    virtual bool region_in_collection_set(ShenandoahHeapRegion* r, size_t immediate_garbage) {
>      size_t threshold = ShenandoahHeapRegion::region_size_bytes() * ShenandoahGarbageThreshold / 100;
> -   return r->garbage() > threshold;
> +   if (UseNewCode) {
> +     return (ShenandoahHeapRegion::region_size_bytes() - r->get_live_data_bytes()) > threshold;
> +   } else {
> +     return r->garbage() > threshold;
> +   }
>    }
> 
> ...makes the issue disappear on the same workload running for 30+ minutes (and no Full GCs!):
>   http://cr.openjdk.java.net/~shade/shenandoah/wip-tlab-race/patched-1.png
> 
> Thoughts?
> 
> Thanks,
> -Aleksey
> 
>
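
To make the difference between the two metrics concrete, here is a minimal
standalone sketch. It is not HotSpot code: the Region struct, the helper names,
and the 60% threshold are assumptions made up for illustration. Only the 16M
region size and the two formulas (used - live versus region_size - live) come
from the discussion above.

```cpp
#include <cstddef>

// Hypothetical stand-in for a heap region; NOT the real ShenandoahHeapRegion.
struct Region {
  std::size_t used;  // bytes allocated in the region
  std::size_t live;  // bytes found live by marking
};

constexpr std::size_t K = 1024;
constexpr std::size_t M = 1024 * K;
constexpr std::size_t REGION_SIZE = 16 * M;        // region size from the example above
constexpr std::size_t GARBAGE_THRESHOLD_PCT = 60;  // assumed value, for illustration only

// Same shape as the threshold computation in the prototype fix.
constexpr std::size_t threshold_bytes() {
  return REGION_SIZE * GARBAGE_THRESHOLD_PCT / 100;
}

// Current policy: select on garbage = used - live.
constexpr bool select_by_garbage(const Region& r) {
  return (r.used - r.live) > threshold_bytes();
}

// Proposed policy: select on region_size - live, i.e. on low live data.
constexpr bool select_by_live(const Region& r) {
  return (REGION_SIZE - r.live) > threshold_bytes();
}
```

Under these assumptions, the fragmented region with used=128K and live=128K has
0K garbage, so select_by_garbage rejects it, while select_by_live accepts it
because 16M - 128K is well above the threshold. For a full region (used=16M)
the two checks compute the same quantity and agree.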