RFC: TLAB allocation and garbage-first policy
Christine Flood
cflood at redhat.com
Wed Sep 20 12:44:02 UTC 2017
I can think of several solutions. One would be to cap the max tlab size as
we discussed yesterday. Having a tlab be an entire region has some nice
performance characteristics but isn't really necessary, nor is it in the
spirit of tlabs.
Another potential solution would be to treat these regions specially. When
a tlab allocation fails in a region, we could fill the remainder of that
region with a filler array, so the wasted space now counts as garbage. This
differs from your solution in that regular regions, which still have
normal-sized tlab space available, aren't going to get prematurely compacted.
One common measure of GC performance is bytes copied vs bytes reclaimed. I
will grant you that in your particular situation your solution looks
attractive, but in a myriad of other situations you are actually
pessimizing GC performance in at least one metric.
Christine
On Wed, Sep 20, 2017 at 8:18 AM, Zhengyu Gu <zgu at redhat.com> wrote:
>
>> Now to the fun part about our collection policy. Our collector policy
>> selects the regions by most
>> garbage, where garbage = used - live. So, if you have the fragmented 16M
>> region with used=128K, and
>> live=128K, it is exactly 0K garbage -- the least probable candidate. So
>> the region that became
>> fragmented due to the race in TLAB machinery is also never considered for
>> collection, because it is
>> below the ShenandoahGarbageThreshold!
>>
>
> Should this region be added back to free set after GC and be reused?
>
> -Zhengyu
>
>
>
>> This race further widens when we bias the TLAB and GCLAB allocations to
>> different sides of the heap,
>> and GCLABs take the most hit. You can clearly see the anomaly in
>> Visualizer after 10+ minutes of
>> LRUFragger run with 50 GB LDS on 100 GB heap (...and it drives into Full
>> GC shortly afterwards,
>> because the free set got depleted due to fragmentation!):
>> http://cr.openjdk.java.net/~shade/shenandoah/wip-tlab-race/baseline-1.png
>>
>> Therefore, I propose we choose the regions by *live size*, not by
>> *garbage*, so that we can recover
>> by collecting (and evacuating) the regions with low live, not exactly
>> with high garbage. This should
>> help to recuperate from TLAB losses better. For full regions, both
>> metrics yield the same result.
>> For half-full regions, we would have a chance to compact them into
>> mostly-full, leaving more
>> fully-empty regions around.
>>
>> I mused about this on IRC yesterday, and today I see G1 does the same, see
>> CollectionSetChooser::should_add:
>>
>> bool should_add(HeapRegion* hr) {
>>   assert(hr->is_marked(), "pre-condition");
>>   assert(!hr->is_young(), "should never consider young regions");
>>   return !hr->is_pinned() &&
>>          hr->live_bytes() < _region_live_threshold_bytes; // <----- here
>> }
>>
>> ...and probably with the same rationale? Found these bugs:
>> https://bugs.openjdk.java.net/browse/JDK-7132029
>> https://bugs.openjdk.java.net/browse/JDK-7146242
>>
>> Prototype fix:
>>
>> virtual bool region_in_collection_set(ShenandoahHeapRegion* r, size_t immediate_garbage) {
>>   size_t threshold = ShenandoahHeapRegion::region_size_bytes() * ShenandoahGarbageThreshold / 100;
>> - return r->garbage() > threshold;
>> + if (UseNewCode) {
>> +   return (ShenandoahHeapRegion::region_size_bytes() - r->get_live_data_bytes()) > threshold;
>> + } else {
>> +   return r->garbage() > threshold;
>> + }
>> }
>>
>> ...makes the issue disappear on the same workload running for 30+ minutes
>> (and no Full GCs!):
>> http://cr.openjdk.java.net/~shade/shenandoah/wip-tlab-race/patched-1.png
>>
>> Thoughts?
>>
>> Thanks,
>> -Aleksey
>>
>>
>>
More information about the shenandoah-dev mailing list