RFR: 8162929: Enqueuing dirty cards into a single DCQS during GC does not scale

Mon Jul 15 21:39:43 UTC 2019

> On Jul 12, 2019, at 9:52 AM, Thomas Schatzl <thomas.schatzl at oracle.com> wrote:
> On Fri, 2019-07-12 at 14:36 +0200, Thomas Schatzl wrote:
>> - it's a bit unfortunate that on the way from the redirty qset to the
>> global qsets (in G1DirtyCardQueueSet::merge_bufferlists) we cannot
>> easily keep a "tail" pointer and count, and then when merging the
>> redirty qset with the dirty qset have to iterate over the BufferNodes
>> to determine them.
>> 
>> This might have some minor (seemingly unnecessary) impact on huge
>> loads like we discussed internally when doing perf testing. Probably
>> not worth the effort dealing with here, given that you may simply
>> increase the buffer size there.
>> I will do some cursory tests, with probably no detrimental outcome
>> though.
> 
> Merging the bufferlists (measured by adding a GCTraceTime scoped object
> before the call) takes up to ~2.5ms, which is up to 0.5% of total pause
> time, on BigRAMTester 50G with 1M regions (a somewhat contrived example
> in many ways).
> It's not much better with 32M regions either, unfortunately, as GC
> itself is faster there (around ~0.3% of total pause time, a little bit
> less contrived).
> 
> Is there a way to improve that?

Should be fixed now (in open.01 webrev).

> (The previous merge code took single-digit microseconds.)
> 
> Thanks,
>  Thomas
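
For readers following the thread, the cost being discussed is the difference between splicing two singly linked lists with a cached tail pointer and count, and walking one list to find them. The following is only a minimal sketch of that idea, not the HotSpot code or the fix in the webrev: Node, NodeList, push, prepend and walk_to_tail are hypothetical names made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a buffer node; this is not HotSpot's BufferNode.
struct Node {
  Node* next = nullptr;
};

// A singly linked list that remembers its tail and its length.
struct NodeList {
  Node* head = nullptr;
  Node* tail = nullptr;
  std::size_t count = 0;

  // Push one node at the front, keeping tail and count up to date.
  void push(Node* n) {
    n->next = head;
    head = n;
    if (tail == nullptr) tail = n;
    ++count;
  }

  // Splice all of `other` in front of this list in O(1); `other` ends up empty.
  void prepend(NodeList& other) {
    if (other.head == nullptr) return;
    assert(other.tail != nullptr && other.tail->next == nullptr);
    other.tail->next = head;
    head = other.head;
    if (tail == nullptr) tail = other.tail;
    count += other.count;
    other = NodeList{};
  }
};

// Without a cached tail and count, a merge first needs this O(n) walk.
inline std::size_t walk_to_tail(Node* head, Node** tail_out) {
  std::size_t n = 0;
  Node* last = nullptr;
  for (Node* cur = head; cur != nullptr; cur = cur->next) {
    last = cur;
    ++n;
  }
  *tail_out = last;
  return n;
}
```

With the tail and count carried along, prepend does a constant amount of work regardless of list length, whereas walk_to_tail touches every node, which is the kind of per-node cost that grows with the number of buffers on large heaps.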