Perf: SATB and WB coalescing

Thu Jan 11 11:51:58 UTC 2018

On 11.01.2018 at 12:35, Aleksey Shipilev wrote:
> On 01/11/2018 12:19 PM, Roman Kennke wrote:
>> On 11.01.2018 at 11:51, Aleksey Shipilev wrote:
>>> On 01/10/2018 09:29 PM, Aleksey Shipilev wrote:
>>>> Okay, so the dirty patch for the idea:
>>>>     http://cr.openjdk.java.net/~shade/shenandoah/single-flag/webrev.00/
>>>>
>>>>
>>>> perfasm for the offending test:
>>>>     http://cr.openjdk.java.net/~shade/shenandoah/single-flag/single-flag.perfasm
>>>>
>>>>    *) Can we instruct the compiler to trust the value of 0x3d8(%r15) until the next safepoint
>>>> poll? I think that would eliminate excessive L1 accesses for that TLS field at the expense of
>>>> wasting a register -- which might be the lesser evil;
>>>
>>> Hey, this one works with the dirty hack like this:
>>>     http://cr.openjdk.java.net/~shade/shenandoah/perf-wb-satb/common-single-flag.patch
>>>
>>> It now commons the GC state loads (and keeps the value in a register):
>>>     http://cr.openjdk.java.net/~shade/shenandoah/perf-wb-satb/WB-SATB-commonTLS.perfasm
>>>
>>
>> Ok, this basically makes the load of the flag appear to access immutable memory, so it can now
>> float freely above or below safepoints. We need to ensure that this cannot happen, otherwise
>> we'll see the wrong flag state. But it seems to be step #1. Maybe restoring the control on the
>> LoadUBNode is enough to keep it on the right side of safepoints?
> 
> That was basically a hack to see if the idea is profitable. It appears profitable. In addition to
> that safepoint caveat, I had to disable WB coalescing, because the hack produces a broken graph
> otherwise, and C2 asserts. Roland said he can sketch the real patch some time later. Meanwhile,
> I'll go and prepare the single-flag base patch that the TLS coalescing implicitly relies on. Once
> the base patch is done, we can try other hacks if Roland has no cycles to look at it.
> 
> -Aleksey
> 

Yeah ok. I tried your hack with traversal GC. It does work, and I think 
I see a small improvement, but I guess the disabled WB coalescing 
offsets it a little.
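
For reference, the idea and the safepoint caveat can be modelled in plain 
Java roughly as below. This is only a toy sketch, not the C2 patch: the 
class, the methods, and the simulated safepoint poll are all made-up 
names, and a static field stands in for the thread-local GC state flag. 
It counts flag loads for a barrier that re-reads the flag on every 
access, for one that reloads it only after each safepoint poll, and for 
one that hoists the load above all polls and therefore misses the state 
change.

```java
// Toy model only: every name here is hypothetical, nothing is taken from HotSpot.
final class GcStateCachingSketch {
    static byte gcState;   // stand-in for the thread-local GC state flag
    static int flagLoads;  // number of simulated loads of that flag

    static void reset() { gcState = 0; flagLoads = 0; }

    static byte loadGcState() { flagLoads++; return gcState; }

    // Simulated safepoint poll: the only place where the GC may flip the flag.
    static void safepointPoll(int i) { if (i == 5) gcState = 1; }

    // Barrier loads the flag on every access.
    static int naive(int iterations, int accesses) {
        int slowPaths = 0;
        for (int i = 0; i < iterations; i++) {
            for (int a = 0; a < accesses; a++) {
                if (loadGcState() != 0) slowPaths++;
            }
            safepointPoll(i);
        }
        return slowPaths;
    }

    // Flag is kept in a local and reloaded only after each safepoint poll.
    static int cachedUntilSafepoint(int iterations, int accesses) {
        int slowPaths = 0;
        byte state = loadGcState();
        for (int i = 0; i < iterations; i++) {
            for (int a = 0; a < accesses; a++) {
                if (state != 0) slowPaths++;
            }
            safepointPoll(i);
            state = loadGcState(); // the load must not float across the poll
        }
        return slowPaths;
    }

    // Broken variant: the load floats above every poll, so the flip is never seen.
    static int hoistedAboveSafepoints(int iterations, int accesses) {
        int slowPaths = 0;
        byte state = loadGcState();
        for (int i = 0; i < iterations; i++) {
            for (int a = 0; a < accesses; a++) {
                if (state != 0) slowPaths++;
            }
            safepointPoll(i);
        }
        return slowPaths;
    }

    public static void main(String[] args) {
        reset();
        System.out.println("naive:  slow paths " + naive(10, 4) + ", loads " + flagLoads);
        reset();
        System.out.println("cached: slow paths " + cachedUntilSafepoint(10, 4) + ", loads " + flagLoads);
        reset();
        System.out.println("broken: slow paths " + hoistedAboveSafepoints(10, 4) + ", loads " + flagLoads);
    }
}
```

In this model the cached variant takes the same slow paths as the naive 
one with far fewer flag loads, while the hoisted variant silently takes 
none, which is the "wrong flag state" problem mentioned above.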

I'll clean up the traversal GC and propose it soon-ish. It's not useful 
to have it wait in limbo until all possible optimizations are in place. 
Performance is already quite good (it exceeds default Shenandoah on 
some workloads, and loses on some others).

Thanks and cheers,
Roman