RFR (M) CR 8050147: StoreLoad barrier interferes with stack usages

Wed Jul 30 17:20:21 UTC 2014

On 07/23/2014 11:38 AM, Aleksey Shipilev wrote:
> On 07/23/2014 01:38 AM, Aleksey Shipilev wrote:
>>  * 1x4x1 Intel Core (Haswell-DT) benefits from offsetting %esp as well.
>> There is an interesting hump on lower backoffs with addl (%esp-8), which
>> seems to be a second-order microarchitectural effect. Unfortunately, we
>> don't have large Haswells available at the moment to dive into this:
>>   http://cr.openjdk.java.net/~shade/8050147/micros/Intel-Haswell.data.png
> 
> Finally managed to get the profile on this Haswell. The hottest loops
> are identical apart from the barrier offset, yet the second variant is
> consistently better.
> 
> This produces 1200.318 +- 18.520 ns/op:
> 
>   2.65%             0x00007fdcb87190b0: mov    (%rsp),%r10
>                     0x00007fdcb87190b4: mov    0xc(%r10),%r10d
>   0.06%             0x00007fdcb87190b8: mov    %r10d,0xc(%rbp)
>   2.66%             0x00007fdcb87190bc: lock addl $0x0,-0x8(%rsp)
>  80.08%   95.59%    0x00007fdcb87190c2: mov    %rbp,%rsi
>                     0x00007fdcb87190c5: xchg   %ax,%ax
>                     0x00007fdcb87190c7: callq  0x00007fdcb85d7fe0
>                     0x00007fdcb87190cc: test   %eax,%eax
>                     0x00007fdcb87190ce: jne    0x00007fdcb87190b0
> 
> This produces 891.427 +- 15.277 ns/op:
> 
>   3.52%    0.02%    0x00007ffa3918f030: mov    (%rsp),%r10
>                     0x00007ffa3918f034: mov    0xc(%r10),%r10d
>   0.07%             0x00007ffa3918f038: mov    %r10d,0xc(%rbp)
>   3.68%             0x00007ffa3918f03c: lock addl $0x0,-0x48(%rsp)
>  76.54%   97.13%    0x00007ffa3918f042: mov    %rbp,%rsi
>                     0x00007ffa3918f045: xchg   %ax,%ax
>                     0x00007ffa3918f047: callq  0x00007ffa39045fe0
>                     0x00007ffa3918f04c: test   %eax,%eax
>                     0x00007ffa3918f04e: jne    0x00007ffa3918f030
> 
> Full logs here:
>  http://cr.openjdk.java.net/~shade/8050147/haswell-barrier2.perfasm
>  http://cr.openjdk.java.net/~shade/8050147/haswell-barrier3.perfasm
> 
> It is puzzling to me why we have the difference here. In the logs
> there, you may see the second-hottest method, looping(), which is called
> in this busy loop. It does:
> 
>  2.60%             0x00007fdcb871b200: sub    $0x18,%rsp
>                    0x00007fdcb871b207: mov    %rbp,0x10(%rsp)
> 
> ...which seems to callee-save %rbp to (%rsp-8)? Any pointers on how we
> manage the stack in callees? It would seem we need to step back more
> than -8 to dodge callee-saves, but how much?
> 
> It seems odd this affects Haswell so much. I've checked on my
> SandyBridge laptop, and we have the same code, but performance is
> consistent. Barring that, it would seem like some second-order
> microarchitectural effect on Haswell. ...which makes me say this is the
> mode we should switch to:
> 
>> ...or "lock addl (%esp-CL-8), 0", pessimistically padding away from
>> stack users:
>>   http://cr.openjdk.java.net/~shade/8050147/webrev.02/
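
For the stack question above, here is a rough sketch of the offset
arithmetic, not a statement about what HotSpot actually does. It assumes
that the two-instruction prologue of looping() quoted above is the whole
frame setup, that callq pushes an 8-byte return address (x86-64), and that
CL in "%esp-CL-8" means a 64-byte cache line, which is my guess based on
-0x48 in the second listing. All function names are made up for
illustration. Offsets are relative to the caller's %rsp at the barrier.

```python
RET_ADDR_SIZE = 8  # callq pushes an 8-byte return address on x86-64


def interval(start, size):
    """Half-open byte range [start, start + size)."""
    return (start, start + size)


def overlaps(a, b):
    """True if two half-open byte ranges share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


def callee_slots(frame_size=0x18, rbp_slot=0x10):
    """Stack slots written on entry to the callee.

    frame_size and rbp_slot come from the quoted prologue:
      sub $0x18,%rsp ; mov %rbp,0x10(%rsp)
    """
    callee_rsp = -RET_ADDR_SIZE - frame_size
    return {
        "return address": interval(-RET_ADDR_SIZE, RET_ADDR_SIZE),
        "saved %rbp": interval(callee_rsp + rbp_slot, 8),
    }


def barrier_hits(barrier_offset):
    """Names of callee-written slots touched by lock addl $0,offset(%rsp)."""
    touched = interval(barrier_offset, 4)  # addl is a 4-byte access
    return sorted(
        name for name, slot in callee_slots().items() if overlaps(touched, slot)
    )


def padded_offset(cache_line=64):
    """Offset for the '%esp-CL-8' variant; cache_line=64 is an assumption."""
    return -(cache_line + 8)


if __name__ == "__main__":
    for off in (-0x8, padded_offset()):
        print(hex(off), barrier_hits(off))
```

Under these assumptions, the saved %rbp lands at -0x10..-0x8, and the
-0x8 barrier instead overlaps the slot where callq writes the return
address, while -0x48 sits below looping()'s frame entirely. Deeper callees
could still reach it, so this does not settle how far back is enough.
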

Ping. Anyone?

-Aleksey.