RFR (M) CR 8050147: StoreLoad barrier interferes with stack usages

Wed Jul 23 18:38:51 UTC 2014

On 07/23/2014 01:38 AM, Aleksey Shipilev wrote:
>  * 1x4x1 Intel Core (Haswell-DT) benefits from offsetting %esp as well.
> There is an interesting hump on lower backoffs with addl (%esp-8), which
> seems to be a second-order microarchitectural effect. Unfortunately, we
> don't have large Haswells available at the moment to dive into this:
>   http://cr.openjdk.java.net/~shade/8050147/micros/Intel-Haswell.data.png

Finally managed to get the profile on this Haswell. The hottest loops
are identical except for the displacement in the lock addl, yet the
second variant is consistently better.

This produces 1200.318 +- 18.520 ns/op:

  2.65%             0x00007fdcb87190b0: mov    (%rsp),%r10
                    0x00007fdcb87190b4: mov    0xc(%r10),%r10d
  0.06%             0x00007fdcb87190b8: mov    %r10d,0xc(%rbp)
  2.66%             0x00007fdcb87190bc: lock addl $0x0,-0x8(%rsp)
 80.08%   95.59%    0x00007fdcb87190c2: mov    %rbp,%rsi
                    0x00007fdcb87190c5: xchg   %ax,%ax
                    0x00007fdcb87190c7: callq  0x00007fdcb85d7fe0
                    0x00007fdcb87190cc: test   %eax,%eax
                    0x00007fdcb87190ce: jne    0x00007fdcb87190b0

This produces 891.427 +- 15.277 ns/op:

  3.52%    0.02%    0x00007ffa3918f030: mov    (%rsp),%r10
                    0x00007ffa3918f034: mov    0xc(%r10),%r10d
  0.07%             0x00007ffa3918f038: mov    %r10d,0xc(%rbp)
  3.68%             0x00007ffa3918f03c: lock addl $0x0,-0x48(%rsp)
 76.54%   97.13%    0x00007ffa3918f042: mov    %rbp,%rsi
                    0x00007ffa3918f045: xchg   %ax,%ax
                    0x00007ffa3918f047: callq  0x00007ffa39045fe0
                    0x00007ffa3918f04c: test   %eax,%eax
                    0x00007ffa3918f04e: jne    0x00007ffa3918f030

Full logs here:
 http://cr.openjdk.java.net/~shade/8050147/haswell-barrier2.perfasm
 http://cr.openjdk.java.net/~shade/8050147/haswell-barrier3.perfasm
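
To put a number on the gap between the two runs above, here is a quick
back-of-the-envelope check. It is only a sketch: it treats the +- figures
as symmetric error bounds, which is an assumption about how the harness
reports them, and the helper names are made up for illustration.

```python
# Scores copied from the two runs above: (mean ns/op, reported +- error).
near = (1200.318, 18.520)   # lock addl $0x0,-0x8(%rsp)
far = (891.427, 15.277)     # lock addl $0x0,-0x48(%rsp)


def interval(score):
    """Return (low, high), treating the error as a symmetric bound (assumption)."""
    mean, err = score
    return mean - err, mean + err


def intervals_overlap(a, b):
    """True if the two error intervals share any point."""
    a_lo, a_hi = interval(a)
    b_lo, b_hi = interval(b)
    return a_lo <= b_hi and b_lo <= a_hi


slowdown = near[0] / far[0]            # how much slower the -0x8 variant is
time_saved = 1.0 - far[0] / near[0]    # fraction of time saved by -0x48

print(f"-0x8 is {slowdown:.2f}x slower; -0x48 saves {time_saved:.1%} per op")
print("intervals overlap:", intervals_overlap(near, far))
```

So the -0x48 variant takes roughly a quarter less time per op, and the
reported error ranges do not come close to overlapping.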

It puzzles me why we see a difference here. In those logs, you can see
the second-hottest method, looping(), which is called in this busy loop.
It does:

 2.60%             0x00007fdcb871b200: sub    $0x18,%rsp
                   0x00007fdcb871b207: mov    %rbp,0x10(%rsp)

...which seems to callee-save %rbp to (%rsp-8), counting from %rsp on
entry? Any pointers on how we manage the stack in callees? It would seem
we need to step back more than -8 to dodge callee-saves, but how much?
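
For what it's worth, the address arithmetic can be sketched as below. The
assumptions: the caller's %rsp at the barrier is some value S (picked
arbitrarily here), callq pushes an 8-byte return address, and the
two-instruction prologue above is all that looping() does to set up its
frame. All names are illustrative only.

```python
S = 0x7FFF_0000_1000          # hypothetical caller %rsp at the barrier

# Caller side: the two barrier targets from the profiles.
barrier_near = S - 0x8        # lock addl $0x0,-0x8(%rsp)
barrier_far = S - 0x48        # lock addl $0x0,-0x48(%rsp)

# callq pushes the return address just below the caller's %rsp,
# so the callee starts with %rsp pointing at that slot.
return_addr_slot = S - 8
callee_entry_rsp = return_addr_slot

# Callee prologue: sub $0x18,%rsp ; mov %rbp,0x10(%rsp)
callee_rsp = callee_entry_rsp - 0x18
saved_rbp_slot = callee_rsp + 0x10


def rel(addr):
    """Format an address as a signed hex offset from the caller's %rsp."""
    off = addr - S
    return f"{'-' if off < 0 else '+'}{abs(off):#x}"


for name, addr in [
    ("barrier (near)", barrier_near),
    ("return address", return_addr_slot),
    ("saved %rbp", saved_rbp_slot),
    ("callee %rsp", callee_rsp),
    ("barrier (far)", barrier_far),
]:
    print(f"{name:15s} caller %rsp {rel(addr)}")
```

Under these assumptions, -0x8(%rsp) in the caller is the very slot that
callq then writes the return address into, the %rbp save lands at -0x10
(that is, 8 below %rsp on entry to looping()), and -0x48 sits below this
callee's frame. It says nothing about deeper callees, nor about what the
hardware makes of it.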

It seems odd that this affects Haswell so much. I've checked on my
SandyBridge laptop, and we have the same code, but the performance is
consistent between the two variants. Given that, it looks like some
second-order microarchitectural effect on Haswell, which makes me say
this is the mode we should switch to:

> ...or "lock addl (%esp-CL-8), 0), pessimistically padding away from
> stack users:
>   http://cr.openjdk.java.net/~shade/8050147/webrev.02/
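
As a sanity check on the proposed offset, here is the arithmetic spelled
out. This assumes CL stands for the cache line size and that it is 64
bytes on this machine; the helper name is made up, and this is not the
code from the webrev.

```python
def barrier_operand(pad_bytes, reg="%rsp"):
    """AT&T-style memory operand placed (pad_bytes + 8) below the stack pointer."""
    offset = pad_bytes + 8
    return f"-{offset:#x}({reg})"


print(barrier_operand(0))    # no padding
print(barrier_operand(64))   # one assumed 64-byte cache line of padding
```

With 64 bytes of padding this yields the -0x48(%rsp) operand seen in the
faster profile; with no padding it yields -0x8(%rsp), as in the slower
one.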

Thanks,
-Aleksey.