RFR (M) CR 8050147: StoreLoad barrier interferes with stack usages
Aleksey Shipilev
aleksey.shipilev at oracle.com
Wed Jul 23 18:38:51 UTC 2014
On 07/23/2014 01:38 AM, Aleksey Shipilev wrote:
> * 1x4x1 Intel Core (Haswell-DT) benefits from offsetting %esp as well.
> There is an interesting hump on lower backoffs with addl (%esp-8), which
> seems to be a second-order microarchitectural effect. Unfortunately, we
> don't have large Haswells available at the moment to dive into this:
> http://cr.openjdk.java.net/~shade/8050147/micros/Intel-Haswell.data.png
Finally managed to get the profile on this Haswell. The hottest loops
are identical apart from the offset used by lock addl, yet the second
variant is consistently faster.
This produces 1200.318 +- 18.520 ns/op:
2.65% 0x00007fdcb87190b0: mov (%rsp),%r10
0x00007fdcb87190b4: mov 0xc(%r10),%r10d
0.06% 0x00007fdcb87190b8: mov %r10d,0xc(%rbp)
2.66% 0x00007fdcb87190bc: lock addl $0x0,-0x8(%rsp)
80.08% 95.59% 0x00007fdcb87190c2: mov %rbp,%rsi
0x00007fdcb87190c5: xchg %ax,%ax
0x00007fdcb87190c7: callq 0x00007fdcb85d7fe0
0x00007fdcb87190cc: test %eax,%eax
0x00007fdcb87190ce: jne 0x00007fdcb87190b0
This produces 891.427 +- 15.277 ns/op:
3.52% 0.02% 0x00007ffa3918f030: mov (%rsp),%r10
0x00007ffa3918f034: mov 0xc(%r10),%r10d
0.07% 0x00007ffa3918f038: mov %r10d,0xc(%rbp)
3.68% 0x00007ffa3918f03c: lock addl $0x0,-0x48(%rsp)
76.54% 97.13% 0x00007ffa3918f042: mov %rbp,%rsi
0x00007ffa3918f045: xchg %ax,%ax
0x00007ffa3918f047: callq 0x00007ffa39045fe0
0x00007ffa3918f04c: test %eax,%eax
0x00007ffa3918f04e: jne 0x00007ffa3918f030
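As a quick sanity check on the two scores above, here is a small sketch that computes the relative gap and checks whether the error intervals overlap. It assumes the "+-" figures are symmetric error margins around the mean; the function and variable names are made up for illustration and are not part of any benchmark harness.

```python
# Scores quoted above, in ns/op (mean, error margin).
# Assumption: "+-" denotes a symmetric interval around the mean.
slow_mean, slow_err = 1200.318, 18.520   # lock addl $0x0,-0x8(%rsp)
fast_mean, fast_err = 891.427, 15.277    # lock addl $0x0,-0x48(%rsp)


def interval(mean, err):
    """Return the (low, high) bounds of mean +- err."""
    return (mean - err, mean + err)


def relative_improvement(slow, fast):
    """Fraction of the slower time that the faster variant saves."""
    return (slow - fast) / slow


def intervals_overlap(a, b):
    """True if the closed intervals a and b share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


slow_iv = interval(slow_mean, slow_err)
fast_iv = interval(fast_mean, fast_err)

print("improvement: {:.1%}".format(relative_improvement(slow_mean, fast_mean)))
print("intervals overlap:", intervals_overlap(slow_iv, fast_iv))
```

Under that assumption the gap is far larger than either error margin, so the difference is unlikely to be run-to-run noise.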
Full logs here:
http://cr.openjdk.java.net/~shade/8050147/haswell-barrier2.perfasm
http://cr.openjdk.java.net/~shade/8050147/haswell-barrier3.perfasm
It is puzzling to me why we see a difference here. In the logs above,
you can see the second-hottest method, looping(), which is called from
this busy loop. It does:
2.60% 0x00007fdcb871b200: sub $0x18,%rsp
0x00007fdcb871b207: mov %rbp,0x10(%rsp)
...which seems to callee-save %rbp to (%rsp-8)? Any pointers on how we
manage the stack in callees? It would seem we need to step back further
than -8 to dodge callee-saves, but how far?
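To make the offsets concrete, here is a rough sketch of the stack arithmetic for the prologue quoted above. It assumes x86-64, where callq pushes an 8-byte return address, and it models only the two stores visible in the quoted snippets (the return address and the %rbp save); other spills in looping() or deeper callees are not modelled. All names are hypothetical. Offsets are measured from the caller's %rsp at the time of the lock addl, so they differ by 8 from offsets measured from %rsp on entry to looping().

```python
# Assumptions (not taken from HotSpot sources): x86-64, 8-byte return
# address pushed by callq, and the frame layout from the quoted prologue:
#   sub $0x18,%rsp ; mov %rbp,0x10(%rsp)
RET_ADDR_SIZE = 8
FRAME_SIZE = 0x18
RBP_SLOT = 0x10
BARRIER_WIDTH = 4  # addl operates on a 4-byte memory operand


def callee_writes():
    """Byte ranges (start, end) written around the call, as offsets from
    the caller's %rsp at the lock addl."""
    rsp = 0
    rsp -= RET_ADDR_SIZE                      # callq pushes return address
    ret_addr = (rsp, rsp + RET_ADDR_SIZE)
    rsp -= FRAME_SIZE                         # sub $0x18,%rsp
    saved_rbp = (rsp + RBP_SLOT, rsp + RBP_SLOT + 8)  # mov %rbp,0x10(%rsp)
    return {"return address": ret_addr, "saved %rbp": saved_rbp}


def overlaps(a, b):
    """True if half-open byte ranges a and b intersect."""
    return a[0] < b[1] and b[0] < a[1]


def touched_slots(barrier_offset):
    """Names of modelled slots that the barrier's memory operand overlaps."""
    barrier = (barrier_offset, barrier_offset + BARRIER_WIDTH)
    return [name for name, rng in callee_writes().items()
            if overlaps(barrier, rng)]


for off in (-0x8, -0x48):
    print(hex(off), callee_writes(), touched_slots(off))
```

In this model the -0x8 barrier lands on the slot that callq is about to overwrite with the return address, while the %rbp save sits one slot lower; the -0x48 barrier is below the whole 0x18-byte frame. Whether that overlap is what actually causes the slowdown on Haswell is not something this arithmetic can show.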
It seems odd that this affects Haswell so much. I've checked on my
SandyBridge laptop, and we have the same code, but performance is
consistent. Barring that, it would seem like some second-order
microarchitectural effect on Haswell. ...which makes me say this is the
mode we should switch to:
> ...or "lock addl (%esp-CL-8), 0", pessimistically padding away from
> stack users:
> http://cr.openjdk.java.net/~shade/8050147/webrev.02/
Thanks,
-Aleksey.