RFR (M) CR 8050147: StoreLoad barrier interferes with stack usages

Tue Jul 22 21:38:03 UTC 2014

(the patch itself is XS, but the explanation is M)

Hi,

This is a follow-up for the issue discovered earlier. In tight
performance-sensitive loops, the StoreLoad barrier in the form of "lock addl
(%esp+0), 0" interferes with stack users:
  https://bugs.openjdk.java.net/browse/JDK-8050147

I used the experimental patch:
  https://bugs.openjdk.java.net/browse/JDK-8050149

...to juggle different StoreLoad barrier strategies:

  1) mfence:
       we know it is slow on almost all platforms, keep this as control

  2) lock addl (%esp), 0:
       current default

  3) lock addl (%esp-8), 0:
       unties the data dependency against (%esp) users

  4) lock addl (%esp-CacheLine-8), 0:
       unties the data dependency, and also steps back a cache line
       to untie possible memory contention of (%esp) users and our
       locked instructions

  5) lock addl (%esp-RedZone-8), 0:
       Not sure how it interacts with our calling convention, but the
       System V ABI tells us there is a "red zone" of 128 bytes below the
       stack pointer, which is always preserved by interrupt handlers,
       etc. It seems we can use the red zone for scratch data. Dave Dice
       suggests that idempotent operations in the red zone are benign
       anyway. But in case we have a problem with the red zone, we can
       fall back to this mode.
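
To make the offsets in the list above concrete, here is a minimal Java
sketch that computes the stack-pointer displacement for each strategy and
prints the instruction in the notation used in this message. The class,
enum, and method names are hypothetical, and the 64-byte cache line is an
assumption (the real value is platform-dependent); only the 128-byte red
zone comes from the text above. This is not code from the patch.

```java
// Illustrative sketch only: names and constants are assumptions, not taken from the patch.
public class BarrierStrategies {
    static final int CACHE_LINE = 64;   // assumed cache line size, in bytes
    static final int RED_ZONE = 128;    // System V red zone size, as described above

    enum Strategy { MFENCE, ADDL_ESP, ADDL_ESP_8, ADDL_ESP_CL_8, ADDL_ESP_RZ_8 }

    // Displacement from the stack pointer that the locked add touches.
    static int displacement(Strategy s) {
        switch (s) {
            case ADDL_ESP:      return 0;
            case ADDL_ESP_8:    return -8;
            case ADDL_ESP_CL_8: return -CACHE_LINE - 8;
            case ADDL_ESP_RZ_8: return -RED_ZONE - 8;
            default:
                throw new IllegalArgumentException("mfence has no memory operand");
        }
    }

    // Renders the barrier instruction in this message's notation.
    static String render(Strategy s) {
        if (s == Strategy.MFENCE) {
            return "mfence";
        }
        int d = displacement(s);
        return d == 0 ? "lock addl (%esp), 0" : "lock addl (%esp" + d + "), 0";
    }

    public static void main(String[] args) {
        for (Strategy s : Strategy.values()) {
            System.out.println(s + " -> " + render(s));
        }
    }
}
```

Under these assumptions, strategy 4 lands 72 bytes below the stack
pointer and strategy 5 lands 136 bytes below it.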

Targeted benchmarks show that the issue only manifests in tight
loops where the users of (%esp) are very close to the StoreLoad barrier.
By carefully varying the backoff between the StoreLoad barrier and the
users of (%esp), we were able to quantify where the different StoreLoad
strategies help. We use the same targeted benchmark (two prebuilt JARs
are also there):
  http://cr.openjdk.java.net/~shade/8050147/micros/VolatileBarrierBench.java
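
For readers who do not want to open the benchmark, the shape of the
measured loop is roughly "volatile write, then a tunable amount of
unrelated work". The following is a hand-rolled Java sketch of that
shape, not the linked benchmark itself: the class and method names are
hypothetical, the backoff is a simple arithmetic loop chosen for
illustration, and the timing is far cruder than a proper harness.

```java
// Illustrative sketch only: a stand-in for the measured loop, not the linked benchmark.
public class VolatileBackoffSketch {
    // A volatile store is followed by a StoreLoad barrier on x86.
    static volatile int sink;

    // Hypothetical backoff: burns work proportional to `tokens` and returns
    // a value so the JIT cannot discard the loop.
    static long backoff(int tokens, long seed) {
        long x = seed;
        for (int i = 0; i < tokens; i++) {
            x = x * 6364136223846793005L + 1442695040888963407L;
        }
        return x;
    }

    // Average nanoseconds per (volatile write + backoff) pair.
    static double nanosPerOp(int tokens, int iterations) {
        long acc = 42;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink = i;                    // volatile write, barrier follows
            acc = backoff(tokens, acc);  // work placed right after the barrier
        }
        long elapsed = System.nanoTime() - start;
        if (acc == 0) {
            System.out.println("unlikely"); // keeps acc observably live
        }
        return (double) elapsed / iterations;
    }

    public static void main(String[] args) {
        for (int tokens = 0; tokens <= 40; tokens += 5) {
            nanosPerOp(tokens, 1_000_000); // warmup pass
            System.out.printf("backoff=%d: %.2f ns/op%n",
                    tokens, nanosPerOp(tokens, 10_000_000));
        }
    }
}
```

Small backoff values keep the work close to the barrier, which is the
region where the interference is expected to show up; larger values pad
it away.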

Running this on different machine architectures yields more or less
consistent results across the strategies. The links below point to charted
data in PNG format, but the charts are also available in SVG, as well as
in raw data form, in the same folder. The graphs show how the throughput of
volatile write + backoff changes with different backoffs. Lines are
different StoreLoad strategies. Tiles correspond to the number of
measurement threads doing the loop.

 * 1x1x2 Intel Atom Z530 seems completely oblivious to the memory barrier
issue. This seems to be due to the generated code in 32-bit mode, which
completely pads away the StoreLoad barrier costs -- notice tens of
nanoseconds per write:
  http://cr.openjdk.java.net/~shade/8050147/micros/Intel-Atom-client.data.png
  http://cr.openjdk.java.net/~shade/8050147/micros/Intel-Atom-server.data.png

 * 2x16 AMD Opteron (Abu Dhabi) benefits greatly from offsetting %esp.
We can also quantify the area where the interference occurs. It spans
the backoff range [0..10], which is loosely [0..30] instructions
between the stack user and the StoreLoad barrier:
  http://cr.openjdk.java.net/~shade/8050147/micros/AMD-Abudhabi.data.png

 * 2x16 AMD Opteron (Interlagos) tests paint the same picture (it is
remarkable that mfence is consistently behind):
  http://cr.openjdk.java.net/~shade/8050147/micros/AMD-Interlagos.data.png

 * 1x4x1 Intel Core (Haswell-DT) benefits from offsetting %esp as well.
There is an interesting hump on lower backoffs with addl (%esp-8), which
seems to be a second-order microarchitectural effect. Unfortunately, we
don't have large Haswells available at the moment to dive into this:
  http://cr.openjdk.java.net/~shade/8050147/micros/Intel-Haswell.data.png

 * 1x2x2 Intel Core (Sandy Bridge), anecdotal evidence from my laptop
also shows offsetting %esp helps:

https://bugs.openjdk.java.net/browse/JDK-8050147?focusedCommentId=13521925#comment-13521925

Our large reference workloads running on reference performance servers
show no statistically significant improvements or regressions with either
mode. I think this is because a) in large workloads the padding between
stack users and StoreLoads is beefy; and b) there are not that many
StoreLoads in performance benchmarks (survivorship bias).

Selected ForkJoin-rich microbenchmarks show good improvement for all
(%esp-offset) modes.

Having this data, I propose we switch either to "lock addl (%esp-8), 0",
optimistically thinking there are no second-order effects with sharing
the cache lines:
  http://cr.openjdk.java.net/~shade/8050147/webrev.01/

...or "lock addl (%esp-CL-8), 0", pessimistically padding away from
stack users:
  http://cr.openjdk.java.net/~shade/8050147/webrev.02/

Both patches pass a full JPRT cycle.

Thoughts?

Thanks,
-Aleksey.