RFR (M) CR 8050147: StoreLoad barrier interferes with stack usages

Tue Jul 22 22:08:29 UTC 2014

On 2014-7-22, at 5:38 PM, Aleksey Shipilev <aleksey.shipilev at oracle.com> wrote:

> (the patch itself is XS, but the explanation is M)
> 
> Hi,
> 
> This is a follow-up to the issue discovered earlier. In tight
> performance-sensitive loops, the StoreLoad barrier in the form of
> "lock addl (%esp+0), 0" interferes with stack users:
>  https://bugs.openjdk.java.net/browse/JDK-8050147
> 
> I used the experimental patch:
>  https://bugs.openjdk.java.net/browse/JDK-8050149
> 
> ...to juggle different StoreLoad barrier strategies:
> 
>  1) mfence:
>       we know it is slow on almost all platforms, keep this as control

Admittedly old data, but related: https://blogs.oracle.com/dave/entry/instruction_selection_for_volatile_fences

Over time, mfence has been freighted with additional semantics.

For 64-bit mode it’d be useful to try the “xchg rThread, rThread->Self” idiom, where the Self field points to the enclosing thread structure in a self-referential fashion. I’ve seen good results from that form.

And of course if we’re willing to kill a register — a register that might be on the verge of becoming dead anyway — then replacing [ST; fence-idiom] with XCHG might be interesting.
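
At the Java source level, the two shapes look roughly like the sketch below. The class and method names are made up for illustration, and whether a particular JIT emits a store plus fence idiom for the first method, or a single XCHG for the second, is an assumption about the compiler rather than anything the Java API guarantees.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; not taken from HotSpot or from the patch.
public class StoreFenceVsExchange {
    public volatile int plain;
    public final AtomicInteger swapped = new AtomicInteger();

    // Volatile store: typically compiled on x86 as a plain store
    // followed by a StoreLoad fence idiom.
    public void storeThenFence(int v) {
        plain = v;
    }

    // Atomic exchange: returns the previous value, which the caller
    // may simply discard (the "killed register" in the text above).
    public int exchange(int v) {
        return swapped.getAndSet(v);
    }

    public static void main(String[] args) {
        StoreFenceVsExchange s = new StoreFenceVsExchange();
        s.storeThenFence(1);
        int old = s.exchange(2);
        System.out.println("plain=" + s.plain + " old=" + old
                + " swapped=" + s.swapped.get());
    }
}
```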

Regards
Dave 

> 
>  2) lock addl (%esp), 0:
>       current default
> 
>  3) lock addl (%esp-8), 0:
>       unties the data dependency against (%esp) users
> 
>  4) lock addl (%esp-CacheLine-8), 0:
>       unties the data dependency, and also steps back a cache line
>       to untie possible memory contention of (%esp) users and our
>       locked instructions
> 
>  5) lock addl (%esp-RedZone-8), 0:
>       Not sure how it interacts with our calling convention, but
>       System V ABI tells us there is a "red zone" 128 bytes below the
>       stack pointer which is always preserved by interrupt handlers,
>       etc. It seems we can use red zone for scratch data. Dave Dice
>       suggests that idempotent operations in red zone are benign
>       anyway. But in case we have a problem with the red zone, we can
>       fall back to this mode.
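
To make the offsets in strategies 2 through 5 concrete, here is a small sketch of the displacement arithmetic. It assumes a 64-byte cache line (the size is not stated above) and the 128-byte System V red zone; the class and enum names are hypothetical and do not correspond to anything in the webrevs. The printed operands use %rsp because the red zone is a 64-bit ABI notion.

```java
// Hypothetical illustration of the stack displacements; not HotSpot code.
public class BarrierOffsets {
    static final int CACHE_LINE = 64;  // assumption: typical x86 cache line size
    static final int RED_ZONE = 128;   // red zone size per the System V AMD64 ABI

    public enum Strategy {
        ADDL_SP(0),                                  // strategy 2, current default
        ADDL_SP_MINUS_8(-8),                         // strategy 3
        ADDL_SP_MINUS_CACHE_LINE(-CACHE_LINE - 8),   // strategy 4: -72
        ADDL_SP_MINUS_RED_ZONE(-RED_ZONE - 8);       // strategy 5: -136

        public final int offset;

        Strategy(int offset) {
            this.offset = offset;
        }
    }

    public static void main(String[] args) {
        // Print each strategy with the displacement it would use.
        for (Strategy s : Strategy.values()) {
            System.out.printf("%-26s lock addl $0, %d(%%rsp)%n", s, s.offset);
        }
    }
}
```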
> 
> Targeted benchmarks show that the issue only manifests in tight
> loops where the users of (%esp) are very close to the StoreLoad barrier.
> By carefully backing off between the StoreLoad barrier and the users of
> (%esp), we were able to quantify where different StoreLoad strategies
> benefit. We use the same targeted benchmark (two prebuilt JARs are also
> there):
>  http://cr.openjdk.java.net/~shade/8050147/micros/VolatileBarrierBench.java
> 
> Running this on different machine architectures yields more or less
> consistent results among the strategies. The links below point to charted
> data in PNG format, but they are also available in SVG, as well as in raw
> data form, in the same folder. The graphs show how the throughput of
> volatile write + backoff changes with different backoffs. Lines are
> different StoreLoad strategies. Tiles correspond to the number of
> measurement threads doing the loop.
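
For readers without the benchmark source at hand, the measured kernel (a volatile write followed by a tunable backoff) can be sketched roughly as follows. This is a plain-Java approximation, not the linked VolatileBarrierBench.java: the class name, the backoff loop, and the crude timing loop are all assumptions made for illustration, and numbers from it should not be compared with the charts.

```java
// Hypothetical approximation of "volatile write + backoff"; not the real benchmark.
public class VolatileBackoffSketch {
    // The volatile store is what makes the JIT emit a StoreLoad barrier.
    public volatile int value;

    // Accumulates backoff results so the work is not trivially dead.
    public long sink;

    // Backoff: a dependent arithmetic chain of `tokens` steps that
    // separates the barrier from the next loop iteration.
    public static long backoff(int tokens, long seed) {
        long t = seed;
        for (int i = 0; i < tokens; i++) {
            t = t * 0x5DEECE66DL + 0xBL;
        }
        return t;
    }

    public void writeThenBackoff(int v, int tokens) {
        value = v;                          // volatile write
        sink = backoff(tokens, sink + v);   // tunable distance to next write
    }

    public static void main(String[] args) {
        VolatileBackoffSketch s = new VolatileBackoffSketch();
        int ops = 1_000_000;
        for (int tokens = 0; tokens <= 40; tokens += 10) {
            long start = System.nanoTime();
            for (int i = 0; i < ops; i++) {
                s.writeThenBackoff(i, tokens);
            }
            long ns = System.nanoTime() - start;
            System.out.printf("backoff=%d: %.1f ns/op%n", tokens, (double) ns / ops);
        }
    }
}
```

A single-shot loop like this has no warmup or fork isolation, so it only shows the shape of the experiment.
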
> 
> * 1x1x2 Intel Atom Z530 seems completely oblivious to the memory barrier
> issue. This seems to be due to the generated code in 32-bit mode, which
> completely pads away the StoreLoad barrier costs -- notice tens of
> nanoseconds per write:
> 
> http://cr.openjdk.java.net/~shade/8050147/micros/Intel-Atom-client.data.png
> 
> http://cr.openjdk.java.net/~shade/8050147/micros/Intel-Atom-server.data.png
> 
> * 2x16 AMD Opteron (Abu Dhabi) benefits greatly from offsetting %esp.
> We can also quantify the area where the interference occurs. It spans
> the area in backoff [0..10], which is loosely [0..30] instructions
> between the stack user and StoreLoad barrier:
>  http://cr.openjdk.java.net/~shade/8050147/micros/AMD-Abudhabi.data.png
> 
> * 2x16 AMD Opteron (Interlagos) tests paint the same picture (it is
> remarkable that mfence is consistently behind):
>  http://cr.openjdk.java.net/~shade/8050147/micros/AMD-Interlagos.data.png
> 
> * 1x4x1 Intel Core (Haswell-DT) benefits from offsetting %esp as well.
> There is an interesting hump on lower backoffs with addl (%esp-8), which
> seems to be a second-order microarchitectural effect. Unfortunately, we
> don't have large Haswells available at the moment to dive into this:
>  http://cr.openjdk.java.net/~shade/8050147/micros/Intel-Haswell.data.png
> 
> * 1x2x2 Intel Core (Sandy Bridge), anecdotal evidence from my laptop
> also shows offsetting %esp helps:
> 
> https://bugs.openjdk.java.net/browse/JDK-8050147?focusedCommentId=13521925#comment-13521925
> 
> Our large reference workloads running on reference performance servers
> show no statistically significant improvements/regressions with either
> mode. I think this is because a) in large workloads the padding between
> stack users and StoreLoads is beefy; and b) there are not that many
> StoreLoads in performance benchmarks (survivorship bias).
> 
> Selected ForkJoin-rich microbenchmarks show a good improvement for all
> (%esp-offset) modes.
> 
> Having this data, I propose we switch either to "lock addl (%esp-8), 0",
> optimistically thinking there are no second-order effects with sharing
> the cache lines:
>  http://cr.openjdk.java.net/~shade/8050147/webrev.01/
> 
> ...or "lock addl (%esp-CL-8), 0", pessimistically padding away from
> stack users:
>  http://cr.openjdk.java.net/~shade/8050147/webrev.02/
> 
> Both patches pass a full JPRT cycle.
> 
> Thoughts?
> 
> Thanks,
> -Aleksey.
