RFR: 8334431: C2 SuperWord: fix performance regression due to store-to-load-forwarding failures [v5]
Quan Anh Mai
qamai at openjdk.org
Tue Nov 19 15:42:22 UTC 2024
On Tue, 19 Nov 2024 15:18:14 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> **History**
>> This issue became apparent with https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155):
>> On machines that do not support the SHA intrinsics, we execute the SHA computation in Java code. This Java code has a loop that previously did not vectorize, but now does since https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155). It turns out that this kind of loop is actually slower when vectorized - this led to a regression, reported originally as:
>> `8334431: Regression 18-20% on Mac x64 on Crypto.signverify`
>>
>> I then investigated the issue thoroughly, and discovered that it was an issue even before https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155). I wrote a [blog post](https://eme64.github.io/blog/2024/06/24/Auto-Vectorization-and-Store-to-Load-Forwarding.html) about the issue.
>>
>> **Summary of Problem**
>>
>> As described in the [blog post](https://eme64.github.io/blog/2024/06/24/Auto-Vectorization-and-Store-to-Load-Forwarding.html), vectorization can introduce store-to-load-forwarding failures that were not present in the scalar loop code. Whereas in scalar code the loads and stores were all either exactly overlapping or non-overlapping, in vectorized code they can be partially overlapping. When a store and a later load partially overlap, the stored value cannot be forwarded directly from the store buffer to the load (which would be fast), but has to go through the L1 cache first. This incurs a higher latency on the dependency edge from the store to the load.
>>
>> **Benchmark**
>>
>> I introduced a new micro-benchmark in https://github.com/openjdk/jdk/pull/19880, and now further expanded it in this PR. You can see the extensive results in [this comment below](https://github.com/openjdk/jdk/pull/21521#issuecomment-2458938698).
>>
>> The benchmarks look different on different machines, but they all have a pattern similar to this:
>> *(benchmark result plots not reproduced in this archive)*
>>
>> We see that the `scalar` loop is faster for low `offset`, and the `vectorized` loop is faster for high offsets (and power-of-2 offse...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>
> more for Christian
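
The partial-overlap effect summarized in the quoted description can be sketched roughly like this (a hypothetical illustration, not the actual benchmark code): with a small `offset`, each iteration loads a value that was stored only a few iterations earlier. In scalar code every load either exactly matches or misses a pending store, so the store buffer can forward; once vectorized, a vector load can overlap the previous vector store only partially, which defeats forwarding.

```java
// Hypothetical sketch of the store-to-load-forwarding hazard described
// above (not the actual VectorStoreToLoadForwarding benchmark).
public class StoreLoadForwardingSketch {
    // Each iteration stores data[i] and a later iteration loads it back
    // as data[i - offset]. For small offsets, a vectorized version of
    // this loop issues vector loads that partially overlap an
    // in-flight vector store, so forwarding fails and the load must
    // wait for the store to reach L1 cache.
    static int run(int[] data, int offset) {
        for (int i = offset; i < data.length; i++) {
            data[i] = data[i - offset] + 1;
        }
        return data[data.length - 1];
    }

    public static void main(String[] args) {
        int[] data = new int[64];
        // Starting from all zeros with offset 3, data[i] ends up as i / 3.
        System.out.println(run(data, 3)); // prints 21
    }
}
```

The scalar version of this loop is only slow in the sense of doing one element per iteration; the vectorized version trades that for a long store-to-load latency on every partially overlapping access, which is why small offsets favor the scalar loop.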
test/micro/org/openjdk/bench/vm/compiler/VectorStoreToLoadForwarding.java line 91:
> 89: }
> 90:
> 91: @CompilerControl(CompilerControl.Mode.DONT_INLINE)
Err, are you sure this works? I think this should be `FORCE_INLINE` instead. I see you want to have different `SIZE`s, too. Then you can make a `MutableCallSite` for each parameter. The magic here is that the compiler will treat the call target as a constant and force a recompilation each time you call `setTarget`.
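
A rough sketch of the `MutableCallSite` trick being suggested (class and method names here are invented for illustration): the JIT treats the site's current target as a compile-time constant, and calling `setTarget` deoptimizes compiled callers so that the next compilation folds in the new value.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;

public class CallSiteConstantSketch {
    // One MutableCallSite per benchmark parameter; its dynamic invoker
    // is a stable MethodHandle whose target the JIT constant-folds.
    static final MutableCallSite SIZE_SITE =
            new MutableCallSite(MethodType.methodType(int.class));
    static final MethodHandle SIZE = SIZE_SITE.dynamicInvoker();

    static void setSize(int size) {
        // Re-binding the target invalidates compiled callers; on
        // recompilation the new value is treated as a constant.
        SIZE_SITE.setTarget(MethodHandles.constant(int.class, size));
        MutableCallSite.syncAll(new MutableCallSite[] { SIZE_SITE });
    }

    static int sum() throws Throwable {
        int size = (int) SIZE.invokeExact(); // constant after compilation
        int s = 0;
        for (int i = 0; i < size; i++) s += i;
        return s;
    }

    public static void main(String[] args) throws Throwable {
        setSize(100);
        System.out.println(sum()); // prints 4950
    }
}
```

With this scheme the benchmark method itself can be force-inlined, yet each `setSize` call still triggers a fresh compilation specialized for the new parameter value.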
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848587591
More information about the hotspot-compiler-dev
mailing list