RFR: 8334431: C2 SuperWord: fix performance regression due to store-to-load-forwarding failures [v7]

Tue Nov 19 16:03:26 UTC 2024

> **History**
> This issue became apparent with https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155):
> On machines that do not support SHA intrinsics, we execute the SHA code in Java code. This Java code has a loop that previously did not vectorize, but it now does since https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155). It turns out that this kind of loop is actually slower when vectorized, which led to a regression, originally reported as:
> `8334431: Regression 18-20% on Mac x64 on Crypto.signverify`
> 
> I then investigated the issue thoroughly, and discovered that it was already an issue before https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155). I wrote a [blog post](https://eme64.github.io/blog/2024/06/24/Auto-Vectorization-and-Store-to-Load-Forwarding.html) about the issue.
> 
> **Summary of Problem**
> 
> As described in the [blog post](https://eme64.github.io/blog/2024/06/24/Auto-Vectorization-and-Store-to-Load-Forwarding.html), vectorization can introduce store-to-load-forwarding failures that were not present in the scalar loop code. In scalar code, the loads and stores were all either exactly overlapping or non-overlapping, but in vectorized code they can be partially overlapping. When a store and a later load partially overlap, the stored value cannot be forwarded directly from the store buffer to the load (which would be fast), but first has to go to the L1 cache. This incurs a higher latency on the dependency edge from the store to the load.
> 
> **Benchmark**
> 
> I introduced a new micro-benchmark in https://github.com/openjdk/jdk/pull/19880, and now further expanded it in this PR. You can see the extensive results in [this comment below](https://github.com/openjdk/jdk/pull/21521#issuecomment-2458938698).
> 
> The benchmark results look different on different machines, but they all have a pattern similar to this:
> ![image](https://github.com/user-attachments/assets/3366f7fa-af44-44d4-a476-8cd0466fe937)
> ![image](https://github.com/user-attachments/assets/1c1408c2-053e-4a8a-ad46-32b75b836161)
> ![image](https://github.com/user-attachments/assets/d392c8cf-fb62-4593-93c7-a0d85ad5885e)
> ![image](https://github.com/user-attachments/assets/3a79601f-4015-4f71-a510-cab7d7b59ed8)
> 
> We see that the `scalar` loop is faster for low offsets, and the `vectorized` loop is faster for high offsets (and power-of-2 offsets).
> 
> The reason is that for low offsets, th...
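
To make the partial-overlap case from the summary above concrete, here is a minimal sketch in Java. It is not the benchmark from the PR: the loop shape `a[i + offset] = a[i] + 1`, the class `OverlapSketch`, and the helpers `classify` and `worstCase` are hypothetical names chosen for illustration. The sketch only models which elements a store and a later load touch. It does not model the store buffer, timing, or what C2 actually emits, and it assumes `offset >= lanes` so that the vector form respects the loop's dependence.

```java
// Hypothetical illustration, not code from the PR or from C2.
class OverlapSketch {
    enum Overlap { NONE, EXACT, PARTIAL }

    // Compare a store covering elements [storeStart, storeStart + lanes)
    // with a later load covering [loadStart, loadStart + lanes).
    static Overlap classify(int storeStart, int loadStart, int lanes) {
        if (storeStart == loadStart) {
            return Overlap.EXACT;   // same range: forwarding can succeed
        }
        if (Math.abs(storeStart - loadStart) < lanes) {
            return Overlap.PARTIAL; // ranges intersect but differ: forwarding fails
        }
        return Overlap.NONE;        // disjoint ranges: no dependency on this store
    }

    // Assumed loop shape: a[i + offset] = a[i] + 1, processed `lanes`
    // elements at a time (lanes == 1 models the scalar loop).
    // The first store starts at element `offset`. Later loads start at
    // lanes, 2 * lanes, ... and we report the worst overlap with that store.
    static Overlap worstCase(int offset, int lanes) {
        Overlap worst = Overlap.NONE;
        for (int load = lanes; load < offset + lanes; load += lanes) {
            Overlap o = classify(offset, load, lanes);
            if (o == Overlap.PARTIAL) {
                return Overlap.PARTIAL;
            }
            if (o == Overlap.EXACT) {
                worst = Overlap.EXACT;
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        System.out.println("scalar,  offset 6: " + worstCase(6, 1));
        System.out.println("4 lanes, offset 6: " + worstCase(6, 4));
        System.out.println("4 lanes, offset 8: " + worstCase(8, 4));
    }
}
```

Under these assumptions, the scalar loop only ever sees exact overlaps, a 4-lane loop with `offset = 6` hits a partial overlap, and a 4-lane loop with `offset = 8` (a multiple of the lane count) is back to exact overlaps.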

Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:

  more examples for Christian

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/21521/files
  - new: https://git.openjdk.org/jdk/pull/21521/files/2d98fd1c..7a8f365e

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=21521&range=06
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21521&range=05-06

  Stats: 50 lines in 1 file changed: 43 ins; 0 del; 7 mod
  Patch: https://git.openjdk.org/jdk/pull/21521.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/21521/head:pull/21521

PR: https://git.openjdk.org/jdk/pull/21521