RFR: 8334431: C2 SuperWord: fix performance regression due to store-to-load-forwarding failures

Thu Nov 7 06:56:15 UTC 2024

**History**
This issue became apparent with https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155):
On machines that do not support sha intrinsics, we execute the sha code in java code. This java code has a loop that previously did not vectorize, but it now does since https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155). It turns out that that kind of loop is actually slower when vectorized - this led to a regression, reported originally as:
`8334431: Regression 18-20% on Mac x64 on Crypto.signverify`

I then investigated the issue thoroughly, and discovered that it was even an issue before https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155). I wrote a [blog-post ](https://eme64.github.io/blog/2024/06/24/Auto-Vectorization-and-Store-to-Load-Forwarding.html) about the issue.

**Summary of Problem**

As described in the [blog-post ](https://eme64.github.io/blog/2024/06/24/Auto-Vectorization-and-Store-to-Load-Forwarding.html), vectorization can introduce store-to-load failures that were not present in the scalar loop code. Where in scalar code, the loads and stores were all exactly overlapping or non-overlapping, in vectorized code they can now be partially overlapping. When a store and a later load are partially overlapping, the store value cannot be directly forwarded from the store-buffer to the load (would be fast), but has to first go to L1 cache. This incurs a higher latency on the dependency edge from the store to the load.

**Benchmark**

I introduced a new micro-benchmark in https://github.com/openjdk/jdk/pull/19880, and now further expanded it in this PR. You can see the extensive results in [this comment below](https://github.com/openjdk/jdk/pull/21521#issuecomment-2458938698).

The benchmarks look different on different machines, but they all have a pattern similar to this:
![image](https://github.com/user-attachments/assets/3366f7fa-af44-44d4-a476-8cd0466fe937)
![image](https://github.com/user-attachments/assets/1c1408c2-053e-4a8a-ad46-32b75b836161)
![image](https://github.com/user-attachments/assets/d392c8cf-fb62-4593-93c7-a0d85ad5885e)
![image](https://github.com/user-attachments/assets/3a79601f-4015-4f71-a510-cab7d7b59ed8)

We see that the `scalar` loop is faster for low `offset`, and the `vectorized` loop is faster for high offsets (and power-of-w offsets).

The reason is that for low offsets, the latency dominates the runtime, and for high offsets the throughput dominates.
If there are store-to-load-failures from every iteration `i` -> `i+offset`, and we have a total of `n` iterations, then we have a chain of `n/offset` latencies. Hence, as the `offset` increases, this latency chain becomes smaller and smaller. As an example: `offset = 3`, the 3rd iteration depends on the 0th, the 6th on the 3rd, the 9th on the 6th, the 12th on the 9th ... all the way to the nth iteration.

**Current Solution: a new heuristic**

Any heuristic is going to be somewhat inaccurate, but we now want to fix this issue in JDK24, and so I'd rather have a quick solution that works most of the time, rather than a sophisticated solution that works almost always. The sophisticated solution would carefully compute the expected latency and throughput for both the scalar and vectorized loop, and pick the faster one. I hope to experiment with that in the future.

For now, we just implement a "hard cutoff": if we predict that there will be ANY store-to-load-forwarding failure within some `N` iterations, then we bailout of vectoirzation. This `N` can be configured with the new diagnostic flag `SuperWordStoreToLoadForwardingFailureDetection`. The benchmarks indicated that `x64` machines should have a value of `16`, and `aarch64 asimd/neon` machines a value of `8`. I do not know what the value should be on other machines ... I just guessed it to be `16`, but **platform maintainers are welcome to adjust this value** - my benchmarks may be a helpful guide. 

Note: we only detect store-to-load-forwarding failures when the loads and stores are known at compile time to go to the same memory object.

**Should someone experience performance regressions doe to this fix**: you can disable the detection by setting the diagnostic flag: `-XX:+UnlockDiagnosticVMOptions -XX:SuperWordStoreToLoadForwardingFailureDetection=0`. Maybe you just need to lower it from the default. Increasing it is probably not going to help - but why not try anyway.

**Tests**

I had to adapt some tests. Primarily `TestDependencyOffsets.java`, which I just refactored in  https://github.com/openjdk/jdk/pull/21541 to make these changes here easier.

**Performance Testing**

[I ran my benchmark on 7 machines](https://github.com/openjdk/jdk/pull/21521#issuecomment-2458938698), and the new heuristic seems to perform very well.

I also ran extensive performance testing, and I did not see any significant change.

This was the originally reported regression (MacOSX x64 - SPECjvm2008-Crypto.signverify-G1: specjvm2008):
![image](https://github.com/user-attachments/assets/ed68344a-c4aa-47b7-96a1-60c91faee503)
(drop from `promo-24-b1` to `promo-24-b2`)

And that seems to be fixed now:
![image](https://github.com/user-attachments/assets/394f7a44-5fb5-4217-bf0e-f5a585268f2a)

-------------

Commit messages:
 - Merge branch 'master' into JDK-8334431-V-store-to-load-forwarding
 - Merge branch 'master' into JDK-8334431-V-store-to-load-forwarding
 - fix whitespace
 - fix tests and build
 - fix store-to-load forward IR rules
 - updates before the weekend ... who knows if they are any good
 - refactor to iteration threshold
 - use jvmArgs again, and apply same fix as 8343345
 - revert to jvmArgsPrepend
 - manual merge
 - ... and 14 more: https://git.openjdk.org/jdk/compare/06d8216a...9b2efe1a

Changes: https://git.openjdk.org/jdk/pull/21521/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21521&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8334431
  Stats: 4386 lines in 17 files changed: 4324 ins; 4 del; 58 mod
  Patch: https://git.openjdk.org/jdk/pull/21521.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/21521/head:pull/21521

PR: https://git.openjdk.org/jdk/pull/21521