RFR: 8334431: C2 SuperWord: fix performance regression due to store-to-load-forwarding failures [v3]

Tue Nov 19 14:54:55 UTC 2024

On Tue, 19 Nov 2024 13:58:14 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> **History**
>> This issue became apparent with https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155):
>> On machines that do not support sha intrinsics, we execute the sha code in java code. This java code has a loop that previously did not vectorize, but it now does since https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155). It turns out that that kind of loop is actually slower when vectorized - this led to a regression, reported originally as:
>> `8334431: Regression 18-20% on Mac x64 on Crypto.signverify`
>> 
>> I then investigated the issue thoroughly, and discovered that it was even an issue before https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155). I wrote a [blog-post ](https://eme64.github.io/blog/2024/06/24/Auto-Vectorization-and-Store-to-Load-Forwarding.html) about the issue.
>> 
>> **Summary of Problem**
>> 
>> As described in the [blog-post ](https://eme64.github.io/blog/2024/06/24/Auto-Vectorization-and-Store-to-Load-Forwarding.html), vectorization can introduce store-to-load failures that were not present in the scalar loop code. Where in scalar code, the loads and stores were all exactly overlapping or non-overlapping, in vectorized code they can now be partially overlapping. When a store and a later load are partially overlapping, the store value cannot be directly forwarded from the store-buffer to the load (would be fast), but has to first go to L1 cache. This incurs a higher latency on the dependency edge from the store to the load.
>> 
>> **Benchmark**
>> 
>> I introduced a new micro-benchmark in https://github.com/openjdk/jdk/pull/19880, and now further expanded it in this PR. You can see the extensive results in [this comment below](https://github.com/openjdk/jdk/pull/21521#issuecomment-2458938698).
>> 
>> The benchmarks look different on different machines, but they all have a pattern similar to this:
>> ![image](https://github.com/user-attachments/assets/3366f7fa-af44-44d4-a476-8cd0466fe937)
>> ![image](https://github.com/user-attachments/assets/1c1408c2-053e-4a8a-ad46-32b75b836161)
>> ![image](https://github.com/user-attachments/assets/d392c8cf-fb62-4593-93c7-a0d85ad5885e)
>> ![image](https://github.com/user-attachments/assets/3a79601f-4015-4f71-a510-cab7d7b59ed8)
>> 
>> We see that the `scalar` loop is faster for low `offset`, and the `vectorized` loop is faster for high offsets (and power-of-w offse...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   use constant method handle to make the benchmark smaller

Nice blog post, benchmarks, summary and explanation of the problem! The new heuristic looks reasonable and safer/simpler to use for JDK 24. Would be interesting to see if you could come up with a throughput and latency based heuristic/cost model at some point in the future.

I have some comments, mostly minor things.

src/hotspot/share/opto/vtransform.cpp line 154:

> 152: // Helper-class for VTransformGraph::has_store_to_load_forwarding_failure.
> 153: // It represents a memory region: [ptr, ptr + memory_size)
> 154: class VPointerRecord : public StackObj {

Not so sure about the name of this class. When first reading the code below (which I reviewed first), I found it difficult to understand its purpose without reading this class comment. How about `VMemoryRegion` or something like that to better show that it's about regions of memory and not single pointers?

test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java line 28:

> 26:  * @test
> 27:  * @bug 8298935
> 28:  * @summary Writing forward on array creates cyclic dependency

You could probably add the bug number of this bug to the `@bug` in the line above this one.

test/hotspot/jtreg/compiler/loopopts/superword/TestCyclicDependency.java line 36:

> 34:  *                   TestCyclicDependency
> 35:  * @run main/othervm -XX:+IgnoreUnrecognizedVMOptions -XX:+AlignVector -XX:+VerifyAlignVector
> 36:  *                   TestCyclicDependency

These flags will be ignored for the test VM. You should pass them as separate runs in `main()` with `TestFramework.runWithFlags()`.

test/micro/org/openjdk/bench/vm/compiler/VectorStoreToLoadForwarding.java line 137:

> 135:     })
> 136:     public static class NoVectorization extends VectorStoreToLoadForwarding {}
> 137: 

Suggestion:

-------------

Changes requested by chagedorn (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/21521#pullrequestreview-2445346045
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848489712
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848509941
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848508701
PR Review Comment: https://git.openjdk.org/jdk/pull/21521#discussion_r1848499954