RFR: 8334431: C2 SuperWord: fix performance regression due to store-to-load-forwarding failures

Emanuel Peter epeter at openjdk.org
Thu Nov 7 06:56:15 UTC 2024


On Tue, 15 Oct 2024 11:33:04 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> **History**
> This issue became apparent with https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155):
> On machines that do not support SHA intrinsics, we execute the SHA code in Java. This Java code has a loop that previously did not vectorize, but now does since https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155). It turns out that this kind of loop is actually slower when vectorized, which led to a regression, originally reported as:
> `8334431: Regression 18-20% on Mac x64 on Crypto.signverify`
> 
> I then investigated the issue thoroughly and discovered that it existed even before https://github.com/openjdk/jdk/pull/21521 / [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155). I wrote a [blog post](https://eme64.github.io/blog/2024/06/24/Auto-Vectorization-and-Store-to-Load-Forwarding.html) about the issue.
> 
> **Summary of Problem**
> 
> As described in the [blog post](https://eme64.github.io/blog/2024/06/24/Auto-Vectorization-and-Store-to-Load-Forwarding.html), vectorization can introduce store-to-load-forwarding failures that were not present in the scalar loop code. Where in the scalar code all loads and stores were either exactly overlapping or non-overlapping, in the vectorized code they can now be partially overlapping. When a store and a later load partially overlap, the stored value cannot be forwarded directly from the store buffer to the load (which would be fast), but has to first go through the L1 cache. This incurs a higher latency on the dependency edge from the store to the load.
> 
> **Benchmark**
> 
> I introduced a new micro-benchmark in https://github.com/openjdk/jdk/pull/19880, and now further expanded it in this PR. You can see the extensive results in [this comment below](https://github.com/openjdk/jdk/pull/21521#issuecomment-2458938698).
> 
> The benchmarks look different on different machines, but they all have a pattern similar to this:
> ![image](https://github.com/user-attachments/assets/3366f7fa-af44-44d4-a476-8cd0466fe937)
> ![image](https://github.com/user-attachments/assets/1c1408c2-053e-4a8a-ad46-32b75b836161)
> ![image](https://github.com/user-attachments/assets/d392c8cf-fb62-4593-93c7-a0d85ad5885e)
> ![image](https://github.com/user-attachments/assets/3a79601f-4015-4f71-a510-cab7d7b59ed8)
> 
> We see that the `scalar` loop is faster for low `offset`, and the `vectorized` loop is faster for high offsets (and power-of-2 offsets).
> 
> The reason is that for low offsets, th...

I'm experimenting now. I have taken my benchmark from https://github.com/openjdk/jdk/pull/19880 and extended it a little. Here are the full results in the [PDF](https://github.com/user-attachments/files/17518027/table2.pdf). And here are some charts:

![image](https://github.com/user-attachments/assets/3366f7fa-af44-44d4-a476-8cd0466fe937)
![image](https://github.com/user-attachments/assets/1c1408c2-053e-4a8a-ad46-32b75b836161)
![image](https://github.com/user-attachments/assets/d392c8cf-fb62-4593-93c7-a0d85ad5885e)
![image](https://github.com/user-attachments/assets/3a79601f-4015-4f71-a510-cab7d7b59ed8)

I ran this on my avx512 machine, so results may vary on other platforms - especially with different vector lengths and different store-to-load-forwarding mechanisms. But on my machine, it is pretty clear that the cut-off is at about an offset of 32.

I explain it like this: with an offset smaller than 32, latency is the main issue: the store-to-load-forwarding failures incur a higher latency on that store-load "edge", and that shows in the final runtime. But if the offset is larger than 32, then we are limited by throughput: we can only execute so many scalar ops per cycle. If we turn them into vector ops, we have fewer ops, and so we are faster.
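To make the dependence concrete, here is a minimal sketch in Java, assuming the benchmark kernel has roughly this shape (the real code is the micro-benchmark added in https://github.com/openjdk/jdk/pull/19880; class and method names here are mine):

```java
// A minimal sketch, assuming the benchmark kernel has roughly this shape.
public class StoreToLoadForwardingSketch {
    static int[] a = new int[16 * 1024];

    static void kernel(int offset) {
        for (int i = 0; i < a.length - offset; i++) {
            // Each iteration loads the value that a store wrote `offset` iterations earlier.
            a[i + offset] = a[i] + 1;
        }
    }
    // Scalar: each 4-byte load a[i] reads exactly the 4 bytes that an earlier store
    // wrote, so the store buffer can forward the value directly.
    // Vectorized (e.g. 8 int lanes): a vector load of a[i..i+7] matches exactly one
    // earlier vector store only if `offset` is a multiple of 8; otherwise it straddles
    // two earlier vector stores (partial overlap), forwarding fails, and the load has
    // to wait for the stores to reach the L1 cache.
}
```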

Of course, optimally we would have some sort of cost model that takes into account both latency and throughput. But that is for Future Work. For now, we need some cut-off heuristic. And it looks like - at least for avx512 - the heuristic is that we must check if there is any store-to-load-forwarding failure within 32 (virtually unrolled?) iterations. Of course this will not be fully accurate - hand-unrolling and a number of other factors can confuse this heuristic.
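To illustrate what such a cut-off check could look like, here is a minimal sketch in Java (my own illustration under a simplified address model, not the actual C2 code): fix a store in one iteration and ask whether any load in the following `threshold` iterations overlaps it only partially. It uses the common hardware model that forwarding can still succeed when the load is fully covered by the store.

```java
// Illustrative sketch of the cut-off heuristic, not the actual C2 implementation.
class ForwardingCheck {
    /**
     * Fix a store in iteration 0 and slide the load over the next `threshold`
     * iterations. Report a likely store-to-load-forwarding failure if some load
     * overlaps the store without being fully covered by it.
     * Offsets and sizes are in bytes; `iterStride` is bytes advanced per iteration.
     */
    static boolean failureWithinThreshold(long storeOff, int storeSize,
                                          long loadOff, int loadSize,
                                          int iterStride, int threshold) {
        long storeEnd = storeOff + storeSize;
        for (int iter = 0; iter < threshold; iter++) {
            long lo = loadOff + (long) iter * iterStride;
            long hi = lo + loadSize;
            boolean overlap = lo < storeEnd && storeOff < hi;
            boolean covered = lo >= storeOff && hi <= storeEnd; // load fully inside store: forwarding can work
            if (overlap && !covered) {
                return true; // partial overlap: the store must drain to L1 before the load
            }
        }
        return false;
    }
}
```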

Now I also ran it on an ASIMD aarch64 machine. Here is the [PDF](https://github.com/user-attachments/files/17522538/table_aarch64.pdf).

![image](https://github.com/user-attachments/assets/ccaa73b0-1659-4ead-873c-39cc7c9b4e53)
![image](https://github.com/user-attachments/assets/a662e555-66f2-43ac-9d43-66e2a739a108)
![image](https://github.com/user-attachments/assets/d8dd329e-19e7-4e3b-8e25-b8d69aacc2ac)
![image](https://github.com/user-attachments/assets/9cee426b-46a3-4359-a547-d20d8d03dbce)

A few observations.
- The short benchmark is a bit noisy. Maybe someone else got onto that machine while I was running the benchmark. But everything else looks quite clean and nice, so I won't re-run the benchmark now.
- In all 4 plots, we see a similar pattern: for offsets smaller than 8 there are a few instances where vectorization is slower, but for higher offsets vectorization seems to always pay off.
- We see a similar "stepping" pattern with `byte`, a little with `short`, and not much at all with `int` and `long`.
- The vector size on that machine is only 16 bytes, so the throughput advantage on the `long` benchmark can only be a factor of 2x, with `int` 4x, with `short` 8x and with `byte` 16x. That seems to roughly hold true at high offsets.

If I had to come up with a hypothesis, I would say that the cut-off is at `X` iterations, where `X = MaxVectorSize / 2`. With 64-byte vectors on the avx512 machine that gives `X = 32`, and with the 16-byte vectors on this aarch64 machine `X = 8`, which matches the cut-offs observed above. I need to confirm that with different machines, maybe AVX2.

And now I also ran it on an `AVX2` machine (Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz). Here is the [PDF](https://github.com/user-attachments/files/17586850/ghost11_avx2_table_v2.pdf).

![image](https://github.com/user-attachments/assets/bacb12ca-dbb3-4ae7-abab-f0690a93f6b1)
![image](https://github.com/user-attachments/assets/f1e321d6-4589-4c4a-9640-755146fe0961)
![image](https://github.com/user-attachments/assets/73ef7756-821c-4142-8ad3-faa6121b78a5)
![image](https://github.com/user-attachments/assets/4defda17-4408-4750-be50-0b3f31f69e44)

It looks like the cut-off is consistently at 16 iterations, though 32 iterations would be fine as well. That fits the hypothesis above: AVX2 vectors are 32 bytes, so `X = MaxVectorSize / 2 = 16`.

**Some Thoughts**

Every piece of hardware behaves differently. It depends on latency and throughput. The latency depends on the L1 cache latency, especially for the store-to-load-forwarding failures. And the throughput depends on the vector length and on the number of ports that can execute the instructions. This is quite complex, and would require fine-tuning.

For now, I will just set a hard limit, which is going to be inaccurate, but probably still better on average than doing nothing.

Now I'm considering how to set the iteration threshold. The benchmark here is a very simple case, and it is (to my understanding) maximally sensitive to store-to-load-forwarding failure latency: we only perform a load, an add and a store. If the loop contained more instructions that could be parallelized, then we would be limited by throughput more quickly. Hence, vectorization would be profitable earlier, i.e. for lower iteration thresholds. I would therefore wager that it is better to err on the lower side, and set the iteration threshold lower than the `StoreToLoadForwarding` benchmark indicates.
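To illustrate that throughput argument, here is a hypothetical kernel with more independent work per iteration (my own example, not part of the `StoreToLoadForwarding` benchmark):

```java
// Hypothetical kernel with more independent work per iteration (not part of the
// StoreToLoadForwarding benchmark).
class MoreWorkPerIteration {
    static void kernel(int[] a, int[] b, int offset) {
        for (int i = 0; i < a.length - offset; i++) {
            a[i + offset] = a[i] + 1;          // latency-critical store-to-load chain
            b[i] = (b[i] * 31) ^ (b[i] >>> 7); // independent work that can run in parallel
        }
    }
    // The extra work on b[] does not lengthen the a[i] -> a[i + offset] dependency
    // chain, so the forwarding-failure latency can be hidden behind it; the loop
    // becomes throughput-bound sooner and vectorization pays off at lower offsets.
}
```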

Thus, I will set the iteration threshold at `16` for `x86` (and by default for all platforms), and at `8` for `aarch64`.

Of course, in the current implementation the iteration threshold is only a lower bound: at the point where we vectorize, we cannot avoid the unrolled loop containing more iterations than the threshold. If the loop contains more iterations, the threshold is effectively higher.

I've run benchmarks on 7 machines now. Here is my [micro.ods](https://github.com/user-attachments/files/17643625/micro.ods).


- `scalar`: SuperWord disabled
- `no_detect`: SuperWord without detection of store-to-load-forwarding failures (old behaviour, before this patch)
- `default`: new default SuperWord behaviour (detect store-to-load-forwarding failures for small offsets and disable vectorization if detected)
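For reference, a sketch of how the three configurations might be run via the JMH API. `-XX:-UseSuperWord` is a standard HotSpot flag; the detection flag name below is my assumption of what this patch exposes and may differ:

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class RunStoreToLoadForwardingConfigs {
    public static void main(String[] args) throws RunnerException {
        String[][] configs = {
            {"-XX:-UseSuperWord"},                                        // "scalar"
            {"-XX:SuperWordStoreToLoadForwardingFailureDetection=0"},     // "no_detect" (assumed flag name)
            {}                                                            // "default": no extra flags
        };
        for (String[] jvmArgs : configs) {
            Options opt = new OptionsBuilder()
                    .include("StoreToLoadForwarding") // benchmark from JDK-8335006 / PR 19880
                    .jvmArgsAppend(jvmArgs)
                    .build();
            new Runner(opt).run();
        }
    }
}
```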


**Conclusion**
For this benchmark, it seems the new behaviour (`default`) very accurately chooses the better of `scalar` and `no_detect`. The only exception is on the `windows x64` machine: for ints and longs in the offset range 17-31, `default` decides to vectorize (same as `no_detect`), where the `scalar` option would have been a little faster. But no heuristic is perfect, and as said above: this benchmark is maximally sensitive to latency, and if the amount of work per iteration were increased, we should expect the balance to tip towards vectorization being preferable.

x64 AVX2 machine
![image](https://github.com/user-attachments/assets/c234f922-aab9-4c1d-bade-9d9ce1363372)

OCI aarch64 (asimd / neon)
![image](https://github.com/user-attachments/assets/cd1e6bf2-a5c5-42d7-9690-c13a0474313b)

Linux aarch64 (asimd / neon)
![image](https://github.com/user-attachments/assets/488e0c12-9f2a-4601-8f06-d976531b9d7e)

Linux x64
![image](https://github.com/user-attachments/assets/24d04e84-7206-4ce6-9576-41c3090ba326)

MacOSX aarch64 (asimd / neon)
![image](https://github.com/user-attachments/assets/c33fa869-3a06-4fff-af9c-fcb794e2f1a1)

MacOSX x64
![image](https://github.com/user-attachments/assets/93f16e92-f2e6-44b9-a0f4-2223ef7c619a)

Windows x64
![image](https://github.com/user-attachments/assets/79221029-9733-4488-ab27-654674b44c03)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21521#issuecomment-2437135550
PR Comment: https://git.openjdk.org/jdk/pull/21521#issuecomment-2437764165
PR Comment: https://git.openjdk.org/jdk/pull/21521#issuecomment-2449602737
PR Comment: https://git.openjdk.org/jdk/pull/21521#issuecomment-2449611215
PR Comment: https://git.openjdk.org/jdk/pull/21521#issuecomment-2451423352
PR Comment: https://git.openjdk.org/jdk/pull/21521#issuecomment-2458938698

