RFR: 8355094: Performance drop in auto-vectorized kernel due to split store

Thu May 15 08:29:52 UTC 2025

On Tue, 6 May 2025 13:21:30 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> **Summary**
> 
> Before [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155) / https://github.com/openjdk/jdk/pull/18822, we used to prefer aligning to stores. But in that change, I removed that preference, and since then we have been aligning to loads instead (there is no preference, but since loads usually come before stores in the loop body, the load gets picked). This led to a performance regression, especially on `x64`.
> 
> On `x64` in particular, it is more important to align stores than to align loads. This is because memory operations that cross a cacheline boundary are split. And `x64` CPUs generally have more throughput for loads than for stores, so splitting a store is worse than splitting a load.
> 
> On `aarch64`, the results are less clear. On two machines, the differences were marginal, but surprisingly aligning to loads was marginally faster. On another machine, aligning to stores was significantly faster. I suspect performance depends on the exact `aarch64` implementation. I'm not an `aarch64` specialist, and only have access to a limited number of machines.
> 
> **Fix**: make automatic alignment configurable with `SuperWordAutomaticAlignment` (no alignment, align to store, align to load). Default is align to store.
> 
> For now, I will just align to stores on all platforms. If someone has various `aarch64` machines, they are welcome to do deeper investigations. Same for other platforms. We could always turn the flag into a platform-dependent one, and set different defaults depending on the exact CPU.
> 
> If you are interested, you can read my investigations/benchmark results below. There are a lot of colorful plots 📈 😊
> 
> **FYI about Vector API:** if you are working with the Vector API, you may also want to worry about **alignment**, because there can be a **significant performance impact** (30%+ in some cases). You may also want to know about **4k aliasing**, discussed below.
> 
> **Shoutout:**
> - @jatin-bhateja filed the regression, and explained that it was about split stores.
> - @mhaessig helped me talk through some of the early benchmarks.
> - @iwanowww pointed me to the 4k aliasing explanation.
> 
> --------------------
> 
> **Introduction**
> 
> I had long lived with the **theory that on modern CPUs, misalignment has no consequence, especially no performance impact**. When you google, many sources say that misalignment used to be an issue on older CPUs, but not any more.
> 
> That may **technically** be true:
> - A misaligned load or store that does not cross a cacheline b...

Thank you for the deep investigation, the excellent report, and most of all the colorful plots!

I found a typo, but otherwise the hotspot changes look good to me. I cannot review the benchmarks, unfortunately.

src/hotspot/share/opto/superword.cpp line 2676:

> 2674:     // it is worse if a store is split, and less bad if a load is split.
> 2675:     // By default, we have SuperWordAutomaticAlignment=1, i.e. we align with a
> 2676:     // load if possible, to avoid splitting that load.

Suggestion:

    // By default, we have SuperWordAutomaticAlignment=1, i.e. we align with a
    // store if possible, to avoid splitting that store.

The current comment conflicts with what the documentation in `c2_globals.hpp` says.
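
For readers following along, here is a minimal sketch of the split-access reasoning behind that comment. It is not code from the PR: the class and method names are hypothetical, and the 64-byte cacheline size is an assumption that happens to be typical on current `x64` CPUs. It only counts how many consecutive vector accesses cross a cacheline boundary for a given starting offset.

```java
final class SplitAccess {
    // Assumption: 64-byte cachelines (typical on x64, not taken from the PR).
    static final int CACHE_LINE_BYTES = 64;

    // True if an access of widthBytes starting at address touches two cachelines.
    public static boolean crossesCacheLine(long address, int widthBytes) {
        long offsetInLine = Math.floorMod(address, (long) CACHE_LINE_BYTES);
        return offsetInLine + widthBytes > CACHE_LINE_BYTES;
    }

    // Counts split accesses among `count` back-to-back vector accesses,
    // as a vectorized loop would issue them for one array stream.
    public static int countSplits(long base, int widthBytes, int count) {
        int splits = 0;
        for (int i = 0; i < count; i++) {
            if (crossesCacheLine(base + (long) i * widthBytes, widthBytes)) {
                splits++;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // 32-byte vectors: aligned base vs. a base that is 16 bytes off.
        System.out.println("aligned:    " + countSplits(0, 32, 8));
        System.out.println("misaligned: " + countSplits(16, 32, 8));
        // 64-byte vectors at a misaligned base: every access is split.
        System.out.println("64B vector: " + countSplits(16, 64, 8));
    }
}
```

In this model, a misaligned stream of 32-byte accesses splits on every second access, and a misaligned stream of 64-byte accesses splits on every access. Since only one stream can be aligned in general, the comment's point is that the stream chosen should be the store.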

-------------

PR Review: https://git.openjdk.org/jdk/pull/25065#pullrequestreview-2842727472
PR Review Comment: https://git.openjdk.org/jdk/pull/25065#discussion_r2090573854