RFR: 8355094: Performance drop in auto-vectorized kernel due to split store [v2]

Thu May 15 21:27:52 UTC 2025

On Thu, 15 May 2025 09:21:34 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> **Summary**
>> 
>> Before [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155) / https://github.com/openjdk/jdk/pull/18822, we used to prefer aligning to stores. But in that change, I removed that preference, and since then we have been aligning to loads instead (there is no preference, but since loads usually come before stores in the loop body, the load gets picked). This led to a performance regression, especially on `x64`.
>> 
>> Especially on `x64`, it is more important to align stores than to align loads. This is because memory operations that cross a cacheline boundary are split. And `x64` CPUs generally have more throughput for loads than for stores, so splitting a store is worse than splitting a load.
>> 
>> On `aarch64`, the results are less clear. On two machines, the differences were marginal, but surprisingly aligning to loads was marginally faster. On another machine, aligning to stores was significantly faster. I suspect performance depends on the exact `aarch64` implementation. I'm not an `aarch64` specialist, and only have access to a limited number of machines.
>> 
>> **Fix**: make automatic alignment configurable with `SuperWordAutomaticAlignment` (no alignment, align to store, align to load). Default is align to store.
>> 
>> For now, I will just align to stores on all platforms. If someone has various `aarch64` machines, they are welcome to do deeper investigations. Same for other platforms. We could always turn the flag into a platform-dependent one, and set different defaults depending on the exact CPU.
>> 
>> If you are interested, you can read my investigations/benchmark results below. There are a lot of colorful plots 📈 😊
>> 
>> **FYI about Vector API:** if you are working with the Vector API, you may also want to worry about **alignment**, because there can be a **significant performance impact** (30%+ in some cases). You may also want to know about **4k aliasing**, discussed below.
>> 
>> **Shoutout:**
>> - @jatin-bhateja filed the regression, and explained that it was about split stores.
>> - @mhaessig helped me talk through some of the early benchmarks.
>> - @iwanowww pointed me to the 4k aliasing explanation.
>> 
>> --------------------
>> 
>> **Introduction**
>> 
>> I had long lived with the **theory that on modern CPUs, misalignment has no consequence, especially no performance impact**. When you google, many sources say that misalignment used to be an issue on older CPUs, but not any more.
>> 
>> That may **technically** be true:
>> - A ...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Update src/hotspot/share/opto/superword.cpp
>   
>   Co-authored-by: Manuel Hässig <manuel at haessig.org>

Impressive analysis, Emanuel! Very deep, thorough, and insightful.

Looks good.

Speaking of the Vector API, we experimented with getting access alignment under control. Unfortunately, when it comes to on-heap accesses, it boils down to support for hyper-aligned objects, which is not there yet.
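
To illustrate why this matters, here is a minimal sketch that counts how many consecutive vector accesses would cross a cacheline boundary for a given base address. It is plain Java arithmetic, not Vector API code. The class and method names are hypothetical, and the 64-byte cacheline and the 32-byte and 64-byte vector widths are assumptions. With only the default 8-byte object alignment, the base address of on-heap array data can land on any multiple of 8 within a cacheline:

```java
// Hypothetical helper, for illustration only: not part of the JDK or of this PR.
final class SplitAccessSketch {
    // Counts how many of `iterations` back-to-back vector accesses, starting at
    // `baseAddress` and advancing by `vectorBytes` each time, touch two cachelines.
    // Assumes non-negative addresses and vectorBytes <= cacheLineBytes.
    static int countSplitAccesses(long baseAddress, int vectorBytes,
                                  int cacheLineBytes, int iterations) {
        int splits = 0;
        for (int i = 0; i < iterations; i++) {
            long first = baseAddress + (long) i * vectorBytes; // first byte accessed
            long last = first + vectorBytes - 1;               // last byte accessed
            if (first / cacheLineBytes != last / cacheLineBytes) {
                splits++; // first and last byte live in different cachelines
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed sizes: 64-byte cachelines, 32-byte and 64-byte vectors.
        System.out.println(countSplitAccesses(0, 32, 64, 8));  // 32-byte aligned base
        System.out.println(countSplitAccesses(16, 32, 64, 8)); // only 16-byte aligned
        System.out.println(countSplitAccesses(8, 64, 64, 8));  // only 8-byte aligned
    }
}
```

Under these assumptions, 32-byte vectors split on every second access unless the base is at least 32-byte aligned, and 64-byte vectors split on every access unless the base is 64-byte aligned.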

PS: yay, you found a way to turn PRs into blog posts! :-)

-------------

Marked as reviewed by vlivanov (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/25065#pullrequestreview-2845004209