RFR: 8355094: Performance drop in auto-vectorized kernel due to split store
Emanuel Peter
epeter at openjdk.org
Thu May 15 07:45:45 UTC 2025
**Summary**
Before [JDK-8325155](https://bugs.openjdk.org/browse/JDK-8325155) / https://github.com/openjdk/jdk/pull/18822, we used to prefer aligning to stores. But in that change, I removed that preference, and since then we have been aligning to loads instead (there is no explicit preference, but since loads usually come before stores in the loop body, the load gets picked). This led to a performance regression, especially on `x64`.
On `x64` in particular, it is more important to align stores than to align loads. This is because memory operations that cross a cacheline boundary are split, and `x64` CPUs generally have more throughput for loads than for stores, so splitting a store is worse than splitting a load.
On `aarch64`, the results are less clear. On two machines, the differences were marginal, with aligning to loads surprisingly coming out slightly ahead. On another machine, aligning to stores was significantly faster. I suspect performance depends on the exact `aarch64` implementation. I'm not an `aarch64` specialist, and only have access to a limited number of machines.
**Fix**: make automatic alignment configurable with `SuperWordAutomaticAlignment` (no alignment, align to store, align to load). Default is align to store.
For now, I will just align to stores on all platforms. If someone has access to various `aarch64` machines, they are welcome to do deeper investigations. The same goes for other platforms. We could always turn the flag into a platform-dependent one, and set different defaults depending on the exact CPU.
If you are interested, you can read my investigations/benchmark results below. There are a lot of colorful plots 📈 😊
**FYI about Vector API:** if you are working with the Vector API, you may also want to worry about **alignment**, because there can be a **significant performance impact** (30%+ in some cases). You may also want to know about **4k aliasing**, discussed below.
**Shoutout:**
- @jatin-bhateja filed the regression, and explained that it was about split stores.
- @mhaessig helped me talk through some of the early benchmarks.
- @iwanowww pointed me to the 4k aliasing explanation.
--------------------
**Introduction**
I had long lived with the **theory that on modern CPUs, misalignment has no consequences, in particular no performance impact**. When you search online, many sources say that misalignment used to be an issue on older CPUs, but is not any more.
That may **technically** be true:
- A misaligned load or store that does not cross a cacheline boundary has no performance difference to an aligned load or store that does not cross a cacheline boundary.
- But: **A misaligned load or store that crosses a cacheline boundary is slower** than a misaligned load or store that does not cross a cacheline boundary. The reason is that a load or store that crosses a cacheline boundary is split, which means we now have two memory accesses instead of one.
**So there is a connection**: alignment means the load or store cannot cross a cacheline boundary, assuming a cacheline is at least as long as the load / store (e.g. a 64-byte cacheline and a load / store of 64 bytes or smaller). Conversely, a misaligned load has a good chance of crossing a cacheline boundary. Especially when we are auto-vectorizing, we are accessing a contiguous block of memory, so if our accesses are misaligned, we must cross a cacheline boundary at some point. Hence, **alignment has a performance impact in vectorization**.
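To make this concrete, here is a small sketch (illustration only, not part of the benchmarks in this PR) that checks whether an access of a given size at a given byte offset crosses a 64-byte cacheline boundary:

public class CachelineCross {
    static final int CACHELINE = 64;

    // An access of accessBytes starting at offsetBytes crosses a cacheline
    // boundary iff it starts in one 64-byte line and ends in the next.
    static boolean crossesCacheline(long offsetBytes, int accessBytes) {
        return (offsetBytes % CACHELINE) + accessBytes > CACHELINE;
    }

    public static void main(String[] args) {
        // A 64-byte (16-int) vector splits at every offset except multiples of 64.
        System.out.println(crossesCacheline(8, 64));  // true
        // A 16-byte (4-int) vector only splits when offsetBytes % 64 > 48.
        System.out.println(crossesCacheline(8, 16));  // false
        System.out.println(crossesCacheline(56, 16)); // true
    }
}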
If we have a load and a store, but because of relative misalignment we can only align one of them: is it better to align the load or the store? Generally, x64 CPUs have more throughput for loads than for stores. Splitting a load means issuing more loads, which is not as bad as splitting a store and pushing more stores through the CPU. Hence, in most cases, it is better to align the store, and accept that the load is split.
The above holds for `x64`, but on `aarch64` things are a little different / more complicated. For example, I found [JEP 315](https://openjdk.org/jeps/315), which mentions:
> Avoid unaligned memory access if needed. Some CPU implementations impose penalties when issuing load/store instructions across a 16-byte boundary, a dcache-line boundary, or have different optimal alignment for different load/store instructions (see, for example, the Cortex A53 guide). If the aligned versions of intrinsics do not slow down code execution on alignment-independent CPUs, it may be beneficial to improve address alignment to help those CPUs that do have some penalties, provided it does not significantly increase code complexity.
For the aarch64 machine I use, a Neoverse N1, the [N1 Optimization Guide](https://developer.arm.com/documentation/109896/latest/) says:
> 4.5 Load/Store alignment
> The Armv8.2-A architecture allows many types of load and store accesses to be arbitrarily aligned. The Neoverse N1 handles most unaligned accesses without performance penalties. However, there are cases which reduce bandwidth or incur additional latency, as described below.
>- Load operations that cross a cache-line (64-byte) boundary.
>- Quad-word load operations that are not 4B aligned.
>- Store operations that cross a 16B boundary.
Checking a few other manuals, it is mostly about the 64-byte cacheline boundary for loads, and the 16-byte boundary for stores. These chips have the `neon` vector instructions, which are at most 16 bytes (128 bits) wide.
From this I would personally conclude that with full alignment to vector length, there should be maximum performance. But the results below make me question that, and it seems I don't have the full picture yet.
------------------------
**Initial investigation using the Vector API**
With the Vector API, we can produce code where we have direct control over what vector instructions are generated, including their alignment. This means we can start with some experiments independent of the auto vectorizer.
I wrote a stand-alone `Benchmark.java`; you can find it at the end of this PR. I did not integrate it, because it is not well suited for regression testing, only for visualization. For regression testing, I am integrating the benchmark `VectorAutoAlignment.java`. I am also integrating `VectorAutoAlignmentVisualization.java`, which can be used to visualize the effect of alignment for the auto vectorizer only.
Consider the following method, where we can vary the alignment of the load and store with `offset_load` and `offset_store`, respectively:
public static void test1L1SVector(int offset_load, int offset_store) {
    for (int i = 0; i < SIZE - 64 - GRID; i += SPECIES.length()) {
        var v = IntVector.fromArray(SPECIES, arr1, base1 + i + offset_load);
        v.intoArray(arr0, base0 + i + offset_store);
    }
}
Let's start with a simple experiment, using `test1L1SVector`, with `SIZE = 2560` (produces clean results because not too many other effects interfere) and `oneArray` (store at the beginning of the array, load from the same array but `SIZE` elements later).
Below are the results for my AVX512 machine, which supports up to 64-byte vectors. I show the results for 64, 32, 16 and 8 byte vectors, i.e. 16, 8, 4, and 2 ints per vector.

x-axis ➡️: `offset_load`
y-axis ⬆️: `offset_store`
We can see that there is a very clear grid for every size, and that the grid repeats with the vector size, i.e. the number of elements per vector. We see that store alignment has a larger effect on performance than load alignment. With 16-element vectors, we can even see a faint diagonal effect of relative alignment between the loads and stores, though I don't know the cause of that effect.
Further: we can see that the smaller the vectors, the less extreme the relative differences appear. For 16-element vectors, runtime varies from `7.5 ms` to `11.5 ms`, but for 2-element vectors it only varies from `24.8 ms` to `29.5 ms`. My theory is that for 64-byte vectors, every unaligned vector is split, leading to roughly a doubling of operations. But for 8-byte vectors, only every 8th crosses a cacheline boundary, and the effect of splitting is thus much smaller.
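To quantify that theory, here is a small counting sketch of my own (under the assumption of a contiguous sweep with stride equal to the vector size):

public class SplitCount {
    // Count how many of the first n contiguous accesses of vecBytes
    // (stride vecBytes), starting at startOffset bytes, cross a 64-byte boundary.
    static int countSplits(long startOffset, int vecBytes, int n) {
        int splits = 0;
        for (int i = 0; i < n; i++) {
            long offset = startOffset + (long) i * vecBytes;
            if ((offset % 64) + vecBytes > 64) {
                splits++;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // With a 4-byte misalignment over 256 accesses:
        System.out.println(countSplits(4, 64, 256)); // 256 -> every 64-byte vector is split
        System.out.println(countSplits(4, 8, 256));  // 32  -> only every 8th 8-byte vector is split
    }
}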
Something else that is also visible in these results: arrays are only 8-byte aligned. Every time I run the benchmark, e.g. for different vector lengths, the alignment of the array base is different. Thus, the "lines" of the "grid" do not always align between different runs of these benchmarks. **This has a quite significant implication for Vector API benchmarks**: if one does not control the alignment of the arrays, one can get drastically unstable measurements; the results can vary very significantly from run to run.
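One way to get deterministic alignment in a Vector API benchmark is to allocate explicitly aligned memory with the Foreign Function & Memory API and control the byte offsets directly. This is just a sketch of that option; it is not what the `Benchmark.java` in this PR does (that one uses plain int arrays):

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteOrder;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class AlignedSegments {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;
    static final int SIZE = 2560; // ints

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // 64-byte aligned allocations: the measured alignment now depends only
            // on the chosen byte offsets, not on where the GC placed an array.
            MemorySegment src = arena.allocate((long) SIZE * 4, 64);
            MemorySegment dst = arena.allocate((long) SIZE * 4, 64);
            long offsetLoadBytes = 4;  // pick any misalignment explicitly
            long offsetStoreBytes = 0;
            for (long i = 0; i < (SIZE - 64) * 4L; i += SPECIES.vectorByteSize()) {
                var v = IntVector.fromMemorySegment(SPECIES, src, i + offsetLoadBytes, ByteOrder.nativeOrder());
                v.intoMemorySegment(dst, i + offsetStoreBytes, ByteOrder.nativeOrder());
            }
        }
    }
}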
The `neon / asimd` N1 aarch64 machine provides vectors up to 128 bits, so we can only display the results for 4 and 2 element vectors of ints:

x-axis ➡️: offset_load
y-axis ⬆️: offset_store
Strangely, it seems only load alignment has a significant effect. That is quite surprising.
------------------------
**Investigating performance loss when crossing cacheline boundary, using Vector API**
While the relative performance differences for different vector lengths already match the theory that only memory accesses that cross a cacheline boundary are split, we now show this effect with a special "skip" benchmark. I ran `Benchmark.java test1L1SVectorSkip 4 2560 oneArray`, i.e. a "skip" benchmark where every int vector has 4 elements, and we skip every 4th vector: `[0 1 2 3 ][4 5 6 7 ][8 9 10 11][ skip ]`. If the cacheline boundary lies where we skip a vector, then we should have no performance loss compared to when we have perfect alignment. At least in theory 😉
On the left, the results for my AVX512 laptop; on the right, the results for the N1 aarch64 machine:

For comparison, the results from above, for the 4-element vectors without skipping:

Generally, we see similar 4-element wide "bands" in both directions.
The results for the AVX512 machine are quite understandable, and very crisp: we have the same repeating grid as without the skip, except that every 4th band in the x and y direction is "skipped", i.e. has the same performance as when aligned. **It seems the theory perfectly applies for my AVX512 machine** 😀
**But the aarch64 results are stranger:** In the x direction, i.e. for loads, every 4th band has better performance. That seems to correspond to the cacheline boundary of 64 bytes, i.e. when we skip, there is no load splitting. In the y direction, i.e. for stores, every 2nd band has better performance. This is surprising, because the non-skipping benchmark did not show any effect in this direction. And it seems to indicate some 32-byte effect, which corresponds neither to the 64-byte cacheline (otherwise we would see an effect only on every 4th band), nor to the 16-byte store boundaries mentioned in the N1 Optimization Guide. **This is really confusing.** We could further investigate the behavior with different element sizes and vector sizes, and different skip methods.
------------------------
**Discovering 4k aliasing artifacts in benchmark, using the Vector API**
On my AVX512 machine, I found an effect that happens around `4k byte` boundaries, i.e. every `1024 ints`. Here for 64, 32 and 16 byte vectors, i.e. 16, 8, and 4 elements, with `SIZE = 2048`, i.e. 8k bytes:

In the lower half triangles, we see the normal grid pattern. Modulo 4k bytes, the loads are ahead of the stores, which may explain why there is no effect. But the upper half triangles have drastically worse performance. The grid is now diagonal, probably dominated by relative alignment rather than absolute alignment. Modulo 4k bytes, the loads are behind the stores - my theory is that this conflicts with the loads having to happen first.
I ran it on a larger grid (offsets from 0-127), and one can see that the effect slowly wears off (from red to orange) - ignore the noise, I had to lower the accuracy to complete this one in reasonable time:

But it seems that on the aarch64 machine, I cannot find this `4k byte` boundary effect.
@iwanowww pointed me to this [article about 4k aliasing](https://github.com/Kobzol/hardware-effects/blob/master/4k-aliasing/README.md). The reason is that store-to-load forwarding at first only operates on the lowest 12 bits of the address, and when it later detects that the rest of the address does not match, this incurs a penalty of a few cycles. Note: I only recently learned about the effects of store-to-load forwarding, see https://github.com/openjdk/jdk/pull/21521 / [JDK-8334431](https://bugs.openjdk.org/browse/JDK-8334431).
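For intuition, here is a minimal sketch of my own (loosely following the idea in the linked article, not one of the benchmarks in this PR) of the kind of loop where 4k aliasing bites: the load that follows the store sits exactly 4096 bytes higher, so the low 12 address bits match while the full addresses differ:

public class FourKAliasing {
    static final int SIZE = 64 * 1024;
    static int[] a = new int[SIZE];
    static int sink;

    static void kernel() {
        int s = 0;
        for (int i = 0; i < SIZE - 1024; i++) {
            a[i] = i;          // store
            s += a[i + 1024];  // load 1024 ints = 4096 bytes above the store:
                               // same low 12 address bits, different address,
                               // so store-to-load forwarding first assumes a
                               // conflict and then backs off with a penalty.
        }
        sink = s; // keep the loads alive
    }

    public static void main(String[] args) {
        for (int r = 0; r < 10_000; r++) {
            kernel();
        }
        System.out.println(sink);
    }
}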
------------------------
**Investigation for automatic alignment in the Auto Vectorizer**
To be able to investigate the performance of the Auto-Vectorizer (SuperWord), I made the automatic alignment configurable with `SuperWordAutomaticAlignment`. We can disable it, align with the store, or align with the load.
The attached JMH benchmark `VectorAutoAlignmentVisualization.bench1L1S`, with automatic alignment disabled, looks like this:

This JMH benchmark is really slow, so we can also use the `Benchmark.java` from below.
I ran it on my AVX512 laptop with `Benchmark.java test1L1SScalar 4 2560 oneArray`:

- Top left: no alignment.
- Top right: align with store.
- Bottom right: align with load.
We can see that with no alignment, we have a grid with right angles (90°). If the stores are aligned, we get about `3.35 ms` runtime, if only the loads are aligned we get about `4.4 ms`, and if neither is aligned `4.9 ms` - if both are aligned we get only `3.2 ms`.
With automatic alignment on stores, we get overall better performance. But we also see the pattern is now diagonal. In most cases, we only have the store aligned, and we get about `3.4-3.5 ms`. But when the load and store are relatively aligned, i.e. on the thin diagonals, then we get only `3.3 ms`. These performance numbers are comparable with the numbers we see on the "no alignment" plot on the horizontal lines where the stores are aligned.
With automatic alignment on loads, we get an average performance that is better than without alignment, but worse than aligning with stores. In most cases only the load is aligned, and we get `4.4 ms`. But on the rare occasion where the loads and stores are relatively aligned, i.e. the thin diagonals, we get `3.2-3.3 ms`. These numbers are comparable with the numbers we see on the "no alignment" plot on the vertical lines, where the loads are aligned.
But which one of these options should we now choose? I.e. what should be the default for `SuperWordAutomaticAlignment`? In general, we do not know the alignment of the load and store, so we should assume that we land on one of the cells at random. Thus, the relevant performance metric is the average over all cells.
The benchmark below does exactly this: it runs the loop for every `offset_load` and `offset_store` combination, essentially computing the average over all combinations.
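For reference, this is roughly the shape of such a benchmark, shown as a hedged JMH-style sketch (names and sizes are mine, not the actual `VectorAutoAlignment.java`; the real benchmark loads and stores within the same array, which is why it needs the aliasing runtime checks mentioned below, whereas this sketch uses two arrays for simplicity):

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class AutoAlignmentSweepSketch {
    static final int SIZE = 2560;
    static final int GRID = 32;

    int[] src = new int[SIZE + GRID];
    int[] dst = new int[SIZE + GRID];

    // One invocation runs the copy loop for every (offsetLoad, offsetStore)
    // combination, so the reported time is effectively the average over all cells.
    @Benchmark
    public void sweepAllOffsets() {
        for (int offsetStore = 0; offsetStore < GRID; offsetStore++) {
            for (int offsetLoad = 0; offsetLoad < GRID; offsetLoad++) {
                copy(offsetLoad, offsetStore);
            }
        }
    }

    // The auto-vectorizer sees a simple copy loop; SuperWordAutomaticAlignment
    // controls whether the pre-loop aligns the store, the load, or neither.
    void copy(int offsetLoad, int offsetStore) {
        for (int i = 0; i < SIZE; i++) {
            dst[i + offsetStore] = src[i + offsetLoad];
        }
    }
}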
------------------------
**Automatic Alignment in SuperWord (Auto Vectorization)**
Results with aliasing runtime checks https://github.com/openjdk/jdk/pull/24278, on `VectorAutoAlignment`, on my AVX512 laptop:

Note: before https://github.com/openjdk/jdk/pull/24278, this benchmark never vectorizes, because we cannot prove that the load and store do not alias.
There are clearly some artifacts around the 4k byte boundaries. See the discussion further up about 4k aliasing.
Other than those artifacts, it is very clear that aligning with stores is the best on my AVX512 CPU. Aligning the loads is significantly worse, and not aligning at all is slightly worse than that. But in any case: vectorization is always very clearly profitable, no matter the alignment.
Running it on an aarch64 neon OCI machine:

The results look quite a bit different. Vectorization is still always profitable, no matter the alignment. But now, it seems aligning loads is fastest, and there is no difference between aligning stores and no alignment at all.
I also ran it on our benchmark servers:
`linux aarch64` (neon):

`linux x64`:

`macosx aarch64` (neon):

`macosx x64`:

`windows x64`:

The `x64` results are fairly consistent: in most cases aligning to stores is best, except for the 4k artifacts.
The `aarch64` results are less clear. On two machines we see that aligning loads is marginally faster, but on one machine aligning to stores is faster. I suspect it may depend on the exact `aarch64` implementation.
------------------------
**Standalone Benchmark.java**
I did not integrate it, because it is not well suited for regression testing, only for visualization. For regression testing, I am integrating the benchmark `VectorAutoAlignment.java`. I am also integrating `VectorAutoAlignmentVisualization.java`, which can be used to visualize the effect of alignment for the auto vectorizer only.
I usually run the benchmark with command-lines like this:
./java -XX:CompileCommand=compileonly,Benchmark*::test* -XX:CompileCommand=printcompilation,Benchmark*::* -Xbatch -XX:+PrintIdeal -XX:CompileCommand=printassembly,Benchmark*::test* -XX:ObjectAlignmentInBytes=8 -XX:CompileCommand=TraceAutoVectorization,Benchmark*::test*,SW_INFO,ALIGN_VECTOR -XX:+TraceLoopOpts -XX:LoopUnrollLimit=60 -XX:MaxVectorSize=64 -XX:SuperWordAutomaticAlignment=0 Benchmark.java test1L1SVector 4 2432 separateArrays
Here are some relevant flags to play with:
- `ObjectAlignmentInBytes`: alignment of objects, i.e. the arrays in these benchmarks. Default is `8` bytes, which means two arrays are only guaranteed a relative alignment of `8` bytes. Hence, it may not always be possible to align the bases of two arrays to more than `8` bytes at the same time, i.e. we can only guarantee 64-byte alignment for at most one of them.
- `TraceAutoVectorization`: with the tag `ALIGN_VECTOR` we can see which memory reference we auto-align.
- `LoopUnrollLimit`: some benchmarks have a rather large loop, and only auto-vectorize if this limit is artificially increased.
- `MaxVectorSize`: we can artificially lower the maximum vector length, possibly breaking larger vectors into multiple smaller ones.
- `SuperWordAutomaticAlignment`: controls if and how we auto-align.
import jdk.incubator.vector.*;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.Set;

public class Benchmark {
    public static int SIZE;
    public static VectorSpecies<Integer> SPECIES;

    public static int[] arr0;
    public static int[] arr1;
    public static int[] arr2;
    public static int[] arr3;

    public static int base0;
    public static int base1;
    public static int base2;
    public static int base3;

    public static void main(String[] args) {
        if (args.length != 4) {
            System.out.println("Error: need 4 arguments, got " + args.length);
            printUsage();
        }
        String benchmarkName = args[0];

        int vectorElements = Integer.parseInt(args[1]);
        if (!Set.of(2, 4, 8, 16).contains(vectorElements)) {
            System.out.println("Error: vectorElements must be 2, 4, 8, or 16, got " + vectorElements);
            printUsage();
        }
        SPECIES = VectorSpecies.of(int.class, VectorShape.forBitSize(vectorElements * 4 * 8));

        SIZE = Integer.parseInt(args[2]);
        if (SIZE < 2000 || SIZE > 100_000) {
            System.out.println("Error: dataSize out of range [2000, 100_000], got " + SIZE);
            printUsage();
        }

        String scenario = args[3];
        switch (scenario) {
            // Load / Store from different arrays. Relative alignment is not known.
            case "separateArrays" -> {
                arr0 = new int[SIZE];
                arr1 = new int[SIZE];
                arr2 = new int[SIZE];
                arr3 = new int[SIZE];
                base0 = 0;
                base1 = 0;
                base2 = 0;
                base3 = 0;
            }
            // Load / Store on same array -> bases have a known relative alignment.
            // Use the whole array, every access has its own "region".
            case "oneArray" -> {
                int[] arr = new int[4 * SIZE];
                arr0 = arr;
                arr1 = arr;
                arr2 = arr;
                arr3 = arr;
                base0 = 0 * SIZE;
                base1 = 1 * SIZE;
                base2 = 2 * SIZE;
                base3 = 3 * SIZE;
            }
            // Load / Store on same array -> bases have a known relative alignment.
            // Small offset -> the memory accesses use the same memory "region".
            case "oneArraySmallOffset" -> {
                int[] arr = new int[4 * SIZE];
                arr0 = arr;
                arr1 = arr;
                arr2 = arr;
                arr3 = arr;
                base0 = 0 * (1024 + 256);
                base1 = 1 * (1024 + 256);
                base2 = 2 * (1024 + 256);
                base3 = 3 * (1024 + 256);
            }
            default -> {
                System.out.println("Error: scenario does not exist: " + scenario);
                printUsage();
            }
        }

        BenchmarkRunner.run(benchmarkName);
    }

    public static void printUsage() {
        System.out.println("Usage: java <jvm flags> Benchmark.java <benchmark> <vectorElements> <dataSize> <scenario>");
        System.out.println(" benchmark:");
        System.out.println(" test1L1SVector test1L1SVectorSkip test1L1SScalar");
        System.out.println(" test2L1SVector test2L1SVectorSkip test2L1SScalar test2L1SScalarRearranged");
        System.out.println(" test3L1SVector test3L1SVectorSkip test3L1SScalar");
        System.out.println(" vectorElements: 2, 4, 8, 16");
        System.out.println(" dataSize: 2000 ... 100_000. Recommended: 2048.");
        System.out.println(" scenario: separateArrays oneArray oneArraySmallOffset");
        System.exit(0);
    }
}
public class BenchmarkRunner {
    // Make sure the runner has all these fields final, so we get a better chance at optimisation.
    public static final int SIZE = Benchmark.SIZE;
    public static final VectorSpecies<Integer> SPECIES = Benchmark.SPECIES;

    public static final int REPS = 50_000; // Repeat REPS times for a benchmark measurement.
    public static final int RUNS = 5; // Each benchmark measurement is repeated RUNS times, and MIN runtime is chosen.
    public static final int GRID = 32;

    public static int[] arr0 = Benchmark.arr0; // store
    public static int[] arr1 = Benchmark.arr1; // load
    public static int[] arr2 = Benchmark.arr2; // load
    public static int[] arr3 = Benchmark.arr3; // load

    public static final int base0 = Benchmark.base0;
    public static final int base1 = Benchmark.base1;
    public static final int base2 = Benchmark.base2;
    public static final int base3 = Benchmark.base3;

    interface GridBenchmark {
        void run(int offset_load, int offset_store);
    }

    public static void run(String benchmarkName) {
        switch (benchmarkName) {
            case "test1L1SVector" -> benchmarkGrid(BenchmarkRunner::test1L1SVector);
            case "test2L1SVector" -> benchmarkGrid(BenchmarkRunner::test2L1SVector);
            case "test3L1SVector" -> benchmarkGrid(BenchmarkRunner::test3L1SVector);
            case "test1L1SVectorSkip" -> benchmarkGrid(BenchmarkRunner::test1L1SVectorSkip);
            case "test2L1SVectorSkip" -> benchmarkGrid(BenchmarkRunner::test2L1SVectorSkip);
            case "test3L1SVectorSkip" -> benchmarkGrid(BenchmarkRunner::test3L1SVectorSkip);
            case "test1L1SScalar" -> benchmarkGrid(BenchmarkRunner::test1L1SScalar);
            case "test2L1SScalar" -> benchmarkGrid(BenchmarkRunner::test2L1SScalar);
            case "test3L1SScalar" -> benchmarkGrid(BenchmarkRunner::test3L1SScalar);
            case "test2L1SScalarRearranged" -> benchmarkGrid(BenchmarkRunner::test2L1SScalarRearranged);
            default -> {
                System.out.println("Error: benchmark does not exist: " + benchmarkName);
                Benchmark.printUsage();
            }
        }
        System.out.println("Done: " + benchmarkName);
        System.out.println("x-axis (->) LOAD_OFFSET");
        System.out.println("y-axis (up) STORE_OFFSET");
        System.out.println("offset_load: load alignment shift");
        System.out.println("offset_store: store alignment shift");
    }

    public static void benchmarkGrid(GridBenchmark gt) {
        System.out.println("Initial Warmup");
        for (int i = 0; i < 10 * REPS; i++) {
            gt.run(0, 0);
        }

        ArrayList<String> list = new ArrayList<>();
        float total = 0;
        for (int offset_store = 0; offset_store < GRID; offset_store++) {
            String line = "";
            for (int offset_load = 0; offset_load < GRID; offset_load++) {
                float t = Float.POSITIVE_INFINITY;
                for (int i = 0; i < RUNS; i++) {
                    t = Math.min(t, benchmark(offset_load, offset_store, gt));
                }
                total += t;
                line += String.format("%.5f ", t);
            }
            System.out.println(line);
            list.add(line);
        }

        System.out.println("Results [ms]:");
        // reverse the list, so the 0/0 point is at the bottom left.
        for (var line : list.reversed()) {
            System.out.println(line);
        }
        System.out.println("total [ms]: " + total);
    }

    public static float benchmark(int offset_load, int offset_store, GridBenchmark gt) {
        for (int i = 0; i < REPS; i++) {
            gt.run(offset_load, offset_store);
        }
        long t0 = System.nanoTime();
        for (int i = 0; i < REPS; i++) {
            gt.run(offset_load, offset_store);
        }
        long t1 = System.nanoTime();
        float t = (t1 - t0) * 1e-6f;
        return t;
    }
    public static void vector1L1S(int offset_load, int offset_store, int i) {
        var v = IntVector.fromArray(SPECIES, arr1, base1 + i + offset_load);
        v.intoArray(arr0, base0 + i + offset_store);
    }

    public static void vector2L1S(int offset_load, int offset_store, int i) {
        var v0 = IntVector.fromArray(SPECIES, arr1, base1 + i + offset_load);
        var v1 = IntVector.fromArray(SPECIES, arr2, base2 + i + offset_load);
        var v = v0.add(v1);
        v.intoArray(arr0, base0 + i + offset_store);
    }

    public static void vector3L1S(int offset_load, int offset_store, int i) {
        var v0 = IntVector.fromArray(SPECIES, arr1, base1 + i + offset_load);
        var v1 = IntVector.fromArray(SPECIES, arr2, base2 + i + offset_load);
        var v2 = IntVector.fromArray(SPECIES, arr3, base3 + i + offset_load);
        var v = v0.add(v1).add(v2);
        v.intoArray(arr0, base0 + i + offset_store);
    }

    public static void test1L1SVector(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - 64 - GRID; i += SPECIES.length()) {
            vector1L1S(offset_load, offset_store, i);
        }
    }

    public static void test2L1SVector(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - 64 - GRID; i += SPECIES.length()) {
            vector2L1S(offset_load, offset_store, i);
        }
    }

    public static void test3L1SVector(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - 64 - GRID; i += SPECIES.length()) {
            vector3L1S(offset_load, offset_store, i);
        }
    }

    // This one is to prove that the split happens on the cache line.
    //
    // Note: this is written for vectorElements = 4
    public static void test1L1SVectorSkip(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - 64 - GRID; i += 16) {
            vector1L1S(offset_load, offset_store, i + 0);
            vector1L1S(offset_load, offset_store, i + 4);
            vector1L1S(offset_load, offset_store, i + 8);
            // Skip the "i + 12" step, so we do not always go over the cache line.
        }
    }

    // This one is to prove that the split happens on the cache line.
    //
    // Note: this is written for vectorElements = 4
    public static void test2L1SVectorSkip(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - 64 - GRID; i += 16) {
            vector2L1S(offset_load, offset_store, i + 0);
            vector2L1S(offset_load, offset_store, i + 4);
            vector2L1S(offset_load, offset_store, i + 8);
            // Skip the "i + 12" step, so we do not always go over the cache line.
        }
    }

    // This one is to prove that the split happens on the cache line.
    //
    // Note: this is written for vectorElements = 4
    public static void test3L1SVectorSkip(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - 64 - GRID; i += 16) {
            vector3L1S(offset_load, offset_store, i + 0);
            vector3L1S(offset_load, offset_store, i + 4);
            vector3L1S(offset_load, offset_store, i + 8);
            // Skip the "i + 12" step, so we do not always go over the cache line.
        }
    }

    // Requires aliasing analysis runtime check JDK-8324751 to vectorize.
    public static void test1L1SScalar(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - GRID; i++) {
            int v = arr1[base1 + i + offset_load];
            arr0[base0 + i + offset_store] = v;
        }
    }

    // Requires aliasing analysis runtime check JDK-8324751 to vectorize.
    public static void test2L1SScalar(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - GRID; i++) {
            int v0 = arr1[base1 + i + offset_load];
            int v1 = arr2[base2 + i + offset_load];
            var v = v0 + v1;
            arr0[base0 + i + offset_store] = v;
        }
    }

    // Requires aliasing analysis runtime check JDK-8324751 to vectorize.
    public static void test3L1SScalar(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - GRID; i++) {
            int v0 = arr1[base1 + i + offset_load];
            int v1 = arr2[base2 + i + offset_load];
            int v2 = arr3[base3 + i + offset_load];
            var v = v0 + v1 + v2;
            arr0[base0 + i + offset_store] = v;
        }
    }

    // Vectorizes even without JDK-8324751, but requires -XX:LoopUnrollLimit=10000 because loop body is large.
    // Automatic alignment is ineffective here, because of the hand-unrolling -> pre-loop cannot change alignment.
    // Gets us some funky patterns, as automatic alignment sometimes seems to actually make things slightly worse.
    //
    // Note: this test does not react to vectorElements.
    public static void test2L1SScalarRearranged(int offset_load, int offset_store) {
        for (int i = 0; i < SIZE - 4 - GRID; i += 4) {
            int v00 = arr1[base1 + i + offset_load + 0];
            int v10 = arr1[base1 + i + offset_load + 1];
            int v20 = arr1[base1 + i + offset_load + 2];
            int v30 = arr1[base1 + i + offset_load + 3];
            int v01 = arr2[base2 + i + offset_load + 0];
            int v11 = arr2[base2 + i + offset_load + 1];
            int v21 = arr2[base2 + i + offset_load + 2];
            int v31 = arr2[base2 + i + offset_load + 3];
            var v0 = v00 + v01;
            var v1 = v10 + v11;
            var v2 = v20 + v21;
            var v3 = v30 + v31;
            arr0[base0 + i + offset_store + 0] = v0;
            arr0[base0 + i + offset_store + 1] = v1;
            arr0[base0 + i + offset_store + 2] = v2;
            arr0[base0 + i + offset_store + 3] = v3;
        }
    }
}
-------------
Commit messages:
- Merge branch 'master' into JDK-8355094-SW-alignment
- improve benchmarks
- rename bench
- more comments
- fix whitespace
- another benchmark
- JDK-8355094
Changes: https://git.openjdk.org/jdk/pull/25065/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25065&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8355094
Stats: 341 lines in 4 files changed: 340 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/25065.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/25065/head:pull/25065
PR: https://git.openjdk.org/jdk/pull/25065
More information about the hotspot-compiler-dev mailing list