RFR: 8365290: [perf] x86 ArrayFill intrinsic generates SPLIT_STORE for unaligned arrays [v4]
Emanuel Peter
epeter at openjdk.org
Tue Sep 9 06:46:13 UTC 2025
On Sat, 30 Aug 2025 16:04:00 GMT, Vladimir Ivanov <vaivanov at openjdk.org> wrote:
>> On the SRF platform for runs with intrinsic scores for the ArrayFill test reports ~2x drop for several sizes due to a lot of the 'MEM_UOPS_RETIRED.SPLIT_STORES' events. The 'good' case for the ArraysFill.testCharFill with size=8195 reports numbers like
>> MEM_UOPS_RETIRED.SPLIT_LOADS | 22.6711
>> MEM_UOPS_RETIRED.SPLIT_STORES | 4.0859
>> while for 'bad' case these metrics are
>> MEM_UOPS_RETIRED.SPLIT_LOADS | 69.1785
>> MEM_UOPS_RETIRED.SPLIT_STORES | 259200.3659
>>
>> With alignment on the cache size no score drops due to split_stores but small reduction may be reported due to extra
>> SRF 6740E | Size | orig | pathed | pO/orig
>> -- | -- | -- | -- | --
>> ArraysFill.testByteFill | 16 | 152031.2 | 157001.2 | 1.03
>> ArraysFill.testByteFill | 31 | 125795.9 | 177399.2 | 1.41
>> ArraysFill.testByteFill | 250 | 57961.69 | 120981.9 | 2.09
>> ArraysFill.testByteFill | 266 | 44900.15 | 147893.8 | 3.29
>> ArraysFill.testByteFill | 511 | 61908.17 | 129830.1 | 2.10
>> ArraysFill.testByteFill | 2047 | 32255.51 | 41986.6 | 1.30
>> ArraysFill.testByteFill | 2048 | 31928.97 | 42154.3 | 1.32
>> ArraysFill.testByteFill | 8195 | 10690.15 | 11036.3 | 1.03
>> ArraysFill.testIntFill | 16 | 145030.7 | 318796.9 | 2.20
>> ArraysFill.testIntFill | 31 | 134138.4 | 212487 | 1.58
>> ArraysFill.testIntFill | 250 | 74179.23 | 79522.66 | 1.07
>> ArraysFill.testIntFill | 266 | 68112.72 | 60116.49 | 0.88
>> ArraysFill.testIntFill | 511 | 39693.28 | 36225.09 | 0.91
>> ArraysFill.testIntFill | 2047 | 11504.14 | 10616.91 | 0.92
>> ArraysFill.testIntFill | 2048 | 11244.71 | 10969.14 | 0.98
>> ArraysFill.testIntFill | 8195 | 2751.289 | 2692.216 | 0.98
>> ArraysFill.testLongFill | 16 | 212532.5 | 212526 | 1.00
>> ArraysFill.testLongFill | 31 | 137432.4 | 137283.3 | 1.00
>> ArraysFill.testLongFill | 250 | 43185 | 43159.78 | 1.00
>> ArraysFill.testLongFill | 266 | 42172.22 | 42170.5 | 1.00
>> ArraysFill.testLongFill | 511 | 23370.15 | 23370.86 | 1.00
>> ArraysFill.testLongFill | 2047 | 6123.008 | 6122.73 | 1.00
>> ArraysFill.testLongFill | 2048 | 5793.722 | 5792.855 | 1.00
>> ArraysFill.testLongFill | 8195 | 616.552 | 616.585 | 1.00
>> ArraysFill.testShortFill | 16 | 152088.6 | 265646.1 | 1.75
>> ArraysFill.testShortFill | 31 | 137369.8 | 185596.4 | 1.35
>> ArraysFill.testShortFill | 250 | 58872.03 | 99621.15 | 1.69
>> ArraysFill.testShortFill | 266 | 91085.31 | 93746.62 | 1.03
>> ArraysFill.testShortFill | 511 | 65331.96 | 78003.83 | 1.19
>> ArraysFill.testShortFill | 2047 | 21716.32 | 21216.81 | 0.98
>> ArraysFill.testShortFill...
>
> Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision:
>
> JDK-8365290 [perf] x86 ArrayFill intrinsic generates SPLIT_STORE for unaligned arrays
I got a bit inspired and did some digging.
TLDR: we need better benchmarks, I think.
The current ones don't control for alignment, and are quite unstable with only 3 forks.
----------------------
I'm wondering if the benchmark `./test/micro/org/openjdk/bench/java/util/ArraysFill.java` is really well suited to give us good information about unaligned arrays. All it does is allocate arrays, and we have no information about their alignment at all. If I'm not wrong, we only have 3 forks, so we do this:
- 3x do:
  - allocate: maybe aligned, maybe misaligned
  - warmup
  - benchmark
This only really gives us 3 alignment configurations. And in all cases, we start filling from the beginning of the array. That gives us really noisy information regarding alignment.
What I'd love to see is some benchmark where we can manually adjust alignment, and iterate over all possible alignments/misalignments. We could do that by using `Arrays.fill(array, start, end, value)`, and filling only a part of the array, starting at different starting offsets.
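A minimal sketch of that idea, assuming a 64-byte cache line and a byte array. The class name `FillOffsetSweep`, the helper `fillAt`, and the crude `System.nanoTime` loop are all hypothetical stand-ins; in a real JMH benchmark the offset would more likely be a parameter of the benchmark. The base address of the array is still unknown (and the GC may move it), but sweeping the start offset over a whole 64-byte window at least visits every alignment of the first store relative to a cache line.

```java
import java.util.Arrays;

// Hypothetical sketch, not an existing OpenJDK benchmark.
public class FillOffsetSweep {
    // Assumed cache-line size in bytes; the sweep covers one full window.
    static final int WINDOW = 64;

    // Fill `length` bytes starting at `offset`, leaving the rest untouched.
    static void fillAt(byte[] array, int offset, int length, byte value) {
        Arrays.fill(array, offset, offset + length, value);
    }

    public static void main(String[] args) {
        int length = 8195; // one of the sizes from the ArraysFill numbers
        byte[] array = new byte[length + WINDOW];
        for (int offset = 0; offset < WINDOW; offset++) {
            // Crude warmup so the fill is likely compiled before timing.
            for (int i = 0; i < 10_000; i++) {
                fillAt(array, offset, length, (byte) i);
            }
            long start = System.nanoTime();
            for (int i = 0; i < 20_000; i++) {
                fillAt(array, offset, length, (byte) i);
            }
            long nanos = System.nanoTime() - start;
            System.out.printf("offset=%2d  %.1f ns/fill%n", offset, nanos / 20_000.0);
        }
    }
}
```

The timing loop is only there to show the shape of the sweep; it is no substitute for JMH's warmup, forking and statistics.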
Honestly, I'm a little shocked to find no JMH benchmark using `Arrays.fill(array, start, end, value)`; they all seem to use `Arrays.fill(array, value)`.
We should also compare to SuperWord performance, to see how well the intrinsic is doing in comparison. Personally, my hope would be that eventually auto-vectorization gets as close as possible to the intrinsics, though I'm not sure exactly how close we are yet. Here is a related tracking issue: [JDK-8299808](https://bugs.openjdk.org/browse/JDK-8299808)
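For such a comparison, the two variants could look roughly like this. This is only a sketch: the class and method names are hypothetical, and whether C2 auto-vectorizes the hand-written loop or instead recognizes it as a fill loop and routes it to the fill stub is something to verify, not something this code guarantees. The `main` method only checks that both variants write the same elements for a few start offsets.

```java
import java.util.Arrays;

// Hypothetical sketch of the two variants one might benchmark side by side.
public class FillLoopVsLibrary {
    // Hand-written ranged fill: a candidate for auto-vectorization.
    static void fillLoop(int[] a, int start, int end, int value) {
        for (int i = start; i < end; i++) {
            a[i] = value;
        }
    }

    // Library call with an explicit range.
    static void fillLibrary(int[] a, int start, int end, int value) {
        Arrays.fill(a, start, end, value);
    }

    public static void main(String[] args) {
        int[] x = new int[300];
        int[] y = new int[300];
        // Shift the filled range across several start offsets.
        for (int start = 0; start < 16; start++) {
            fillLoop(x, start, start + 266, start + 1);
            fillLibrary(y, start, start + 266, start + 1);
            if (!Arrays.equals(x, y)) {
                throw new AssertionError("mismatch at start=" + start);
            }
        }
        System.out.println("loop and Arrays.fill agree");
    }
}
```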
Let me see if there are any other related benchmarks we could use here:
`./test/micro/org/openjdk/bench/vm/compiler/ArrayFill.java`
It probably tests for automatic detection of fill loops, which should also use the fill intrinsic. But we also have no real control of alignment there.
I did some previous work on alignment, though for auto-vectorization:
- `test/micro/org/openjdk/bench/vm/compiler/VectorAutoAlignment.java`
  - Memory segment with native memory. Not directly relevant here, but the method of going over all sorts of `offset_load` and `offset_store` is helpful for array fill and copy.
I'm now looking at copy intrinsics too (`System.arraycopy`). Important to note: we just allowed many more copy patterns to auto-vectorize with [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751) (Aliasing runtime check), so we have more things to compare to here.
- `./test/micro/org/openjdk/bench/java/lang/ArrayCopy.java`
  - only does full array copy, so no control for alignment
- `./test/micro/org/openjdk/bench/java/lang/ArrayCopyAligned.java`
  - I think it claims to always work with an aligned base. But that's probably not true any more with Lilliput, which changes the alignment of the first element for some element types.
- The following do a similar thing, so I have no idea if they really measure what they promise. They also only measure one alignment configuration, which is a weakness.
  - `./test/micro/org/openjdk/bench/java/lang/ArrayCopyUnalignedBoth.java`
  - `./test/micro/org/openjdk/bench/java/lang/ArrayCopyUnalignedDst.java`
  - `./test/micro/org/openjdk/bench/java/lang/ArrayCopyUnalignedSrc.java`
- `test/micro/org/openjdk/bench/java/lang/ArrayFiddle.java`
  - claims that intrinsics are a lot better than hand-written loops. We may want to revisit that claim since [JDK-8324751](https://bugs.openjdk.org/browse/JDK-8324751) (Aliasing runtime check).
Then there are also the `MemorySegment` fill and copy benchmarks. But maybe we leave those for another day.
-----------------------
I'm filing an RFE to create better `fill` and `copy` benchmarks.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26747#issuecomment-3269114783