Array addition and array sum Panama benchmarks
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Sep 29 09:26:42 UTC 2025
Hi Antoine,
Thanks for the reply. All credit here goes to Emanuel (cc'ed). I believe
the main issues with memory segments and autovectorization were fixed as
part of this:
https://bugs.openjdk.org/browse/JDK-8324751
You might also want to watch his great JVMLS talk:
https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/
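For context, here is a minimal sketch of the kind of scalar loop over memory segments discussed in this thread. It is not taken from the benchmark repository: the class name SegmentAddSketch and the method add are hypothetical, and it assumes JDK 22 or later, where the java.lang.foreign API is final. The second call passes two overlapping views of the same memory, which is the aliasing case that the auto-vectorizer has to rule out or guard against before it can vectorize a source/target loop.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical names; an illustration, not the code from the benchmark repository.
class SegmentAddSketch {

    // Plain scalar loop: dst[i] = dst[i] + src[i], one double per iteration.
    static void add(MemorySegment src, MemorySegment dst, int count) {
        for (int i = 0; i < count; i++) {
            double a = src.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            double b = dst.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            dst.setAtIndex(ValueLayout.JAVA_DOUBLE, i, a + b);
        }
    }

    public static void main(String[] args) {
        int n = 8;
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment src = arena.allocate((long) n * Double.BYTES, Double.BYTES);
            MemorySegment dst = arena.allocate((long) n * Double.BYTES, Double.BYTES);
            MemorySegment acc = arena.allocate((long) n * Double.BYTES, Double.BYTES);
            for (int i = 0; i < n; i++) {
                src.setAtIndex(ValueLayout.JAVA_DOUBLE, i, (double) i);
                dst.setAtIndex(ValueLayout.JAVA_DOUBLE, i, 1.0);
                acc.setAtIndex(ValueLayout.JAVA_DOUBLE, i, 1.0);
            }

            // Disjoint segments: each element is independent of the others.
            add(src, dst, n);
            System.out.println(dst.getAtIndex(ValueLayout.JAVA_DOUBLE, 3)); // 3.0 + 1.0

            // Overlapping views: 'shifted' starts one element into 'acc', so
            // iteration i reads the value written by iteration i - 1.
            MemorySegment shifted = acc.asSlice(Double.BYTES);
            add(acc, shifted, n - 1);
            System.out.println(acc.getAtIndex(ValueLayout.JAVA_DOUBLE, n - 1)); // running sum
        }
    }
}
```

In the overlapping call, the scalar loop produces a running sum, so its result depends on the order in which elements are processed. Nothing in the method signature tells the compiler that two segments are disjoint, which is why the loop cannot be blindly turned into vector loads and stores.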
Cheers
Maurizio
On 29/09/2025 10:11, Antoine Chambille wrote:
> Hello,
>
> I've run the array addition benchmark again, on JDK 25 and JDK 26 EA.
> Looks like the performance issues I’d been tracking for a while have
> been solved in JDK 26.
> https://github.com/chamb/panama-benchmarks
>
> Auto vectorisation of scalar loops seems to work when using
> MemorySegment and is even faster than with java arrays or the vector
> API. Also loops with long stride don't prevent auto vectorisation
> anymore.
>
> Not sure exactly who we owe these improvements to, but it's awesome!
> Here's another use case where we can confidently switch from Unsafe to
> MemorySegment. The dream would be to see these enhancements land in
> JDK 25, of course...
>
>
> JDK 25
>
> Benchmark                                Mode  Cnt     Score     Error  Units
> AddBenchmark.scalarArrayArray            avgt    5   167.028 ±   5.604  ns/op
> AddBenchmark.scalarArrayArrayLongStride  avgt    5   925.673 ±  37.766  ns/op
> AddBenchmark.scalarSegmentArray          avgt    5   550.540 ±   3.576  ns/op
> AddBenchmark.scalarSegmentSegment        avgt    5   548.861 ±   1.852  ns/op
> AddBenchmark.scalarUnsafeArray           avgt    5   600.489 ± 219.285  ns/op
> AddBenchmark.scalarUnsafeUnsafe          avgt    5   776.975 ±  11.601  ns/op
> AddBenchmark.unrolledArrayArray          avgt    5   863.526 ±  58.822  ns/op
> AddBenchmark.unrolledSegmentArray        avgt    5   584.230 ±  13.863  ns/op
> AddBenchmark.unrolledUnsafeArray         avgt    5   584.898 ±  15.792  ns/op
> AddBenchmark.unrolledUnsafeUnsafe        avgt    5   761.445 ±  59.935  ns/op
> AddBenchmark.vectorArrayArray            avgt    5   177.288 ±   0.653  ns/op
> AddBenchmark.vectorArraySegment          avgt    5   141.381 ±   1.211  ns/op
> AddBenchmark.vectorSegmentArray          avgt    5   141.576 ±   3.077  ns/op
> AddBenchmark.vectorSegmentSegment        avgt    5   217.639 ±   5.076  ns/op
>
>
> JDK 26 b17
>
> Benchmark                                Mode  Cnt     Score     Error  Units
> AddBenchmark.scalarArrayArray            avgt    5   209.653 ±   5.990  ns/op
> AddBenchmark.scalarArrayArrayLongStride  avgt    5   209.948 ±  12.925  ns/op
> *AddBenchmark.scalarSegmentArray         avgt    5   111.790 ±   5.971  ns/op
> AddBenchmark.scalarSegmentSegment        avgt    5   136.414 ±   3.900  ns/op*
> AddBenchmark.scalarUnsafeArray           avgt    5   657.565 ±   4.705  ns/op
> AddBenchmark.scalarUnsafeUnsafe          avgt    5   832.016 ± 210.295  ns/op
> AddBenchmark.unrolledArrayArray          avgt    5  1095.963 ± 153.910  ns/op
> AddBenchmark.unrolledSegmentArray        avgt    5   138.410 ±  11.933  ns/op
> AddBenchmark.unrolledUnsafeArray         avgt    5   685.867 ±  27.075  ns/op
> AddBenchmark.unrolledUnsafeUnsafe        avgt    5   817.802 ±  30.841  ns/op
> AddBenchmark.vectorArrayArray            avgt    5   149.027 ±   1.269  ns/op
> AddBenchmark.vectorArraySegment          avgt    5   164.590 ±   7.283  ns/op
> AddBenchmark.vectorSegmentArray          avgt    5   196.908 ±   5.610  ns/op
> AddBenchmark.vectorSegmentSegment        avgt    5   242.377 ±   5.488  ns/op
>
>
> Best,
> -Antoine
>
> On Mon, Sep 30, 2024 at 2:16 PM Antoine Chambille <ach at activeviam.com>
> wrote:
>
> Hi Maurizio, thanks for the quick response. Looking forward to it.
> -Antoine
>
> On Mon, Sep 30, 2024 at 2:11 PM Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>
> Hi Antoine,
> auto-vectorization on memory segments doesn't work in some
> cases. This issue is mostly due to:
>
> https://bugs.openjdk.org/browse/JDK-8324751
>
> That is, when working with a "source" and a "target" segment,
> if the auto-vectorizer cannot prove that the two segments are
> disjoint, no vectorization occurs.
>
> This is an issue for operations like add, or copy, but it's
> not an issue with something like MemorySegment::fill (as that
> method only works on a single segment).
>
> We hope to be able to make some progress on this issue, as
> that will allow third-party routines on memory segments to enjoy
> vectorization too, without needing an intrinsic in the JDK.
>
> Maurizio
>
>
>
>
> On 30/09/2024 13:04, Antoine Chambille wrote:
>> Hello everyone,
>>
>> I've rebuilt the latest OpenJDK (24) from
>> https://github.com/openjdk/panama-vector and run the arrays
>> addition benchmark another time:
>>
>> AddBenchmark
>> .scalarArrayArray thrpt 5 6487636 ops/s
>> .scalarArrayArrayLongStride thrpt 5 1001515 ops/s
>> .scalarSegmentArray thrpt 5 1747531 ops/s
>> .scalarSegmentSegment thrpt 5 1154193 ops/s
>> .scalarUnsafeArray thrpt 5 6970073 ops/s
>> .scalarUnsafeUnsafe thrpt 5 1246625 ops/s
>> .unrolledArrayArray thrpt 5 1251824 ops/s
>> .unrolledSegmentArray thrpt 5 1694164 ops/s
>> .unrolledUnsafeArray thrpt 5 5043685 ops/s
>> .unrolledUnsafeUnsafe thrpt 5 1197024 ops/s
>> .vectorArrayArray thrpt 5 7200224 ops/s
>> .vectorArraySegment thrpt 5 7377553 ops/s
>> .vectorSegmentArray thrpt 5 7263505 ops/s
>> .vectorSegmentSegment thrpt 5 7143647 ops/s
>>
>> * Performance using the vector API is now very consistent
>> and good across arrays and segments.
>> * Reading and writing from/to segments still seems to be
>> disrupting auto-vectorization. Reading with Unsafe works
>> well but it's marked for removal.
>> * Less important, manual unrolling also seems to be
>> disrupting auto-vectorization.
>>
>>
>>
>> Best,
>> -Antoine
>>
>> On Tue, Mar 26, 2024 at 5:40 PM Vladimir Ivanov
>> <vladimir.x.ivanov at oracle.com> wrote:
>>
>>
>> >> Personally, I prefer to see vectorizer handling "MoveX2Y (LoadX mem)"
>> >> => "VectorReinterpret (LoadVector mem)" well and then introduce rules to
>> >> strength-reduce it to mismatched access.
>> >
>> > Do I understand you right that you're saying the vector node for MoveL2D
>> > (for instance) is VectorReinterpret so we could vectorize the code.
>> >
>> > Are you then suggesting that we can transform:
>> >
>> > (VectorReinterpret (LoadVector mem))
>> >
>> > into:
>> >
>> > (LoadVector mem)
>> >
>> > with that LoadVector a mismatched access?
>>
>> Yes, but thinking more about it, the latter step may be optional. For
>> example, VectorReinterpret implementation on x86 is a no-op, so not much
>> gained from folding VectorReinterpret+LoadVector into a mismatched
>> LoadVector.
>>
>> Best regards,
>> Vladimir Ivanov
>>