Array addition and array sum Panama benchmarks
Antoine Chambille
ach at activeviam.com
Mon Sep 29 09:11:11 UTC 2025
Hello,
I've rerun the array addition benchmark on JDK 25 and on a JDK 26
early-access build (b17). It looks like the performance issues I'd been
tracking for a while are resolved in JDK 26.
https://github.com/chamb/panama-benchmarks
Auto-vectorisation of scalar loops now seems to work with MemorySegment,
and is even faster than with Java arrays or the Vector API. Loops with a
long stride no longer prevent auto-vectorisation either.
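For readers who haven't opened the repository, the scalar segment variants boil down to a plain indexed loop over MemorySegment accessors. What follows is a minimal sketch rather than the benchmark code itself: the class name, the method names, the element type (double) and the driver are assumptions made for illustration.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical class, not taken from the benchmark repository.
class ScalarSegmentAdd {

    // Segment-to-segment: output[i] += input[i], written as a plain scalar loop.
    static void addSegmentSegment(MemorySegment input, MemorySegment output, int count) {
        for (int i = 0; i < count; i++) {
            double sum = output.getAtIndex(ValueLayout.JAVA_DOUBLE, i)
                       + input.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            output.setAtIndex(ValueLayout.JAVA_DOUBLE, i, sum);
        }
    }

    // Segment-to-array: the same loop, accumulating into a heap array.
    static void addSegmentArray(MemorySegment input, double[] output) {
        for (int i = 0; i < output.length; i++) {
            output[i] += input.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
        }
    }

    // Hypothetical driver: fills input with i and output with 2*i, then adds.
    static double[] demo(int count) {
        try (Arena arena = Arena.ofConfined()) {
            long bytes = (long) count * Double.BYTES;
            MemorySegment input = arena.allocate(bytes, Double.BYTES);
            MemorySegment output = arena.allocate(bytes, Double.BYTES);
            for (int i = 0; i < count; i++) {
                input.setAtIndex(ValueLayout.JAVA_DOUBLE, i, i);
                output.setAtIndex(ValueLayout.JAVA_DOUBLE, i, 2.0 * i);
            }
            addSegmentSegment(input, output, count);
            // Copy the result to the heap before the arena frees the memory.
            return output.toArray(ValueLayout.JAVA_DOUBLE);
        }
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(demo(4)));
    }
}
```

Whether C2 actually vectorises a loop like this depends on the JDK build and the hardware, so it is worth measuring with JMH rather than relying on the sketch.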
Not sure exactly who we owe these improvements to, but it's awesome! Here's
another use case where we can confidently switch from Unsafe to
MemorySegment. The dream would be to see these enhancements land in JDK 25,
of course...
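To illustrate what the switch from Unsafe buys beyond speed, here is a rough sketch of off-heap access through MemorySegment. Where Unsafe.allocateMemory hands back a raw address that must be freed by hand, the segment is bounds-checked and tied to the lifetime of its Arena. The class and method names are hypothetical and the sizes are arbitrary.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical class, for illustration only.
class OffHeapBuffer {

    // Allocates count doubles off-heap, fills them with 1..count and sums them.
    // The memory is released when the try-with-resources block exits.
    static double sumOffHeap(int count) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate((long) count * Double.BYTES, Double.BYTES);
            for (int i = 0; i < count; i++) {
                seg.setAtIndex(ValueLayout.JAVA_DOUBLE, i, i + 1);
            }
            double sum = 0;
            for (int i = 0; i < count; i++) {
                sum += seg.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            }
            return sum;
        }
    }

    // Reading one element past the end throws instead of touching stray memory.
    static boolean outOfBoundsIsRejected() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(8L * Double.BYTES, Double.BYTES);
            try {
                seg.getAtIndex(ValueLayout.JAVA_DOUBLE, 8);
                return false;
            } catch (IndexOutOfBoundsException expected) {
                return true;
            }
        }
    }

    // Accessing a segment after its arena is closed throws as well.
    static boolean useAfterCloseIsRejected() {
        MemorySegment seg;
        try (Arena arena = Arena.ofConfined()) {
            seg = arena.allocate(8L * Double.BYTES, Double.BYTES);
        }
        try {
            seg.getAtIndex(ValueLayout.JAVA_DOUBLE, 0);
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOffHeap(4));
        System.out.println(outOfBoundsIsRejected());
        System.out.println(useAfterCloseIsRejected());
    }
}
```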
JDK 25
Benchmark                                 Mode  Cnt     Score     Error  Units
AddBenchmark.scalarArrayArray             avgt    5   167.028 ±   5.604  ns/op
AddBenchmark.scalarArrayArrayLongStride   avgt    5   925.673 ±  37.766  ns/op
AddBenchmark.scalarSegmentArray           avgt    5   550.540 ±   3.576  ns/op
AddBenchmark.scalarSegmentSegment         avgt    5   548.861 ±   1.852  ns/op
AddBenchmark.scalarUnsafeArray            avgt    5   600.489 ± 219.285  ns/op
AddBenchmark.scalarUnsafeUnsafe           avgt    5   776.975 ±  11.601  ns/op
AddBenchmark.unrolledArrayArray           avgt    5   863.526 ±  58.822  ns/op
AddBenchmark.unrolledSegmentArray         avgt    5   584.230 ±  13.863  ns/op
AddBenchmark.unrolledUnsafeArray          avgt    5   584.898 ±  15.792  ns/op
AddBenchmark.unrolledUnsafeUnsafe         avgt    5   761.445 ±  59.935  ns/op
AddBenchmark.vectorArrayArray             avgt    5   177.288 ±   0.653  ns/op
AddBenchmark.vectorArraySegment           avgt    5   141.381 ±   1.211  ns/op
AddBenchmark.vectorSegmentArray           avgt    5   141.576 ±   3.077  ns/op
AddBenchmark.vectorSegmentSegment         avgt    5   217.639 ±   5.076  ns/op
JDK 26 b17
 Benchmark                                 Mode  Cnt     Score     Error  Units
 AddBenchmark.scalarArrayArray             avgt    5   209.653 ±   5.990  ns/op
 AddBenchmark.scalarArrayArrayLongStride   avgt    5   209.948 ±  12.925  ns/op
*AddBenchmark.scalarSegmentArray           avgt    5   111.790 ±   5.971  ns/op*
*AddBenchmark.scalarSegmentSegment         avgt    5   136.414 ±   3.900  ns/op*
 AddBenchmark.scalarUnsafeArray            avgt    5   657.565 ±   4.705  ns/op
 AddBenchmark.scalarUnsafeUnsafe           avgt    5   832.016 ± 210.295  ns/op
 AddBenchmark.unrolledArrayArray           avgt    5  1095.963 ± 153.910  ns/op
 AddBenchmark.unrolledSegmentArray         avgt    5   138.410 ±  11.933  ns/op
 AddBenchmark.unrolledUnsafeArray          avgt    5   685.867 ±  27.075  ns/op
 AddBenchmark.unrolledUnsafeUnsafe         avgt    5   817.802 ±  30.841  ns/op
 AddBenchmark.vectorArrayArray             avgt    5   149.027 ±   1.269  ns/op
 AddBenchmark.vectorArraySegment           avgt    5   164.590 ±   7.283  ns/op
 AddBenchmark.vectorSegmentArray           avgt    5   196.908 ±   5.610  ns/op
 AddBenchmark.vectorSegmentSegment         avgt    5   242.377 ±   5.488  ns/op
Best,
-Antoine
On Mon, Sep 30, 2024 at 2:16 PM Antoine Chambille <ach at activeviam.com>
wrote:
> Hi Maurizio, thanks for the quick response. Looking forward to it.
> -Antoine
>
> On Mon, Sep 30, 2024 at 2:11 PM Maurizio Cimadamore <
> maurizio.cimadamore at oracle.com> wrote:
>
>> Hi Antoine,
>> auto-vectorization on memory segments doesn't work in some cases. This
>> issue is mostly due to:
>>
>> https://bugs.openjdk.org/browse/JDK-8324751
>>
>> That is, when working with a "source" and a "target" segment, if the
>> auto-vectorizer cannot prove that the two segments are disjoint, no
>> vectorization occurs.
>>
>> This is an issue for operations like add, or copy, but it's not an issue
>> with something like MemorySegment::fill (as that method only works on a
>> single segment).
>>
>> We hope to be able to make some progress on this issue, as that will
>> allow 3rd party routines on memory segment to enjoy vectorization too w/o
>> the need of having an intrinsics in the JDK.
>>
>> Maurizio
>>
>>
>>
>>
>> On 30/09/2024 13:04, Antoine Chambille wrote:
>>
>> Hello everyone,
>>
>> I've rebuilt the latest OpenJDK (24) from
>> https://github.com/openjdk/panama-vector and run the arrays addition
>> benchmark another time:
>>
>> AddBenchmark
>> .scalarArrayArray thrpt 5 6487636 ops/s
>> .scalarArrayArrayLongStride thrpt 5 1001515 ops/s
>> .scalarSegmentArray thrpt 5 1747531 ops/s
>> .scalarSegmentSegment thrpt 5 1154193 ops/s
>> .scalarUnsafeArray thrpt 5 6970073 ops/s
>> .scalarUnsafeUnsafe thrpt 5 1246625 ops/s
>> .unrolledArrayArray thrpt 5 1251824 ops/s
>> .unrolledSegmentArray thrpt 5 1694164 ops/s
>> .unrolledUnsafeArray thrpt 5 5043685 ops/s
>> .unrolledUnsafeUnsafe thrpt 5 1197024 ops/s
>> .vectorArrayArray thrpt 5 7200224 ops/s
>> .vectorArraySegment thrpt 5 7377553 ops/s
>> .vectorSegmentArray thrpt 5 7263505 ops/s
>> .vectorSegmentSegment thrpt 5 7143647 ops/s
>>
>>
>> - Performance using the vector API is now very consistent and good
>> across arrays and segments.
>> - Reading and writing from/to segments still seems to be disrupting
>> auto-vectorization. Reading with Unsafe works well but it's marked for
>> removal.
>> - Less important, manual unrolling also seems to be disrupting
>> auto-vectorization.
>>
>>
>>
>> Best,
>> -Antoine
>>
>> On Tue, Mar 26, 2024 at 5:40 PM Vladimir Ivanov <
>> vladimir.x.ivanov at oracle.com> wrote:
>>
>>>
>>> >> Personally, I prefer to see vectorizer handling "MoveX2Y (LoadX mem)"
>>> >> => "VectorReinterpret (LoadVector mem)" well and then introduce rules to
>>> >> strength-reduce it to mismatched access.
>>> >
>>> > Do I understand you right that you're saying the vector node for MoveL2D
>>> > (for instance) is VectorReinterpret so we could vectorize the code.
>>> >
>>> > Are you then suggesting that we can transform:
>>> >
>>> > (VectorReinterpret (LoadVector mem))
>>> >
>>> > into:
>>> >
>>> > (LoadVector mem)
>>> >
>>> > with that LoadVector a mismatched access?
>>>
>>> Yes, but thinking more about it, the latter step may be optional. For
>>> example, VectorReinterpret implementation on x86 is a no-op, so not much
>>> gained from folding VectorReinterpret+LoadVector into a mismatched
>>> LoadVector.
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>