Array addition and array sum Panama benchmarks

Mon Sep 29 09:11:11 UTC 2025

Hello,

I've rerun the array addition benchmark on JDK 25 and on a JDK 26
early-access build. It looks like the performance issues I'd been tracking
for a while have been resolved in JDK 26.
https://github.com/chamb/panama-benchmarks

Auto-vectorisation of scalar loops now seems to work when using
MemorySegment, and is even faster than with Java arrays or the Vector API.
Also, loops with a long stride no longer prevent auto-vectorisation.
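
For readers who have not opened the repository, here is a minimal sketch
of the kind of scalar loop in question. It is not the benchmark code
itself: the class and method names are made up for illustration, the
array size is arbitrary, and it assumes that "long stride" refers to a
loop driven by a long-typed counter rather than an int.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical sketch, not taken from the benchmark repository.
final class SegmentAddSketch {

    // Plain scalar loop with an int counter: output[i] += input[i].
    static void addScalar(MemorySegment input, MemorySegment output, int size) {
        for (int i = 0; i < size; i++) {
            double sum = output.getAtIndex(ValueLayout.JAVA_DOUBLE, i)
                       + input.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            output.setAtIndex(ValueLayout.JAVA_DOUBLE, i, sum);
        }
    }

    // Same loop driven by a long counter (assumed meaning of "long stride").
    static void addScalarLongCounter(MemorySegment input, MemorySegment output, long size) {
        for (long i = 0; i < size; i++) {
            double sum = output.getAtIndex(ValueLayout.JAVA_DOUBLE, i)
                       + input.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            output.setAtIndex(ValueLayout.JAVA_DOUBLE, i, sum);
        }
    }

    public static void main(String[] args) {
        int size = 1024; // arbitrary size chosen for the sketch
        try (Arena arena = Arena.ofConfined()) {
            // Two separate native allocations, aligned for double access.
            MemorySegment input = arena.allocate(size * Double.BYTES, Double.BYTES);
            MemorySegment output = arena.allocate(size * Double.BYTES, Double.BYTES);
            for (int i = 0; i < size; i++) {
                input.setAtIndex(ValueLayout.JAVA_DOUBLE, i, (double) i);
                output.setAtIndex(ValueLayout.JAVA_DOUBLE, i, 1.0);
            }
            addScalar(input, output, size);
            addScalarLongCounter(input, output, size);
            System.out.println(output.getAtIndex(ValueLayout.JAVA_DOUBLE, 3));
        }
    }
}
```

Whether C2 actually vectorises loops like these depends on the JDK build
and the hardware, so the only way to know is to measure with JMH as in
the results below.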

Not sure exactly who we owe these improvements to, but it's awesome! Here's
another use case where we can confidently switch from Unsafe to
MemorySegment. The dream would be to see these enhancements land in JDK 25,
of course...

JDK 25

Benchmark                                Mode  Cnt     Score     Error  Units
AddBenchmark.scalarArrayArray            avgt    5   167.028 ±   5.604  ns/op
AddBenchmark.scalarArrayArrayLongStride  avgt    5   925.673 ±  37.766  ns/op
AddBenchmark.scalarSegmentArray          avgt    5   550.540 ±   3.576  ns/op
AddBenchmark.scalarSegmentSegment        avgt    5   548.861 ±   1.852  ns/op
AddBenchmark.scalarUnsafeArray           avgt    5   600.489 ± 219.285  ns/op
AddBenchmark.scalarUnsafeUnsafe          avgt    5   776.975 ±  11.601  ns/op
AddBenchmark.unrolledArrayArray          avgt    5   863.526 ±  58.822  ns/op
AddBenchmark.unrolledSegmentArray        avgt    5   584.230 ±  13.863  ns/op
AddBenchmark.unrolledUnsafeArray         avgt    5   584.898 ±  15.792  ns/op
AddBenchmark.unrolledUnsafeUnsafe        avgt    5   761.445 ±  59.935  ns/op
AddBenchmark.vectorArrayArray            avgt    5   177.288 ±   0.653  ns/op
AddBenchmark.vectorArraySegment          avgt    5   141.381 ±   1.211  ns/op
AddBenchmark.vectorSegmentArray          avgt    5   141.576 ±   3.077  ns/op
AddBenchmark.vectorSegmentSegment        avgt    5   217.639 ±   5.076  ns/op

JDK 26 b17

Benchmark                                Mode  Cnt     Score     Error  Units
AddBenchmark.scalarArrayArray            avgt    5   209.653 ±   5.990  ns/op
AddBenchmark.scalarArrayArrayLongStride  avgt    5   209.948 ±  12.925  ns/op
AddBenchmark.scalarSegmentArray          avgt    5   111.790 ±   5.971  ns/op
AddBenchmark.scalarSegmentSegment        avgt    5   136.414 ±   3.900  ns/op
AddBenchmark.scalarUnsafeArray           avgt    5   657.565 ±   4.705  ns/op
AddBenchmark.scalarUnsafeUnsafe          avgt    5   832.016 ± 210.295  ns/op
AddBenchmark.unrolledArrayArray          avgt    5  1095.963 ± 153.910  ns/op
AddBenchmark.unrolledSegmentArray        avgt    5   138.410 ±  11.933  ns/op
AddBenchmark.unrolledUnsafeArray         avgt    5   685.867 ±  27.075  ns/op
AddBenchmark.unrolledUnsafeUnsafe        avgt    5   817.802 ±  30.841  ns/op
AddBenchmark.vectorArrayArray            avgt    5   149.027 ±   1.269  ns/op
AddBenchmark.vectorArraySegment          avgt    5   164.590 ±   7.283  ns/op
AddBenchmark.vectorSegmentArray          avgt    5   196.908 ±   5.610  ns/op
AddBenchmark.vectorSegmentSegment        avgt    5   242.377 ±   5.488  ns/op

Best,
-Antoine

On Mon, Sep 30, 2024 at 2:16 PM Antoine Chambille <ach at activeviam.com>
wrote:

> Hi Maurizio, thanks for the quick response. Looking forward to it.
> -Antoine
>
> On Mon, Sep 30, 2024 at 2:11 PM Maurizio Cimadamore <
> maurizio.cimadamore at oracle.com> wrote:
>
>> Hi Antoine,
>> auto-vectorization on memory segments doesn't work in some cases. This
>> issue is mostly due to:
>>
>> https://bugs.openjdk.org/browse/JDK-8324751
>>
>> That is, when working with a "source" and a "target" segment, if the
>> auto-vectorizer cannot prove that the two segments are disjoint, no
>> vectorization occurs.
>>
>> This is an issue for operations like add, or copy, but it's not an issue
>> with something like MemorySegment::fill (as that method only works on a
>> single segment).
>>
>> We hope to be able to make some progress on this issue, as that will
>> allow 3rd party routines on memory segments to enjoy vectorization too,
>> w/o the need for an intrinsic in the JDK.
>>
>> Maurizio
>>
>>
>>
>>
>> On 30/09/2024 13:04, Antoine Chambille wrote:
>>
>> Hello everyone,
>>
>> I've rebuilt the latest OpenJDK (24) from
>> https://github.com/openjdk/panama-vector and run the arrays addition
>> benchmark another time:
>>
>> AddBenchmark
>>  .scalarArrayArray            thrpt    5   6487636 ops/s
>>  .scalarArrayArrayLongStride  thrpt    5   1001515 ops/s
>>  .scalarSegmentArray          thrpt    5   1747531 ops/s
>>  .scalarSegmentSegment        thrpt    5   1154193 ops/s
>>  .scalarUnsafeArray           thrpt    5   6970073 ops/s
>>  .scalarUnsafeUnsafe          thrpt    5   1246625 ops/s
>>  .unrolledArrayArray          thrpt    5   1251824 ops/s
>>  .unrolledSegmentArray        thrpt    5   1694164 ops/s
>>  .unrolledUnsafeArray         thrpt    5   5043685 ops/s
>>  .unrolledUnsafeUnsafe        thrpt    5   1197024 ops/s
>>  .vectorArrayArray            thrpt    5   7200224 ops/s
>>  .vectorArraySegment          thrpt    5   7377553 ops/s
>>  .vectorSegmentArray          thrpt    5   7263505 ops/s
>>  .vectorSegmentSegment        thrpt    5   7143647 ops/s
>>
>>
>>    - Performance using the vector API is now very consistent and good
>>    across arrays and segments.
>>    - Reading and writing from/to segments still seems to be disrupting
>>    auto-vectorization. Reading with Unsafe works well but it's marked for
>>    removal.
>>    - Less important, manual unrolling also seems to be disrupting
>>    auto-vectorization.
>>
>>
>>
>> Best,
>> -Antoine
>>
>> On Tue, Mar 26, 2024 at 5:40 PM Vladimir Ivanov <
>> vladimir.x.ivanov at oracle.com> wrote:
>>
>>>
>>> >> Personally, I prefer to see vectorizer handling "MoveX2Y (LoadX mem)"
>>> >> => "VectorReinterpret (LoadVector mem)" well and then introduce rules
>>> >> to strength-reduce it to mismatched access.
>>> >
>>> > Do I understand you right that you're saying the vector node for
>>> > MoveL2D (for instance) is VectorReinterpret so we could vectorize the
>>> > code.
>>> >
>>> > Are you then suggesting that we can transform:
>>> >
>>> > (VectorReinterpret (LoadVector mem))
>>> >
>>> > into:
>>> >
>>> > (LoadVector mem)
>>> >
>>> > with that LoadVector a mismatched access?
>>>
>>> Yes, but thinking more about it, the latter step may be optional. For
>>> example, VectorReinterpret implementation on x86 is a no-op, so not much
>>> gained from folding VectorReinterpret+LoadVector into a mismatched
>>> LoadVector.
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>