Array addition and array sum Panama benchmarks

Mon Sep 29 09:26:42 UTC 2025

Hi Antoine,
Thanks for the reply. All credit here goes to Emanuel (cc'ed). I believe 
the main issues with memory segments and autovectorization were fixed as 
part of this:

https://bugs.openjdk.org/browse/JDK-8324751
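
For anyone following the thread, the pattern in question is a plain scalar loop that reads from one segment and writes to another. Here is a minimal sketch, assuming heap-backed segments; the `SegmentAdd` class and its `add` method are hypothetical names for illustration and are not taken from the benchmark repository:

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration class, not part of the benchmark repository.
final class SegmentAdd {

    // Element-wise dst[i] += src[i] over 'count' doubles.
    // The JIT can only vectorize this safely if it can account for
    // possible overlap between dst and src.
    public static void add(MemorySegment dst, MemorySegment src, long count) {
        for (long i = 0; i < count; i++) {
            double sum = dst.getAtIndex(ValueLayout.JAVA_DOUBLE, i)
                       + src.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            dst.setAtIndex(ValueLayout.JAVA_DOUBLE, i, sum);
        }
    }

    public static void main(String[] args) {
        double[] a = new double[1024];
        double[] b = new double[1024];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
            b[i] = 2.0 * i;
        }
        // Heap segments are views over the arrays, so writes land in 'a'.
        MemorySegment dst = MemorySegment.ofArray(a);
        MemorySegment src = MemorySegment.ofArray(b);
        add(dst, src, a.length);
        System.out.println(a[3]);
    }
}
```

Whether C2 actually vectorizes such a loop depends on the JDK build and the hardware; the sketch only shows the shape of the code being discussed.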

You might also want to watch his great JVMLS talk:

https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/

Cheers
Maurizio

On 29/09/2025 10:11, Antoine Chambille wrote:
> Hello,
>
> I've run the array addition benchmark again, on JDK 25 and JDK 26-ea. 
> Looks like the performance issues I’d been tracking for a while have 
> been solved in JDK 26.
> https://github.com/chamb/panama-benchmarks
>
> Auto vectorisation of scalar loops seems to work when using 
> MemorySegment and is even faster than with java arrays or the vector 
> API. Also loops with long stride don't prevent auto vectorisation 
> anymore.
>
> Not sure exactly who we owe these improvements to, but it's awesome! 
> Here's another use case where we can confidently switch from Unsafe to 
> MemorySegment. The dream would be to see these enhancements land in 
> JDK 25, of course...
>
>
> JDK 25
>
> Benchmark                                Mode  Cnt     Score      Error  Units
> AddBenchmark.scalarArrayArray            avgt    5   167.028 ±   5.604  ns/op
> AddBenchmark.scalarArrayArrayLongStride  avgt    5   925.673 ±  37.766  ns/op
> AddBenchmark.scalarSegmentArray          avgt    5   550.540 ±   3.576  ns/op
> AddBenchmark.scalarSegmentSegment        avgt    5   548.861 ±   1.852  ns/op
> AddBenchmark.scalarUnsafeArray           avgt    5   600.489 ± 219.285  ns/op
> AddBenchmark.scalarUnsafeUnsafe          avgt    5   776.975 ±  11.601  ns/op
> AddBenchmark.unrolledArrayArray          avgt    5   863.526 ±  58.822  ns/op
> AddBenchmark.unrolledSegmentArray        avgt    5   584.230 ±  13.863  ns/op
> AddBenchmark.unrolledUnsafeArray         avgt    5   584.898 ±  15.792  ns/op
> AddBenchmark.unrolledUnsafeUnsafe        avgt    5   761.445 ±  59.935  ns/op
> AddBenchmark.vectorArrayArray            avgt    5   177.288 ±   0.653  ns/op
> AddBenchmark.vectorArraySegment          avgt    5   141.381 ±   1.211  ns/op
> AddBenchmark.vectorSegmentArray          avgt    5   141.576 ±   3.077  ns/op
> AddBenchmark.vectorSegmentSegment        avgt    5   217.639 ±   5.076  ns/op
>
>
> JDK 26 b17
>
> Benchmark                                Mode  Cnt     Score      Error  Units
> AddBenchmark.scalarArrayArray            avgt    5   209.653 ±   5.990  ns/op
> AddBenchmark.scalarArrayArrayLongStride  avgt    5   209.948 ±  12.925  ns/op
> *AddBenchmark.scalarSegmentArray         avgt    5   111.790 ±   5.971  ns/op
> AddBenchmark.scalarSegmentSegment        avgt    5   136.414 ±   3.900  ns/op*
> AddBenchmark.scalarUnsafeArray           avgt    5   657.565 ±   4.705  ns/op
> AddBenchmark.scalarUnsafeUnsafe          avgt    5   832.016 ± 210.295  ns/op
> AddBenchmark.unrolledArrayArray          avgt    5  1095.963 ± 153.910  ns/op
> AddBenchmark.unrolledSegmentArray        avgt    5   138.410 ±  11.933  ns/op
> AddBenchmark.unrolledUnsafeArray         avgt    5   685.867 ±  27.075  ns/op
> AddBenchmark.unrolledUnsafeUnsafe        avgt    5   817.802 ±  30.841  ns/op
> AddBenchmark.vectorArrayArray            avgt    5   149.027 ±   1.269  ns/op
> AddBenchmark.vectorArraySegment          avgt    5   164.590 ±   7.283  ns/op
> AddBenchmark.vectorSegmentArray          avgt    5   196.908 ±   5.610  ns/op
> AddBenchmark.vectorSegmentSegment        avgt    5   242.377 ±   5.488  ns/op
>
>
> Best,
> -Antoine
>
> On Mon, Sep 30, 2024 at 2:16 PM Antoine Chambille <ach at activeviam.com> 
> wrote:
>
>     Hi Maurizio, thanks for the quick response. Looking forward to it.
>     -Antoine
>
>     On Mon, Sep 30, 2024 at 2:11 PM Maurizio Cimadamore
>     <maurizio.cimadamore at oracle.com> wrote:
>
>         Hi Antoine,
>         auto-vectorization on memory segments doesn't work in some
>         cases. This issue is mostly due to:
>
>         https://bugs.openjdk.org/browse/JDK-8324751
>
>         That is, when working with a "source" and a "target" segment,
>         if the auto-vectorizer cannot prove that the two segments are
>         disjoint, no vectorization occurs.
>
>         This is an issue for operations like add, or copy, but it's
>         not an issue with something like MemorySegment::fill (as that
>         method only works on a single segment).
>
>         We hope to be able to make some progress on this issue, as
>         that will allow 3rd party routines on memory segments to enjoy
>         vectorization too, without needing an intrinsic in the JDK.
>
>         Maurizio
>
>
>
>
>         On 30/09/2024 13:04, Antoine Chambille wrote:
>>         Hello everyone,
>>
>>         I've rebuilt the latest OpenJDK (24) from
>>         https://github.com/openjdk/panama-vector and run the arrays
>>         addition benchmark another time:
>>
>>         AddBenchmark
>>          .scalarArrayArray            thrpt    5   6487636 ops/s
>>          .scalarArrayArrayLongStride  thrpt    5   1001515 ops/s
>>          .scalarSegmentArray          thrpt    5   1747531 ops/s
>>          .scalarSegmentSegment        thrpt    5   1154193 ops/s
>>          .scalarUnsafeArray           thrpt    5   6970073 ops/s
>>          .scalarUnsafeUnsafe          thrpt    5   1246625 ops/s
>>          .unrolledArrayArray          thrpt    5   1251824 ops/s
>>          .unrolledSegmentArray        thrpt    5   1694164 ops/s
>>          .unrolledUnsafeArray         thrpt    5   5043685 ops/s
>>          .unrolledUnsafeUnsafe        thrpt    5   1197024 ops/s
>>          .vectorArrayArray            thrpt    5   7200224 ops/s
>>          .vectorArraySegment          thrpt    5   7377553 ops/s
>>          .vectorSegmentArray          thrpt    5   7263505 ops/s
>>          .vectorSegmentSegment        thrpt    5   7143647 ops/s
>>
>>           * Performance using the vector API is now very consistent
>>             and good across arrays and segments.
>>           * Reading and writing from/to segments still seems to be
>>             disrupting auto-vectorization. Reading with Unsafe works
>>             well but it's marked for removal.
>>           * Less important, manual unrolling also seems to be
>>             disrupting auto-vectorization.
>>
>>
>>
>>         Best,
>>         -Antoine
>>
>>         On Tue, Mar 26, 2024 at 5:40 PM Vladimir Ivanov
>>         <vladimir.x.ivanov at oracle.com> wrote:
>>
>>
>>             >> Personally, I prefer to see vectorizer handling
>>             >> "MoveX2Y (LoadX mem)" => "VectorReinterpret (LoadVector mem)"
>>             >> well and then introduce rules to strength-reduce it to
>>             >> mismatched access.
>>             >
>>             > Do I understand you right that you're saying the vector
>>             > node for MoveL2D (for instance) is VectorReinterpret so we
>>             > could vectorize the code.
>>             >
>>             > Are you then suggesting that we can transform:
>>             >
>>             > (VectorReinterpret (LoadVector mem))
>>             >
>>             > into:
>>             >
>>             > (LoadVector mem)
>>             >
>>             > with that LoadVector a mismatched access?
>>
>>             Yes, but thinking more about it, the latter step may be
>>             optional. For example, VectorReinterpret implementation on
>>             x86 is a no-op, so not much gained from folding
>>             VectorReinterpret+LoadVector into a mismatched LoadVector.
>>
>>             Best regards,
>>             Vladimir Ivanov
>>