<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>Hi Antoine,<br>
Thanks for the reply. All credit here goes to Emanuel (cc'ed). I
believe the main issues with memory segments and auto-vectorization
were fixed as part of this:</p>
<p><a class="moz-txt-link-freetext" href="https://bugs.openjdk.org/browse/JDK-8324751">https://bugs.openjdk.org/browse/JDK-8324751</a></p>
<p>You might also want to watch his great JVMLS talk:</p>
<p><a class="moz-txt-link-freetext" href="https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/">https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/</a></p>
<p>Cheers<br>
Maurizio<br>
</p>
<p><br>
</p>
<div class="moz-cite-prefix">On 29/09/2025 10:11, Antoine Chambille
wrote:<br>
</div>
<blockquote type="cite" cite="mid:CAJGQDwmPbKX-9JWu9f=0Zf+G1+B9NC+1LETQ7aSK3njoX96+eA@mail.gmail.com">
<div dir="ltr">Hello,<br>
<br>
I've rerun the array addition benchmark on JDK 25 and
JDK 26 EA. The performance issues I’d been tracking
for a while appear to be solved in JDK 26.<br>
<a href="https://github.com/chamb/panama-benchmarks" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/chamb/panama-benchmarks</a><br>
<br>
Auto-vectorisation of scalar loops now seems to work with
MemorySegment, and is even faster than with Java arrays or the
Vector API. Loops with a long stride no longer prevent
auto-vectorisation either.
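<p>For reference, a minimal sketch of the loop shapes being compared. It is simplified and is not the exact code from the benchmark repository; the class name, the helper <code>addViaSegment</code>, and the reading of "long stride" as a <code>long</code>-typed loop counter are assumptions made for illustration.</p>

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical, not the benchmark's actual code.
final class ScalarAddSketch {

    // Plain scalar loop reading from a segment and accumulating into an array.
    static void scalarSegmentArray(MemorySegment input, double[] output) {
        for (int i = 0; i < output.length; i++) {
            output[i] += input.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
        }
    }

    // Assumption: "long stride" means the loop counter is a long rather than an int.
    static void scalarArrayArrayLongStride(double[] input, double[] output) {
        for (long i = 0; i < output.length; i += 1L) {
            output[(int) i] += input[(int) i];
        }
    }

    // Convenience wrapper: views the input array as a heap segment, then adds.
    static double[] addViaSegment(double[] input, double[] output) {
        scalarSegmentArray(MemorySegment.ofArray(input), output);
        return output;
    }

    public static void main(String[] args) {
        double[] out = addViaSegment(new double[] {1, 2, 3, 4},
                                     new double[] {10, 20, 30, 40});
        System.out.println(Arrays.toString(out));
    }
}
```

<p>Both loops are ordinary scalar code; whether they get vectorised is entirely up to the JIT, which is what the numbers below measure.</p>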
<div><br>
Not sure exactly who we owe these improvements to, but it's
awesome! Here's another use case where we can confidently
switch from Unsafe to MemorySegment. The dream would be to see
these enhancements land in JDK 25, of course...<br>
<br>
<br>
JDK 25
<div><br>
<font face="monospace">Benchmark
Mode Cnt Score Error Units<br>
AddBenchmark.scalarArrayArray avgt 5
167.028 ± 5.604 ns/op<br>
AddBenchmark.scalarArrayArrayLongStride avgt 5
925.673 ± 37.766 ns/op<br>
AddBenchmark.scalarSegmentArray avgt 5
550.540 ± 3.576 ns/op<br>
AddBenchmark.scalarSegmentSegment avgt 5
548.861 ± 1.852 ns/op<br>
AddBenchmark.scalarUnsafeArray avgt 5
600.489 ± 219.285 ns/op<br>
AddBenchmark.scalarUnsafeUnsafe avgt 5
776.975 ± 11.601 ns/op<br>
AddBenchmark.unrolledArrayArray avgt 5
863.526 ± 58.822 ns/op<br>
AddBenchmark.unrolledSegmentArray avgt 5
584.230 ± 13.863 ns/op<br>
AddBenchmark.unrolledUnsafeArray avgt 5
584.898 ± 15.792 ns/op<br>
AddBenchmark.unrolledUnsafeUnsafe avgt 5
761.445 ± 59.935 ns/op<br>
AddBenchmark.vectorArrayArray avgt 5
177.288 ± 0.653 ns/op<br>
AddBenchmark.vectorArraySegment avgt 5
141.381 ± 1.211 ns/op<br>
AddBenchmark.vectorSegmentArray avgt 5
141.576 ± 3.077 ns/op<br>
AddBenchmark.vectorSegmentSegment avgt 5
217.639 ± 5.076 ns/op</font><br>
<br>
<br>
JDK 26 b17
<div><br>
<font face="monospace">Benchmark
Mode Cnt Score Error Units<br>
AddBenchmark.scalarArrayArray avgt 5
209.653 ± 5.990 ns/op<br>
AddBenchmark.scalarArrayArrayLongStride avgt 5
209.948 ± 12.925 ns/op<br>
<b>AddBenchmark.scalarSegmentArray avgt 5
111.790 ± 5.971 ns/op<br>
AddBenchmark.scalarSegmentSegment avgt 5
136.414 ± 3.900 ns/op</b><br>
AddBenchmark.scalarUnsafeArray avgt 5
657.565 ± 4.705 ns/op<br>
AddBenchmark.scalarUnsafeUnsafe avgt 5
832.016 ± 210.295 ns/op<br>
AddBenchmark.unrolledArrayArray avgt 5
1095.963 ± 153.910 ns/op<br>
AddBenchmark.unrolledSegmentArray avgt 5
138.410 ± 11.933 ns/op<br>
AddBenchmark.unrolledUnsafeArray avgt 5
685.867 ± 27.075 ns/op<br>
AddBenchmark.unrolledUnsafeUnsafe avgt 5
817.802 ± 30.841 ns/op<br>
AddBenchmark.vectorArrayArray avgt 5
149.027 ± 1.269 ns/op<br>
AddBenchmark.vectorArraySegment avgt 5
164.590 ± 7.283 ns/op<br>
AddBenchmark.vectorSegmentArray avgt 5
196.908 ± 5.610 ns/op<br>
AddBenchmark.vectorSegmentSegment avgt 5
242.377 ± 5.488 ns/op</font>
<div><font face="monospace"><br>
</font></div>
<div><font face="monospace"><br>
</font></div>
<div><font face="monospace">Best,</font></div>
<div><font face="monospace">-Antoine</font></div>
</div>
</div>
</div>
</div>
<br>
<div class="gmail_quote gmail_quote_container">
<div dir="ltr" class="gmail_attr">On Mon, Sep 30, 2024 at
2:16 PM Antoine Chambille &lt;<a href="mailto:ach@activeviam.com" moz-do-not-send="true" class="moz-txt-link-freetext">ach@activeviam.com</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">Hi Maurizio, thanks for the quick response.
Looking forward to it.<br>
<div>-Antoine</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Sep 30, 2024 at
2:11 PM Maurizio Cimadamore &lt;<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">maurizio.cimadamore@oracle.com</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi Antoine,<br>
auto-vectorization on memory segments doesn't work in
some cases. This issue is mostly due to:</p>
<p><a href="https://bugs.openjdk.org/browse/JDK-8324751" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://bugs.openjdk.org/browse/JDK-8324751</a></p>
<p>That is, when working with a "source" and a "target"
segment, if the auto-vectorizer cannot prove that the
two segments are disjoint, no vectorization occurs.</p>
<p>This is an issue for operations like add, or copy,
but it's not an issue with something like
MemorySegment::fill (as that method only works on a
single segment).</p>
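<p>To illustrate why disjointness matters, here is a minimal sketch (the class and method names are hypothetical). The same scalar add loop gives a different answer when the target overlaps the source shifted by one element, so a version that loads several source elements before storing any result would not be equivalent unless the overlap can be ruled out.</p>

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical.
final class AliasingSketch {

    // Scalar loop: dst[i] = dst[i] + src[i], one element at a time.
    static void add(MemorySegment src, MemorySegment dst, long count) {
        for (long i = 0; i < count; i++) {
            double sum = dst.getAtIndex(ValueLayout.JAVA_DOUBLE, i)
                       + src.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            dst.setAtIndex(ValueLayout.JAVA_DOUBLE, i, sum);
        }
    }

    // Source and target are backed by different arrays, so they cannot overlap.
    static double[] disjointExample() {
        double[] a = {1, 2, 3, 4};
        double[] b = {10, 20, 30, 40};
        add(MemorySegment.ofArray(a), MemorySegment.ofArray(b), 4);
        return b;
    }

    // Source and target are slices of the same array, offset by one element,
    // so each iteration reads the value stored by the previous iteration.
    static double[] overlappingExample() {
        double[] data = {1, 1, 1, 1, 1};
        MemorySegment whole = MemorySegment.ofArray(data);
        MemorySegment src = whole.asSlice(0, 4L * Double.BYTES);
        MemorySegment dst = whole.asSlice(Double.BYTES, 4L * Double.BYTES);
        add(src, dst, 4);
        return data;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(disjointExample()));    // [11.0, 22.0, 33.0, 44.0]
        System.out.println(Arrays.toString(overlappingExample())); // [1.0, 2.0, 3.0, 4.0, 5.0]
    }
}
```

<p>In the overlapping case the scalar loop produces a running sum, whereas loading all four source elements up front would yield 1, 2, 2, 2, 2. A fill-style loop has no second segment to alias with, so the question does not arise there.</p>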
<p>We hope to make some progress on this
issue, as that would allow third-party routines on memory
segments to benefit from vectorization too, without
needing an intrinsic in the JDK.</p>
<p>Maurizio<br>
</p>
<p><br>
</p>
<div>On 30/09/2024 13:04, Antoine Chambille wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Hello everyone,<br>
<br>
I've rebuilt the latest OpenJDK (24) from <a href="https://github.com/openjdk/panama-vector" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/panama-vector</a>
and rerun the array addition benchmark:<br>
<br>
<font face="monospace">AddBenchmark<br>
.scalarArrayArray thrpt 5 6487636
ops/s<br>
.scalarArrayArrayLongStride thrpt 5 1001515
ops/s<br>
.scalarSegmentArray thrpt 5 1747531
ops/s<br>
.scalarSegmentSegment thrpt 5 1154193
ops/s<br>
.scalarUnsafeArray thrpt 5 6970073
ops/s<br>
.scalarUnsafeUnsafe thrpt 5 1246625
ops/s<br>
.unrolledArrayArray thrpt 5 1251824
ops/s<br>
.unrolledSegmentArray thrpt 5 1694164
ops/s<br>
.unrolledUnsafeArray thrpt 5 5043685
ops/s<br>
.unrolledUnsafeUnsafe thrpt 5 1197024
ops/s<br>
.vectorArrayArray thrpt 5 7200224
ops/s<br>
.vectorArraySegment thrpt 5 7377553
ops/s<br>
.vectorSegmentArray thrpt 5 7263505
ops/s<br>
.vectorSegmentSegment thrpt 5 7143647
ops/s</font><br>
<br>
<ul>
<li>Performance using the vector API is now very
consistent and good across arrays and segments.</li>
<li>Reading from and writing to segments still
seem to disrupt auto-vectorization.
Reading with Unsafe works well, but it's marked
for removal.</li>
<li>Less importantly, manual unrolling also seems to
disrupt auto-vectorization.</li>
</ul>
<br>
<br>
Best,<br>
-Antoine<br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, Mar 26,
2024 at 5:40 PM Vladimir Ivanov &lt;<a href="mailto:vladimir.x.ivanov@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">vladimir.x.ivanov@oracle.com</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
>> Personally, I prefer to see vectorizer
handling "MoveX2Y (LoadX mem)"<br>
>> => "VectorReinterpret (LoadVector
mem)" well and then introduce rules to<br>
>> strength-reduce it to mismatched access.<br>
> <br>
> Do I understand you right that you're saying
the vector node for MoveL2D<br>
> (for instance) is VectorReinterpret so we
could vectorize the code.<br>
> <br>
> Are you then suggesting that we can
transform:<br>
> <br>
>   (VectorReinterpret (LoadVector mem))<br>
> <br>
> into:<br>
> <br>
> (LoadVector mem)<br>
> <br>
> with that LoadVector a mismatched access?<br>
<br>
Yes, but thinking more about it, the latter step
may be optional. For <br>
example, VectorReinterpret implementation on x86
is a no-op, so not much <br>
gained from folding VectorReinterpret+LoadVector
into a mismatched <br>
LoadVector.<br>
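<p>For context, a minimal Java-level sketch of a loop that would give rise to the "MoveL2D (LoadL mem)" shape discussed above. It assumes that C2 compiles <code>Double.longBitsToDouble</code> to a MoveL2D node; the class and method names are illustrative only.</p>

```java
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical.
final class ReinterpretSketch {

    // Each iteration loads a long and reinterprets its bits as a double.
    // Assumption: in C2's IR this is a LoadL feeding a MoveL2D, and a
    // vectorized form would be a LoadVector feeding a VectorReinterpret.
    static void bitsToDoubles(long[] bits, double[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = Double.longBitsToDouble(bits[i]);
        }
    }

    public static void main(String[] args) {
        long[] bits = {
            Double.doubleToRawLongBits(1.5),
            Double.doubleToRawLongBits(-2.0),
            Double.doubleToRawLongBits(0.25)
        };
        double[] out = new double[bits.length];
        bitsToDoubles(bits, out);
        System.out.println(Arrays.toString(out)); // [1.5, -2.0, 0.25]
    }
}
```

<p>The reinterpretation changes no bits, which is consistent with VectorReinterpret being a no-op on x86.</p>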
<br>
Best regards,<br>
Vladimir Ivanov<br>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</body>
</html>