<div dir="ltr">Hello,<br><br>I've run the array addition benchmark again, on JDK 25 and on a JDK 26 early-access build. It looks like the performance issues I’d been tracking for a while have been solved in JDK 26.<br><a href="https://github.com/chamb/panama-benchmarks">https://github.com/chamb/panama-benchmarks</a><br><br>Auto-vectorisation of scalar loops now seems to work when using MemorySegment, and those loops are even faster than the same loops over Java arrays or the Vector API variants. Loops with a long stride no longer prevent auto-vectorisation either.<div><br>Not sure exactly who we owe these improvements to, but it's awesome! Here's another use case where we can confidently switch from Unsafe to MemorySegment. The dream would be to see these enhancements land in JDK 25, of course...<br><br><br>JDK 25<div><br><font face="monospace">Benchmark Mode Cnt Score Error Units<br>AddBenchmark.scalarArrayArray avgt 5 167.028 ± 5.604 ns/op<br>AddBenchmark.scalarArrayArrayLongStride avgt 5 925.673 ± 37.766 ns/op<br>AddBenchmark.scalarSegmentArray avgt 5 550.540 ± 3.576 ns/op<br>AddBenchmark.scalarSegmentSegment avgt 5 548.861 ± 1.852 ns/op<br>AddBenchmark.scalarUnsafeArray avgt 5 600.489 ± 219.285 ns/op<br>AddBenchmark.scalarUnsafeUnsafe avgt 5 776.975 ± 11.601 ns/op<br>AddBenchmark.unrolledArrayArray avgt 5 863.526 ± 58.822 ns/op<br>AddBenchmark.unrolledSegmentArray avgt 5 584.230 ± 13.863 ns/op<br>AddBenchmark.unrolledUnsafeArray avgt 5 584.898 ± 15.792 ns/op<br>AddBenchmark.unrolledUnsafeUnsafe avgt 5 761.445 ± 59.935 ns/op<br>AddBenchmark.vectorArrayArray avgt 5 177.288 ± 0.653 ns/op<br>AddBenchmark.vectorArraySegment avgt 5 141.381 ± 1.211 ns/op<br>AddBenchmark.vectorSegmentArray avgt 5 141.576 ± 3.077 ns/op<br>AddBenchmark.vectorSegmentSegment avgt 5 217.639 ± 5.076 ns/op</font><br><br><br>JDK 26 b17<div><br><font face="monospace">Benchmark Mode Cnt Score Error Units<br>AddBenchmark.scalarArrayArray avgt 5 209.653 ± 5.990 ns/op<br>AddBenchmark.scalarArrayArrayLongStride avgt 5 209.948 ± 12.925 
ns/op<br><b>AddBenchmark.scalarSegmentArray avgt 5 111.790 ± 5.971 ns/op<br>AddBenchmark.scalarSegmentSegment avgt 5 136.414 ± 3.900 ns/op</b><br>AddBenchmark.scalarUnsafeArray avgt 5 657.565 ± 4.705 ns/op<br>AddBenchmark.scalarUnsafeUnsafe avgt 5 832.016 ± 210.295 ns/op<br>AddBenchmark.unrolledArrayArray avgt 5 1095.963 ± 153.910 ns/op<br>AddBenchmark.unrolledSegmentArray avgt 5 138.410 ± 11.933 ns/op<br>AddBenchmark.unrolledUnsafeArray avgt 5 685.867 ± 27.075 ns/op<br>AddBenchmark.unrolledUnsafeUnsafe avgt 5 817.802 ± 30.841 ns/op<br>AddBenchmark.vectorArrayArray avgt 5 149.027 ± 1.269 ns/op<br>AddBenchmark.vectorArraySegment avgt 5 164.590 ± 7.283 ns/op<br>AddBenchmark.vectorSegmentArray avgt 5 196.908 ± 5.610 ns/op<br>AddBenchmark.vectorSegmentSegment avgt 5 242.377 ± 5.488 ns/op</font><div><font face="monospace"><br></font></div><div><font face="monospace"><br></font></div><div><font face="monospace">Best,</font></div><div><font face="monospace">-Antoine</font></div></div></div></div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Mon, Sep 30, 2024 at 2:16 PM Antoine Chambille &lt;<a href="mailto:ach@activeviam.com">ach@activeviam.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi Maurizio, thanks for the quick response. Looking forward to it.<br><div>-Antoine</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Sep 30, 2024 at 2:11 PM Maurizio Cimadamore &lt;<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank">maurizio.cimadamore@oracle.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><u></u>
<div>
<p>Hi Antoine,<br>
Auto-vectorization on memory segments doesn't work in some cases.
This is mostly due to:</p>
<p><a href="https://bugs.openjdk.org/browse/JDK-8324751" target="_blank">https://bugs.openjdk.org/browse/JDK-8324751</a></p>
<p>That is, when working with a "source" and a "target" segment, if
the auto-vectorizer cannot prove that the two segments are
disjoint, no vectorization occurs.</p>
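<p>As a rough illustration of why disjointness matters, here is a minimal
sketch (not taken from the benchmark repository). The class
<code>SegmentAdd</code> and its methods are hypothetical names; it assumes
the finalized FFM API (JDK 22 or later). When the target segment overlaps
the source, each iteration reads what the previous one wrote, so a
block-wise vectorized version of the same loop would compute something
different.</p>

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Arrays;

// Hypothetical class for illustration only; not part of the benchmark.
class SegmentAdd {

    // Scalar loop: dst[i] = a[i] + b[i], reading and writing doubles through segments.
    static void add(MemorySegment a, MemorySegment b, MemorySegment dst, int count) {
        for (int i = 0; i < count; i++) {
            double sum = a.getAtIndex(ValueLayout.JAVA_DOUBLE, i)
                       + b.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            dst.setAtIndex(ValueLayout.JAVA_DOUBLE, i, sum);
        }
    }

    // Source and target are backed by different arrays, so they cannot alias.
    static double[] disjointDemo(int n) {
        double[] a = new double[n];
        double[] b = new double[n];
        double[] dst = new double[n];
        Arrays.fill(a, 1.0);
        Arrays.fill(b, 1.0);
        add(MemorySegment.ofArray(a), MemorySegment.ofArray(b), MemorySegment.ofArray(dst), n);
        return dst;
    }

    // Target is the source shifted by one element: a loop-carried dependency.
    static double[] overlappingDemo(int n) {
        double[] data = new double[n];
        double[] ones = new double[n - 1];
        Arrays.fill(data, 1.0);
        Arrays.fill(ones, 1.0);
        MemorySegment whole = MemorySegment.ofArray(data);
        long bytes = (long) (n - 1) * Double.BYTES;
        MemorySegment src = whole.asSlice(0, bytes);
        MemorySegment dst = whole.asSlice(Double.BYTES, bytes);
        add(src, MemorySegment.ofArray(ones), dst, n - 1);
        return data;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(disjointDemo(4)));    // [2.0, 2.0, 2.0, 2.0]
        System.out.println(Arrays.toString(overlappingDemo(5))); // [1.0, 2.0, 3.0, 4.0, 5.0]
    }
}
```

<p>In the overlapping case the scalar semantics produce a running sum. A
compiler that loaded several source elements before storing would not, which
is why it has to rule out overlap (or check for it at run time) before
vectorizing.</p>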
<p>This is an issue for operations like add, or copy, but it's not
an issue with something like MemorySegment::fill (as that method
only works on a single segment).</p>
<p>We hope to make some progress on this issue, as that would
allow third-party routines on memory segments to benefit from
vectorization too, without the need for a dedicated intrinsic in the JDK.</p>
<p>Maurizio<br>
</p>
<p><br>
</p>
<div>On 30/09/2024 13:04, Antoine Chambille
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Hello everyone,<br>
<br>
I've rebuilt the latest OpenJDK (24) from <a href="https://github.com/openjdk/panama-vector" target="_blank">https://github.com/openjdk/panama-vector</a>
and run the array addition benchmark again:<br>
<br>
<font face="monospace">AddBenchmark<br>
.scalarArrayArray thrpt 5 6487636 ops/s<br>
.scalarArrayArrayLongStride thrpt 5 1001515 ops/s<br>
.scalarSegmentArray thrpt 5 1747531 ops/s<br>
.scalarSegmentSegment thrpt 5 1154193 ops/s<br>
.scalarUnsafeArray thrpt 5 6970073 ops/s<br>
.scalarUnsafeUnsafe thrpt 5 1246625 ops/s<br>
.unrolledArrayArray thrpt 5 1251824 ops/s<br>
.unrolledSegmentArray thrpt 5 1694164 ops/s<br>
.unrolledUnsafeArray thrpt 5 5043685 ops/s<br>
.unrolledUnsafeUnsafe thrpt 5 1197024 ops/s<br>
.vectorArrayArray thrpt 5 7200224 ops/s<br>
.vectorArraySegment thrpt 5 7377553 ops/s<br>
.vectorSegmentArray thrpt 5 7263505 ops/s<br>
.vectorSegmentSegment thrpt 5 7143647 ops/s</font><br>
<br>
<ul>
<li>Performance using the vector API is now very consistent
and good across arrays and segments.</li>
<li>Reading and writing from/to segments still seems to be
disrupting auto-vectorization. Reading with Unsafe works
well but it's marked for removal.</li>
<li>Less importantly, manual unrolling also seems to
disrupt auto-vectorization.</li>
</ul>
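<br>
For readers who have not opened the repository, the scalar and manually unrolled variants can be pictured roughly as follows. This is only a sketch: the class name <code>UnrolledAdd</code>, the <code>double</code> element type and the unroll factor of 4 are assumptions, not copied from the benchmark.<br>

```java
// Hypothetical class for illustration; the real benchmark code may differ.
class UnrolledAdd {

    // Plain scalar loop, the shape the auto-vectorizer handles best.
    static void scalarAdd(double[] a, double[] b, double[] dst) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = a[i] + b[i];
        }
    }

    // Same computation, manually unrolled by 4 (assumed factor) with a scalar tail.
    static void unrolledAdd(double[] a, double[] b, double[] dst) {
        int i = 0;
        int limit = dst.length - (dst.length % 4);
        for (; i < limit; i += 4) {
            dst[i]     = a[i]     + b[i];
            dst[i + 1] = a[i + 1] + b[i + 1];
            dst[i + 2] = a[i + 2] + b[i + 2];
            dst[i + 3] = a[i + 3] + b[i + 3];
        }
        // Remaining 0 to 3 elements.
        for (; i < dst.length; i++) {
            dst[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        double[] a = new double[10];
        double[] b = new double[10];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
            b[i] = 2.0 * i;
        }
        double[] viaScalar = new double[10];
        double[] viaUnrolled = new double[10];
        scalarAdd(a, b, viaScalar);
        unrolledAdd(a, b, viaUnrolled);
        System.out.println(java.util.Arrays.equals(viaScalar, viaUnrolled));
    }
}
```

Both methods compute the same result; the only difference is the loop shape presented to the JIT compiler.<br>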
<br>
<br>
Best,<br>
-Antoine<br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, Mar 26, 2024 at
5:40 PM Vladimir Ivanov &lt;<a href="mailto:vladimir.x.ivanov@oracle.com" target="_blank">vladimir.x.ivanov@oracle.com</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
>> Personally, I prefer to see vectorizer handling
"MoveX2Y (LoadX mem)"<br>
>> => "VectorReinterpret (LoadVector mem)" well and
then introduce rules to<br>
>> strength-reduce it to mismatched access.<br>
> <br>
> Do I understand you right that you're saying the vector
node for MoveL2D<br>
> (for instance) is VectorReinterpret, so we could vectorize
the code?<br>
> <br>
> Are you then suggesting that we can transform:<br>
> <br>
> (VectorReinterpret (LoadVector mem))<br>
> <br>
> into:<br>
> <br>
> (LoadVector mem)<br>
> <br>
> with that LoadVector a mismatched access?<br>
<br>
Yes, but thinking more about it, the latter step may be
optional. For <br>
example, the VectorReinterpret implementation on x86 is a no-op,
so not much <br>
is gained from folding VectorReinterpret+LoadVector into a
mismatched <br>
LoadVector.<br>
<br>
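As a concrete picture of the "MoveX2Y (LoadX mem)" shape, here is a small sketch of the kind of Java loop that, as far as I know, gives rise to it: a long is loaded and reinterpreted as a double via <code>Double.longBitsToDouble</code>. The class <code>ReinterpretAdd</code> and its method are hypothetical, and the mapping to MoveL2D is my understanding of C2 rather than something shown in this thread.<br>

```java
// Hypothetical class for illustration only.
class ReinterpretAdd {

    // Each element is loaded as a long (LoadL) and reinterpreted as a double
    // (assumed to become MoveL2D in C2) before the floating-point add.
    static double[] addFromBits(long[] bits, double[] b) {
        double[] dst = new double[bits.length];
        for (int i = 0; i < bits.length; i++) {
            dst[i] = Double.longBitsToDouble(bits[i]) + b[i];
        }
        return dst;
    }

    public static void main(String[] args) {
        long[] bits = {
            Double.doubleToRawLongBits(1.5),
            Double.doubleToRawLongBits(-4.0)
        };
        double[] b = {2.0, 4.0};
        double[] dst = addFromBits(bits, b);
        System.out.println(dst[0] + " " + dst[1]); // 3.5 0.0
    }
}
```

Vectorizing this loop would mean loading a vector of longs and reinterpreting it as a vector of doubles, which is the "VectorReinterpret (LoadVector mem)" form discussed above.<br>
<br>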
Best regards,<br>
Vladimir Ivanov<br>
</blockquote>
</div>
</blockquote>
</div>
</blockquote></div>
</blockquote></div>