<div dir="ltr">Hi Maurizio, thanks for the quick response. Looking forward to it.<br><div>-Antoine</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Sep 30, 2024 at 2:11 PM Maurizio Cimadamore &lt;<a href="mailto:maurizio.cimadamore@oracle.com">maurizio.cimadamore@oracle.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><u></u>
<div>
<p>Hi Antoine,<br>
Auto-vectorization on memory segments still doesn't work in some
cases. This is mostly due to:</p>
<p><a href="https://bugs.openjdk.org/browse/JDK-8324751" target="_blank">https://bugs.openjdk.org/browse/JDK-8324751</a></p>
<p>That is, when working with a "source" and a "target" segment, if
the auto-vectorizer cannot prove that the two segments are
disjoint, no vectorization occurs.</p>
<p>This affects operations such as add or copy, but not something
like MemorySegment::fill, since that method only works on a
single segment.</p>
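<p>To make the distinction concrete, here is a minimal sketch of the two shapes described above: a scalar add over a "source" and a "target" segment, and a single-segment fill. The class and method names are hypothetical and are not taken from the JDK or from the benchmark. It only shows the code shape, not whether C2 vectorizes it on a given build.</p>

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration; names are assumptions, not JDK or benchmark code.
final class SegmentAddSketch {

    // Two-segment loop: dst[i] += src[i]. The method receives two independent
    // MemorySegment arguments, so nothing in the code states that they are disjoint.
    static void add(MemorySegment dst, MemorySegment src, int count) {
        for (int i = 0; i < count; i++) {
            double sum = dst.getAtIndex(ValueLayout.JAVA_DOUBLE, i)
                       + src.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            dst.setAtIndex(ValueLayout.JAVA_DOUBLE, i, sum);
        }
    }

    // Single-segment operation: there is no second segment to alias with.
    static void zero(MemorySegment segment) {
        segment.fill((byte) 0);
    }

    // Allocates two native segments, adds src into dst twice, returns dst as an array.
    static double[] demo(int count) {
        try (Arena arena = Arena.ofConfined()) {
            long bytes = (long) count * Double.BYTES;
            MemorySegment dst = arena.allocate(bytes, Double.BYTES);
            MemorySegment src = arena.allocate(bytes, Double.BYTES);
            zero(dst);
            for (int i = 0; i < count; i++) {
                src.setAtIndex(ValueLayout.JAVA_DOUBLE, i, (double) i);
            }
            add(dst, src, count);
            add(dst, src, count);
            return dst.toArray(ValueLayout.JAVA_DOUBLE);
        }
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(demo(4)));
    }
}
```

<p>The sketch assumes a JDK where java.lang.foreign is final (22 or later).</p>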
<p>We hope to make some progress on this issue, as that will allow
third-party routines on memory segments to benefit from
vectorization too, without needing an intrinsic in the JDK.</p>
<p>Maurizio<br>
</p>
<div>On 30/09/2024 13:04, Antoine Chambille
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Hello everyone,<br>
<br>
I've rebuilt the latest OpenJDK (24) from <a href="https://github.com/openjdk/panama-vector" target="_blank">https://github.com/openjdk/panama-vector</a>
and re-ran the array addition benchmark:<br>
<br>
<font face="monospace">AddBenchmark<br>
.scalarArrayArray thrpt 5 6487636 ops/s<br>
.scalarArrayArrayLongStride thrpt 5 1001515 ops/s<br>
.scalarSegmentArray thrpt 5 1747531 ops/s<br>
.scalarSegmentSegment thrpt 5 1154193 ops/s<br>
.scalarUnsafeArray thrpt 5 6970073 ops/s<br>
.scalarUnsafeUnsafe thrpt 5 1246625 ops/s<br>
.unrolledArrayArray thrpt 5 1251824 ops/s<br>
.unrolledSegmentArray thrpt 5 1694164 ops/s<br>
.unrolledUnsafeArray thrpt 5 5043685 ops/s<br>
.unrolledUnsafeUnsafe thrpt 5 1197024 ops/s<br>
.vectorArrayArray thrpt 5 7200224 ops/s<br>
.vectorArraySegment thrpt 5 7377553 ops/s<br>
.vectorSegmentArray thrpt 5 7263505 ops/s<br>
.vectorSegmentSegment thrpt 5 7143647 ops/s</font><br>
<br>
<ul>
<li>Performance using the vector API is now very consistent
and good across arrays and segments.</li>
<li>Reading from and writing to segments still seem to
disrupt auto-vectorization. Reading with Unsafe works
well, but it's marked for removal.</li>
<li>Less importantly, manual unrolling also seems to
disrupt auto-vectorization.</li>
</ul>
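<br>
For readers without the benchmark at hand, the scalar and manually unrolled array variants can be pictured roughly as follows. This is a hypothetical sketch: the class name, method names, and the unroll factor of four are assumptions and are not the actual AddBenchmark source.<br>

```java
// Hypothetical illustration of the two loop shapes; not the benchmark code.
final class ArrayAddSketch {

    // Simple counted loop: dst[i] += src[i].
    static void scalarAdd(double[] dst, double[] src) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] += src[i];
        }
    }

    // Same operation, unrolled by hand four elements at a time (factor assumed).
    static void unrolledAdd(double[] dst, double[] src) {
        int i = 0;
        int bound = dst.length - (dst.length % 4);
        for (; i < bound; i += 4) {
            dst[i] += src[i];
            dst[i + 1] += src[i + 1];
            dst[i + 2] += src[i + 2];
            dst[i + 3] += src[i + 3];
        }
        // Tail loop for the remaining 0 to 3 elements.
        for (; i < dst.length; i++) {
            dst[i] += src[i];
        }
    }
}
```

Both methods compute the same result; they differ only in the loop shape handed to the JIT compiler.<br>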
<br>
<br>
Best,<br>
-Antoine<br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Tue, Mar 26, 2024 at
5:40 PM Vladimir Ivanov &lt;<a href="mailto:vladimir.x.ivanov@oracle.com" target="_blank">vladimir.x.ivanov@oracle.com</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
>> Personally, I prefer to see vectorizer handling
"MoveX2Y (LoadX mem)"<br>
>> => "VectorReinterpret (LoadVector mem)" well and
then introduce rules to<br>
>> strength-reduce it to mismatched access.<br>
> <br>
> Do I understand you right that you're saying the vector
node for MoveL2D<br>
> (for instance) is VectorReinterpret so we could vectorize
the code.<br>
> <br>
> Are you then suggesting that we can transform:<br>
> <br>
> (VectorReinterpret (LoadVector mem))<br>
> <br>
> into:<br>
> <br>
> (LoadVector mem)<br>
> <br>
> with that LoadVector a mismatched access?<br>
<br>
Yes, but thinking more about it, the latter step may be
optional. For <br>
example, VectorReinterpret implementation on x86 is a no-op,
so not much <br>
gained from folding VectorReinterpret+LoadVector into a
mismatched <br>
LoadVector.<br>
<br>
Best regards,<br>
Vladimir Ivanov<br>
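<br>
As a source-level illustration of the two forms discussed above: to my understanding, a load followed by Double.longBitsToDouble is the kind of code that gives rise to "MoveL2D (LoadL mem)", while reading a double directly out of long[] storage is a mismatched access. The sketch below is hypothetical (class and method names are assumptions) and only shows that both forms yield the same values, not how C2 compiles them.<br>

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration; names are assumptions, not JDK code.
final class ReinterpretSketch {

    // Load a long, then reinterpret its bits as a double.
    static void viaBitCast(long[] bits, double[] out) {
        for (int i = 0; i < bits.length; i++) {
            out[i] = Double.longBitsToDouble(bits[i]);
        }
    }

    // Mismatched access: read doubles directly from memory typed as long[].
    static void viaMismatchedRead(long[] bits, double[] out) {
        MemorySegment view = MemorySegment.ofArray(bits);
        for (int i = 0; i < bits.length; i++) {
            out[i] = view.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
        }
    }
}
```

The sketch assumes a JDK where java.lang.foreign is final (22 or later).<br>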
</blockquote>
</div>
</blockquote>
</div>
</blockquote></div>