<div dir="ltr">Hello,<br><br>I've run the array addition benchmark again, on JDK 25 and JDK 26 (early access). It looks like the performance issues I’d been tracking for a while have been resolved in JDK 26.<br><a href="https://github.com/chamb/panama-benchmarks">https://github.com/chamb/panama-benchmarks</a><br><br>Auto-vectorisation of scalar loops now seems to work when using MemorySegment, and is even faster than with Java arrays or the Vector API. Also, loops with long stride no longer prevent auto-vectorisation.<div><br>Not sure exactly who we owe these improvements to, but it's awesome! Here's another use case where we can confidently switch from Unsafe to MemorySegment. The dream would be to see these enhancements land in JDK 25, of course...<br><br><br>JDK 25<div><br><font face="monospace">Benchmark                                Mode  Cnt     Score     Error  Units<br>AddBenchmark.scalarArrayArray            avgt    5   167.028 ±   5.604  ns/op<br>AddBenchmark.scalarArrayArrayLongStride  avgt    5   925.673 ±  37.766  ns/op<br>AddBenchmark.scalarSegmentArray          avgt    5   550.540 ±   3.576  ns/op<br>AddBenchmark.scalarSegmentSegment        avgt    5   548.861 ±   1.852  ns/op<br>AddBenchmark.scalarUnsafeArray           avgt    5   600.489 ± 219.285  ns/op<br>AddBenchmark.scalarUnsafeUnsafe          avgt    5   776.975 ±  11.601  ns/op<br>AddBenchmark.unrolledArrayArray          avgt    5   863.526 ±  58.822  ns/op<br>AddBenchmark.unrolledSegmentArray        avgt    5   584.230 ±  13.863  ns/op<br>AddBenchmark.unrolledUnsafeArray         avgt    5   584.898 ±  15.792  ns/op<br>AddBenchmark.unrolledUnsafeUnsafe        avgt    5   761.445 ±  59.935  ns/op<br>AddBenchmark.vectorArrayArray            avgt    5   177.288 ±   0.653  ns/op<br>AddBenchmark.vectorArraySegment          avgt    5   141.381 ±   1.211  ns/op<br>AddBenchmark.vectorSegmentArray          avgt    5   141.576 ±   3.077  ns/op<br>AddBenchmark.vectorSegmentSegment        avgt    5   217.639 ±   5.076  
ns/op</font><br><br><br>JDK 26 b17<div><br><font face="monospace">Benchmark                                Mode  Cnt     Score     Error  Units<br>AddBenchmark.scalarArrayArray            avgt    5   209.653 ±   5.990  ns/op<br>AddBenchmark.scalarArrayArrayLongStride  avgt    5   209.948 ±  12.925  ns/op<br><b>AddBenchmark.scalarSegmentArray          avgt    5   111.790 ±   5.971  ns/op<br>AddBenchmark.scalarSegmentSegment        avgt    5   136.414 ±   3.900  ns/op</b><br>AddBenchmark.scalarUnsafeArray           avgt    5   657.565 ±   4.705  ns/op<br>AddBenchmark.scalarUnsafeUnsafe          avgt    5   832.016 ± 210.295  ns/op<br>AddBenchmark.unrolledArrayArray          avgt    5  1095.963 ± 153.910  ns/op<br>AddBenchmark.unrolledSegmentArray        avgt    5   138.410 ±  11.933  ns/op<br>AddBenchmark.unrolledUnsafeArray         avgt    5   685.867 ±  27.075  ns/op<br>AddBenchmark.unrolledUnsafeUnsafe        avgt    5   817.802 ±  30.841  ns/op<br>AddBenchmark.vectorArrayArray            avgt    5   149.027 ±   1.269  ns/op<br>AddBenchmark.vectorArraySegment          avgt    5   164.590 ±   7.283  ns/op<br>AddBenchmark.vectorSegmentArray          avgt    5   196.908 ±   5.610  ns/op<br>AddBenchmark.vectorSegmentSegment        avgt    5   242.377 ±   5.488  ns/op</font><div><font face="monospace"><br></font></div><div><font face="monospace"><br></font></div><div><font face="monospace">Best,</font></div><div><font face="monospace">-Antoine</font></div></div></div></div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Mon, Sep 30, 2024 at 2:16 PM Antoine Chambille <<a href="mailto:ach@activeviam.com">ach@activeviam.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Hi Maurizio, thanks for the quick response. 
Looking forward to it.<br><div>-Antoine</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Sep 30, 2024 at 2:11 PM Maurizio Cimadamore <<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank">maurizio.cimadamore@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><u></u>

  <div>

    <p>Hi Antoine,<br>

      Auto-vectorization on memory segments doesn't work in some cases.

      This issue is mostly due to:</p>

    <p><a href="https://bugs.openjdk.org/browse/JDK-8324751" target="_blank">https://bugs.openjdk.org/browse/JDK-8324751</a></p>

    <p>That is, when working with a "source" and a "target" segment, if

      the auto-vectorizer cannot prove that the two segments are

      disjoint, no vectorization occurs.</p>
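    <p>To make the disjointness point concrete, here is a minimal sketch
      (not the benchmark's code; the class and method names are made up, and
      heap segments and <code>double</code> elements are assumptions chosen
      for brevity). The same scalar loop gives a different kind of result
      when the target overlaps the source, because each iteration then reads
      a value written by the previous one. Loading several elements at once
      would not preserve that, which is presumably why the vectorizer has to
      rule overlap out first.</p>

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration class, not part of the benchmark repository.
class SegmentAliasSketch {

    // Scalar element-wise add: target[i] = source[i] + target[i].
    static void add(MemorySegment source, MemorySegment target, int count) {
        for (int i = 0; i < count; i++) {
            double s = source.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            double t = target.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            target.setAtIndex(ValueLayout.JAVA_DOUBLE, i, s + t);
        }
    }

    // Two segments backed by separate arrays: they cannot overlap.
    static double[] disjoint() {
        MemorySegment source = MemorySegment.ofArray(new double[] {1, 1, 1, 1});
        MemorySegment target = MemorySegment.ofArray(new double[] {1, 1, 1, 1});
        add(source, target, 4);
        return target.toArray(ValueLayout.JAVA_DOUBLE);
    }

    // Two slices of the same segment, shifted by one element: element i of
    // the target is element i + 1 of the source, so every iteration depends
    // on the store made by the previous iteration.
    static double[] overlapping() {
        MemorySegment whole = MemorySegment.ofArray(new double[] {1, 1, 1, 1, 1});
        MemorySegment source = whole.asSlice(0, 4L * Double.BYTES);
        MemorySegment target = whole.asSlice(Double.BYTES, 4L * Double.BYTES);
        add(source, target, 4);
        return whole.toArray(ValueLayout.JAVA_DOUBLE);
    }
}
```

    <p>With these inputs, <code>disjoint()</code> yields four 2.0 values,
      while <code>overlapping()</code> turns the backing array into a running
      sum (1, 2, 3, 4, 5).</p>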

    <p>This is an issue for operations like add, or copy, but it's not

      an issue with something like MemorySegment::fill (as that method

      only works on a single segment).</p>
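    <p>For contrast, a small sketch of the single-segment case (class and
      method names are hypothetical, and a heap segment is used only to keep
      the example short). There is one segment and no second operand, so the
      question of overlap never comes up.</p>

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration class.
class SegmentFillSketch {

    // Fills every byte of a single segment, then reads it back as longs.
    static long[] filled(int count) {
        MemorySegment segment = MemorySegment.ofArray(new long[count]);
        segment.fill((byte) 0xFF); // only one segment is involved
        return segment.toArray(ValueLayout.JAVA_LONG);
    }
}
```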

    <p>We hope to be able to make some progress on this issue, as that

      will allow third-party routines on memory segments to enjoy

      vectorization too, without the need for an intrinsic in the JDK.</p>

    <p>Maurizio<br>

    </p>

    <p><br>

    </p>


    <div>On 30/09/2024 13:04, Antoine Chambille

      wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">Hello everyone,<br>

        <br>

        I've rebuilt the latest OpenJDK (24) from <a href="https://github.com/openjdk/panama-vector" target="_blank">https://github.com/openjdk/panama-vector</a>

        and rerun the array addition benchmark:<br>

        <br>

        <font face="monospace">AddBenchmark<br>

           .scalarArrayArray            thrpt    5   6487636 ops/s<br>

           .scalarArrayArrayLongStride  thrpt    5   1001515 ops/s<br>

           .scalarSegmentArray          thrpt    5   1747531 ops/s<br>

           .scalarSegmentSegment        thrpt    5   1154193 ops/s<br>

           .scalarUnsafeArray           thrpt    5   6970073 ops/s<br>

           .scalarUnsafeUnsafe          thrpt    5   1246625 ops/s<br>

           .unrolledArrayArray          thrpt    5   1251824 ops/s<br>

           .unrolledSegmentArray        thrpt    5   1694164 ops/s<br>

           .unrolledUnsafeArray         thrpt    5   5043685 ops/s<br>

           .unrolledUnsafeUnsafe        thrpt    5   1197024 ops/s<br>

           .vectorArrayArray            thrpt    5   7200224 ops/s<br>

           .vectorArraySegment          thrpt    5   7377553 ops/s<br>

           .vectorSegmentArray          thrpt    5   7263505 ops/s<br>

           .vectorSegmentSegment        thrpt    5   7143647 ops/s</font><br>

        <br>

        <ul>

          <li>Performance using the vector API is now very consistent

            and good across arrays and segments.</li>

          <li>Reading and writing from/to segments still seems to be

            disrupting auto-vectorization. Reading with Unsafe works

            well but it's marked for removal.</li>

          <li>Less important, manual unrolling also seems to be

            disrupting auto-vectorization.</li>

        </ul>
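        <p>For readers who have not opened the benchmark, this is roughly
          what the "scalar" and "unrolled" array variants refer to. It is a
          sketch under assumptions, not the benchmark source: the names, the
          <code>double</code> element type and the unroll factor of 4 are
          all illustrative.</p>

```java
// Hypothetical illustration class, not the benchmark's code.
class UnrollSketch {

    // Plain scalar loop: the shape the auto-vectorizer handles best.
    static void scalarAdd(double[] a, double[] b, double[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    // Same computation, manually unrolled by 4 with a scalar tail loop.
    static void unrolledAdd(double[] a, double[] b, double[] out) {
        int i = 0;
        int bound = out.length - (out.length % 4);
        for (; i < bound; i += 4) {
            out[i] = a[i] + b[i];
            out[i + 1] = a[i + 1] + b[i + 1];
            out[i + 2] = a[i + 2] + b[i + 2];
            out[i + 3] = a[i + 3] + b[i + 3];
        }
        for (; i < out.length; i++) { // leftover elements
            out[i] = a[i] + b[i];
        }
    }
}
```

        <p>Both methods compute the same result; the only difference is the
          loop shape presented to the JIT compiler.</p>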

        <br>

        <br>

        Best,<br>

        -Antoine<br>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Tue, Mar 26, 2024 at

          5:40 PM Vladimir Ivanov <<a href="mailto:vladimir.x.ivanov@oracle.com" target="_blank">vladimir.x.ivanov@oracle.com</a>>

          wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

          >> Personally, I prefer to see vectorizer handling

          "MoveX2Y (LoadX mem)"<br>

          >> => "VectorReinterpret (LoadVector mem)" well and

          then introduce rules to<br>

          >> strength-reduce it to mismatched access.<br>

          > <br>

          > Do I understand you right that you're saying the vector

          node for MoveL2D<br>

          > (for instance) is VectorReinterpret so we could vectorize

          the code.<br>

          > <br>

          > Are you then suggesting that we can transform:<br>

          > <br>

          > (VectorReinterpret (LoadVector mem))<br>

          > <br>

          > into:<br>

          > <br>

          > (LoadVector mem)<br>

          > <br>

          > with that LoadVector a mismatched access?<br>

          <br>

          Yes, but thinking more about it, the latter step may be

          optional. For <br>

          example, VectorReinterpret implementation on x86 is a no-op,

          so not much <br>

          gained from folding VectorReinterpret+LoadVector into a

          mismatched <br>

          LoadVector.<br>

          <br>

          Best regards,<br>

          Vladimir Ivanov<br>

        </blockquote>
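        <p>As a point of reference for the exchange above, here is a sketch
          of the kind of Java loop that would produce the "MoveL2D (LoadL
          mem)" shape. This assumes that C2 represents
          <code>Double.longBitsToDouble</code> as a MoveL2D node, which is my
          understanding but is not stated in the thread; the class and
          method names are hypothetical.</p>

```java
// Hypothetical illustration class.
class ReinterpretSketch {

    // Each iteration loads a long and reinterprets its bits as a double.
    // No numeric conversion happens: the 64 bits are kept as they are.
    static void bitsToDoubles(long[] bits, double[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = Double.longBitsToDouble(bits[i]);
        }
    }
}
```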

      </div>

    </blockquote>

  </div>

</blockquote></div>

</blockquote></div>