<!DOCTYPE html><html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body>

    <p>Hi Antoine,<br>

      Thanks for the reply. All credit here goes to Emanuel (cc'ed). I

      believe the main issues with memory segments and autovectorization

      were fixed as part of this:</p>

    <p><a class="moz-txt-link-freetext" href="https://bugs.openjdk.org/browse/JDK-8324751">https://bugs.openjdk.org/browse/JDK-8324751</a></p>

    <p>You might also want to watch his great JVMLS talk:</p>

    <p><a class="moz-txt-link-freetext" href="https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/">https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/</a></p>

    <p>Cheers<br>

      Maurizio<br>

    </p>

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 29/09/2025 10:11, Antoine Chambille

      wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:CAJGQDwmPbKX-9JWu9f=0Zf+G1+B9NC+1LETQ7aSK3njoX96+eA@mail.gmail.com">

      <div dir="ltr">Hello,<br>

        <br>

        I've run the array addition benchmark again, on JDK 25 and
        JDK 26 EA. It looks like the performance issues I’d been tracking
        for a while have been solved in JDK 26.<br>

        <a href="https://github.com/chamb/panama-benchmarks" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/chamb/panama-benchmarks</a><br>

        <br>

        Auto-vectorisation of scalar loops now seems to work when using
        MemorySegment, and is even faster than with Java arrays or the
        Vector API. Also, loops with a long stride no longer prevent
        auto-vectorisation.

        <div><br>

          Not sure exactly who we owe these improvements to, but it's

          awesome! Here's another use case where we can confidently

          switch from Unsafe to MemorySegment. The dream would be to see

          these enhancements land in JDK 25, of course...<br>

          <br>

          <br>

          JDK 25

          <div><br>

            <font face="monospace">Benchmark                            

                 Mode  Cnt     Score     Error  Units<br>

              AddBenchmark.scalarArrayArray            avgt    5  

              167.028 ±   5.604  ns/op<br>

              AddBenchmark.scalarArrayArrayLongStride  avgt    5  

              925.673 ±  37.766  ns/op<br>

              AddBenchmark.scalarSegmentArray          avgt    5  

              550.540 ±   3.576  ns/op<br>

              AddBenchmark.scalarSegmentSegment        avgt    5  

              548.861 ±   1.852  ns/op<br>

              AddBenchmark.scalarUnsafeArray           avgt    5  

              600.489 ± 219.285  ns/op<br>

              AddBenchmark.scalarUnsafeUnsafe          avgt    5  

              776.975 ±  11.601  ns/op<br>

              AddBenchmark.unrolledArrayArray          avgt    5  

              863.526 ±  58.822  ns/op<br>

              AddBenchmark.unrolledSegmentArray        avgt    5  

              584.230 ±  13.863  ns/op<br>

              AddBenchmark.unrolledUnsafeArray         avgt    5  

              584.898 ±  15.792  ns/op<br>

              AddBenchmark.unrolledUnsafeUnsafe        avgt    5  

              761.445 ±  59.935  ns/op<br>

              AddBenchmark.vectorArrayArray            avgt    5  

              177.288 ±   0.653  ns/op<br>

              AddBenchmark.vectorArraySegment          avgt    5  

              141.381 ±   1.211  ns/op<br>

              AddBenchmark.vectorSegmentArray          avgt    5  

              141.576 ±   3.077  ns/op<br>

              AddBenchmark.vectorSegmentSegment        avgt    5  

              217.639 ±   5.076  ns/op</font><br>

            <br>

            <br>

            JDK 26 b17

            <div><br>

              <font face="monospace">Benchmark                          

                     Mode  Cnt     Score     Error  Units<br>

                AddBenchmark.scalarArrayArray            avgt    5  

                209.653 ±   5.990  ns/op<br>

                AddBenchmark.scalarArrayArrayLongStride  avgt    5  

                209.948 ±  12.925  ns/op<br>

                <b>AddBenchmark.scalarSegmentArray          avgt    5  

                  111.790 ±   5.971  ns/op<br>

                  AddBenchmark.scalarSegmentSegment        avgt    5  

                  136.414 ±   3.900  ns/op</b><br>

                AddBenchmark.scalarUnsafeArray           avgt    5  

                657.565 ±   4.705  ns/op<br>

                AddBenchmark.scalarUnsafeUnsafe          avgt    5  

                832.016 ± 210.295  ns/op<br>

                AddBenchmark.unrolledArrayArray          avgt    5

                 1095.963 ± 153.910  ns/op<br>

                AddBenchmark.unrolledSegmentArray        avgt    5  

                138.410 ±  11.933  ns/op<br>

                AddBenchmark.unrolledUnsafeArray         avgt    5  

                685.867 ±  27.075  ns/op<br>

                AddBenchmark.unrolledUnsafeUnsafe        avgt    5  

                817.802 ±  30.841  ns/op<br>

                AddBenchmark.vectorArrayArray            avgt    5  

                149.027 ±   1.269  ns/op<br>

                AddBenchmark.vectorArraySegment          avgt    5  

                164.590 ±   7.283  ns/op<br>

                AddBenchmark.vectorSegmentArray          avgt    5  

                196.908 ±   5.610  ns/op<br>

                AddBenchmark.vectorSegmentSegment        avgt    5  

                242.377 ±   5.488  ns/op</font>


              <div><font face="monospace">Best,</font></div>

              <div><font face="monospace">-Antoine</font></div>

            </div>

          </div>

        </div>

      </div>

      <br>

      <div class="gmail_quote gmail_quote_container">

        <div dir="ltr" class="gmail_attr">On Mon, Sep 30, 2024 at

          2:16 PM Antoine Chambille &lt;<a href="mailto:ach@activeviam.com" moz-do-not-send="true" class="moz-txt-link-freetext">ach@activeviam.com</a>&gt;

          wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

          <div dir="ltr">Hi Maurizio, thanks for the quick response.

            Looking forward to it.<br>

            <div>-Antoine</div>

          </div>

          <br>

          <div class="gmail_quote">

            <div dir="ltr" class="gmail_attr">On Mon, Sep 30, 2024 at

              2:11 PM Maurizio Cimadamore &lt;<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">maurizio.cimadamore@oracle.com</a>&gt;

              wrote:<br>

            </div>

            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

              <div>

                <p>Hi Antoine,<br>

                  auto-vectorization on memory segments doesn't work in

                  some cases. This issue is mostly due to:</p>

                <p><a href="https://bugs.openjdk.org/browse/JDK-8324751" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://bugs.openjdk.org/browse/JDK-8324751</a></p>

                <p>That is, when working with a "source" and a "target"

                  segment, if the auto-vectorizer cannot prove that the

                  two segments are disjoint, no vectorization occurs.</p>
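                <p>A minimal sketch of the situation described above,
                  assuming JDK 22 or later and double elements. The class
                  and method names are hypothetical and are not taken from
                  the benchmark. The same scalar loop is only safe to
                  vectorize when the two segments do not overlap; with
                  overlapping slices, each store feeds the next load:</p>

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Arrays;

class SegmentAddSketch {

    // Scalar loop: target[i] = target[i] + source[i], element by element.
    static void add(MemorySegment source, MemorySegment target) {
        long n = Math.min(source.byteSize(), target.byteSize()) / Double.BYTES;
        for (long i = 0; i < n; i++) {
            double sum = target.getAtIndex(ValueLayout.JAVA_DOUBLE, i)
                       + source.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            target.setAtIndex(ValueLayout.JAVA_DOUBLE, i, sum);
        }
    }

    // True when the two segments share at least one byte of memory.
    static boolean overlaps(MemorySegment a, MemorySegment b) {
        return a.asOverlappingSlice(b).isPresent();
    }

    public static void main(String[] args) {
        double[] backing = {1, 1, 1, 1, 1};
        MemorySegment whole = MemorySegment.ofArray(backing);
        // Two slices of the same array, shifted by one element.
        MemorySegment source = whole.asSlice(0, 4 * Double.BYTES);
        MemorySegment target = whole.asSlice(Double.BYTES, 4 * Double.BYTES);

        System.out.println(overlaps(source, target)); // true
        add(source, target);
        // Each iteration reads the value written by the previous one.
        System.out.println(Arrays.toString(backing)); // [1.0, 2.0, 3.0, 4.0, 5.0]
    }
}
```

                <p>In the overlapping case the loop behaves like a running
                  sum, so a vectorized version would produce a different
                  result unless the overlap can be ruled out.</p>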

                <p>This is an issue for operations like add, or copy,

                  but it's not an issue with something like

                  MemorySegment::fill (as that method only works on a

                  single segment).</p>

                <p>We hope to make progress on this issue, as that would
                  let third-party routines on memory segments enjoy
                  vectorization too, without needing an intrinsic in the
                  JDK.</p>

                <p>Maurizio<br>

                </p>


                <div>On 30/09/2024 13:04, Antoine Chambille wrote:<br>

                </div>

                <blockquote type="cite">

                  <div dir="ltr">Hello everyone,<br>

                    <br>

                    I've rebuilt the latest OpenJDK (24) from <a href="https://github.com/openjdk/panama-vector" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/panama-vector</a>

                    and rerun the array addition benchmark:<br>

                    <br>

                    <font face="monospace">AddBenchmark<br>

                       .scalarArrayArray            thrpt    5   6487636

                      ops/s<br>

                       .scalarArrayArrayLongStride  thrpt    5   1001515

                      ops/s<br>

                       .scalarSegmentArray          thrpt    5   1747531

                      ops/s<br>

                       .scalarSegmentSegment        thrpt    5   1154193

                      ops/s<br>

                       .scalarUnsafeArray           thrpt    5   6970073

                      ops/s<br>

                       .scalarUnsafeUnsafe          thrpt    5   1246625

                      ops/s<br>

                       .unrolledArrayArray          thrpt    5   1251824

                      ops/s<br>

                       .unrolledSegmentArray        thrpt    5   1694164

                      ops/s<br>

                       .unrolledUnsafeArray         thrpt    5   5043685

                      ops/s<br>

                       .unrolledUnsafeUnsafe        thrpt    5   1197024

                      ops/s<br>

                       .vectorArrayArray            thrpt    5   7200224

                      ops/s<br>

                       .vectorArraySegment          thrpt    5   7377553

                      ops/s<br>

                       .vectorSegmentArray          thrpt    5   7263505

                      ops/s<br>

                       .vectorSegmentSegment        thrpt    5   7143647

                      ops/s</font><br>

                    <br>

                    <ul>

                      <li>Performance using the vector API is now very

                        consistent and good across arrays and segments.</li>

                      <li>Reading from and writing to segments still
                        seems to disrupt auto-vectorization.
                        Reading with Unsafe works well, but it is marked
                        for removal.</li>
                      <li>Less importantly, manual unrolling also seems to
                        disrupt auto-vectorization.</li>

                    </ul>
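                    <p>For readers unfamiliar with the "unrolled" variants,
                      here is a rough sketch of the two loop shapes being
                      compared, using plain double arrays. The names are
                      hypothetical and the benchmark's actual code may
                      differ, for example in its unroll factor:</p>

```java
class UnrollSketch {

    // Plain scalar loop: one element per iteration.
    // Assumes src is at least as long as dst.
    static void scalarAdd(double[] src, double[] dst) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] += src[i];
        }
    }

    // Manually unrolled by 4, followed by a scalar tail for the remainder.
    // Assumes src is at least as long as dst.
    static void unrolledAdd(double[] src, double[] dst) {
        int i = 0;
        int bound = dst.length - (dst.length % 4);
        for (; i < bound; i += 4) {
            dst[i]     += src[i];
            dst[i + 1] += src[i + 1];
            dst[i + 2] += src[i + 2];
            dst[i + 3] += src[i + 3];
        }
        for (; i < dst.length; i++) {
            dst[i] += src[i];
        }
    }
}
```

                    <p>Both methods compute the same result; the numbers
                      above only suggest that the JIT handles the first
                      shape better than the second.</p>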

                    <br>

                    <br>

                    Best,<br>

                    -Antoine<br>

                  </div>

                  <br>

                  <div class="gmail_quote">

                    <div dir="ltr" class="gmail_attr">On Tue, Mar 26,

                      2024 at 5:40 PM Vladimir Ivanov &lt;<a href="mailto:vladimir.x.ivanov@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">vladimir.x.ivanov@oracle.com</a>&gt;

                      wrote:<br>

                    </div>

                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

                      >> Personally, I prefer to see vectorizer

                      handling "MoveX2Y (LoadX mem)"<br>

                      >> => "VectorReinterpret (LoadVector

                      mem)" well and then introduce rules to<br>

                      >> strength-reduce it to mismatched access.<br>

                      > <br>

                      > Do I understand you right that you're saying

                      the vector node for MoveL2D<br>

                      > (for instance) is VectorReinterpret so we

                      could vectorize the code.<br>

                      > <br>

                      > Are you then suggesting that we can

                      transform:<br>

                      > <br>

                      > (VectorReinterpret (LoadVector mem))<br>

                      > <br>

                      > into:<br>

                      > <br>

                      > (LoadVector mem)<br>

                      > <br>

                      > with that LoadVector a mismatched access?<br>

                      <br>

                      Yes, but thinking more about it, the latter step

                      may be optional. For <br>

                      example, VectorReinterpret implementation on x86

                      is a no-op, so not much <br>

                      gained from folding VectorReinterpret+LoadVector

                      into a mismatched <br>

                      LoadVector.<br>

                      <br>

                      Best regards,<br>

                      Vladimir Ivanov<br>

                    </blockquote>
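                    <p>As an illustration of the loop shape under
                      discussion, here is a small sketch. It assumes that
                      the MoveL2D node mentioned above is what C2 produces
                      for Double.longBitsToDouble; the class and method
                      names are hypothetical:</p>

```java
class ReinterpretSketch {

    // Each iteration loads a long and reinterprets its bits as a double
    // (a bit-preserving move, not a numeric conversion) before adding.
    // Assumes bits is at least as long as dst.
    static void addFromBits(long[] bits, double[] dst) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] += Double.longBitsToDouble(bits[i]);
        }
    }
}
```

                    <p>Vectorizing this loop would mean loading a vector of
                      longs and reinterpreting it as a vector of doubles,
                      which is the "VectorReinterpret (LoadVector mem)"
                      shape described in the message above.</p>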

                  </div>

                </blockquote>

              </div>

            </blockquote>

          </div>

        </blockquote>

      </div>

    </blockquote>

  </body>

</html>