<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p>Hi Antoine,<br>
      Thanks for the reply. All credit here goes to Emanuel (cc'ed). I
      believe the main issues with memory segments and autovectorization
      were fixed as part of this:</p>
    <p><a class="moz-txt-link-freetext" href="https://bugs.openjdk.org/browse/JDK-8324751">https://bugs.openjdk.org/browse/JDK-8324751</a></p>
    <p>You might also want to watch his great JVMLS talk:</p>
    <p><a class="moz-txt-link-freetext" href="https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/">https://inside.java/2025/08/16/jvmls-hotspot-auto-vectorization/</a></p>
    <p>Cheers<br>
      Maurizio<br>
    </p>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 29/09/2025 10:11, Antoine Chambille
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:CAJGQDwmPbKX-9JWu9f=0Zf+G1+B9NC+1LETQ7aSK3njoX96+eA@mail.gmail.com">
      
      <div dir="ltr">Hello,<br>
        <br>
        I've rerun the array addition benchmark on JDK 25 and
        JDK 26 EA. It looks like the performance issues I’d been tracking
        for a while have been solved in JDK 26.<br>
        <a href="https://github.com/chamb/panama-benchmarks" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/chamb/panama-benchmarks</a><br>
        <br>
        Auto-vectorisation of scalar loops now seems to work when using
        MemorySegment, and is even faster than with Java arrays or the
        Vector API. Loops with a long stride also no longer prevent
        auto-vectorisation.
        <div><br>
          Not sure exactly who we owe these improvements to, but it's
          awesome! Here's another use case where we can confidently
          switch from Unsafe to MemorySegment. The dream would be to see
          these enhancements land in JDK 25, of course...<br>
          <br>
          <br>
          JDK 25
          <div><br>
            <font face="monospace">Benchmark                            
                 Mode  Cnt     Score     Error  Units<br>
              AddBenchmark.scalarArrayArray            avgt    5  
              167.028 ±   5.604  ns/op<br>
              AddBenchmark.scalarArrayArrayLongStride  avgt    5  
              925.673 ±  37.766  ns/op<br>
              AddBenchmark.scalarSegmentArray          avgt    5  
              550.540 ±   3.576  ns/op<br>
              AddBenchmark.scalarSegmentSegment        avgt    5  
              548.861 ±   1.852  ns/op<br>
              AddBenchmark.scalarUnsafeArray           avgt    5  
              600.489 ± 219.285  ns/op<br>
              AddBenchmark.scalarUnsafeUnsafe          avgt    5  
              776.975 ±  11.601  ns/op<br>
              AddBenchmark.unrolledArrayArray          avgt    5  
              863.526 ±  58.822  ns/op<br>
              AddBenchmark.unrolledSegmentArray        avgt    5  
              584.230 ±  13.863  ns/op<br>
              AddBenchmark.unrolledUnsafeArray         avgt    5  
              584.898 ±  15.792  ns/op<br>
              AddBenchmark.unrolledUnsafeUnsafe        avgt    5  
              761.445 ±  59.935  ns/op<br>
              AddBenchmark.vectorArrayArray            avgt    5  
              177.288 ±   0.653  ns/op<br>
              AddBenchmark.vectorArraySegment          avgt    5  
              141.381 ±   1.211  ns/op<br>
              AddBenchmark.vectorSegmentArray          avgt    5  
              141.576 ±   3.077  ns/op<br>
              AddBenchmark.vectorSegmentSegment        avgt    5  
              217.639 ±   5.076  ns/op</font><br>
            <br>
            <br>
            JDK 26 b17
            <div><br>
              <font face="monospace">Benchmark                          
                     Mode  Cnt     Score     Error  Units<br>
                AddBenchmark.scalarArrayArray            avgt    5  
                209.653 ±   5.990  ns/op<br>
                AddBenchmark.scalarArrayArrayLongStride  avgt    5  
                209.948 ±  12.925  ns/op<br>
                <b>AddBenchmark.scalarSegmentArray          avgt    5  
                  111.790 ±   5.971  ns/op<br>
                  AddBenchmark.scalarSegmentSegment        avgt    5  
                  136.414 ±   3.900  ns/op</b><br>
                AddBenchmark.scalarUnsafeArray           avgt    5  
                657.565 ±   4.705  ns/op<br>
                AddBenchmark.scalarUnsafeUnsafe          avgt    5  
                832.016 ± 210.295  ns/op<br>
                AddBenchmark.unrolledArrayArray          avgt    5
                 1095.963 ± 153.910  ns/op<br>
                AddBenchmark.unrolledSegmentArray        avgt    5  
                138.410 ±  11.933  ns/op<br>
                AddBenchmark.unrolledUnsafeArray         avgt    5  
                685.867 ±  27.075  ns/op<br>
                AddBenchmark.unrolledUnsafeUnsafe        avgt    5  
                817.802 ±  30.841  ns/op<br>
                AddBenchmark.vectorArrayArray            avgt    5  
                149.027 ±   1.269  ns/op<br>
                AddBenchmark.vectorArraySegment          avgt    5  
                164.590 ±   7.283  ns/op<br>
                AddBenchmark.vectorSegmentArray          avgt    5  
                196.908 ±   5.610  ns/op<br>
                AddBenchmark.vectorSegmentSegment        avgt    5  
                242.377 ±   5.488  ns/op</font>
              <div><font face="monospace"><br>
                </font></div>
              <div><font face="monospace"><br>
                </font></div>
              <div><font face="monospace">Best,</font></div>
              <div><font face="monospace">-Antoine</font></div>
            </div>
          </div>
        </div>
      </div>
      <br>
      <div class="gmail_quote gmail_quote_container">
        <div dir="ltr" class="gmail_attr">On Mon, Sep 30, 2024 at
          2:16 PM Antoine Chambille &lt;<a href="mailto:ach@activeviam.com" moz-do-not-send="true" class="moz-txt-link-freetext">ach@activeviam.com</a>&gt;
          wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div dir="ltr">Hi Maurizio, thanks for the quick response.
            Looking forward to it.<br>
            <div>-Antoine</div>
          </div>
          <br>
          <div class="gmail_quote">
            <div dir="ltr" class="gmail_attr">On Mon, Sep 30, 2024 at
              2:11 PM Maurizio Cimadamore &lt;<a href="mailto:maurizio.cimadamore@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">maurizio.cimadamore@oracle.com</a>&gt;
              wrote:<br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div>
                <p>Hi Antoine,<br>
                  Auto-vectorization on memory segments doesn't work in
                  some cases. This is mostly due to:</p>
                <p><a href="https://bugs.openjdk.org/browse/JDK-8324751" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://bugs.openjdk.org/browse/JDK-8324751</a></p>
                <p>That is, when working with a "source" and a "target"
                  segment, if the auto-vectorizer cannot prove that the
                  two segments are disjoint, no vectorization occurs.</p>
                <p>This is an issue for operations like add, or copy,
                  but it's not an issue with something like
                  MemorySegment::fill (as that method only works on a
                  single segment).</p>
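                <p>A minimal sketch of the kind of loop this affects, assuming Java 22 or later (where java.lang.foreign is final). The class and method names are illustrative and are not taken from the benchmark. Nothing in the signature of <code>add</code> tells the compiler that the two segments are disjoint, and the overlapping case shows that the scalar loop then produces a result that a naive "load a block, add, store a block" rewrite would not reproduce:</p>
<pre>
```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class SegmentAddSketch {

    // Scalar loop: dst[i] = dst[i] + src[i], one element at a time.
    // src and dst may or may not refer to overlapping memory.
    static void add(MemorySegment src, MemorySegment dst, long count) {
        for (long i = 0; i < count; i++) {
            double a = src.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            double b = dst.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            dst.setAtIndex(ValueLayout.JAVA_DOUBLE, i, a + b);
        }
    }

    // Disjoint case: two separate heap arrays wrapped as segments.
    static double[] disjointAdd() {
        double[] in = {1, 2, 3, 4};
        double[] out = {10, 20, 30, 40};
        add(MemorySegment.ofArray(in), MemorySegment.ofArray(out), 4);
        return out;
    }

    // Overlapping case: dst is a view of the same array, shifted by one element,
    // so each store feeds the next iteration's load from src.
    static double[] overlappingAdd() {
        double[] data = {1, 1, 1, 1, 1};
        MemorySegment whole = MemorySegment.ofArray(data);
        MemorySegment src = whole.asSlice(0, 4 * Double.BYTES);
        MemorySegment dst = whole.asSlice(Double.BYTES, 4 * Double.BYTES);
        add(src, dst, 4);
        return data;
    }
}
```
</pre>
                <p>With disjoint inputs the result is the plain element-wise sum. With the shifted view the scalar loop yields a running sum (1, 2, 3, 4, 5), so a vectorized version is only valid if the segments are known, or checked at run time, not to overlap in this way.</p>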
                <p>We hope to make some progress on this
                  issue, as that would allow third-party routines on memory
                  segments to enjoy vectorization too, without needing
                  an intrinsic in the JDK.</p>
                <p>Maurizio<br>
                </p>
                <p><br>
                </p>
                <div>On 30/09/2024 13:04, Antoine Chambille wrote:<br>
                </div>
                <blockquote type="cite">
                  <div dir="ltr">Hello everyone,<br>
                    <br>
                    I've rebuilt the latest OpenJDK (24) from <a href="https://github.com/openjdk/panama-vector" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/panama-vector</a>
                    and run the array addition benchmark again:<br>
                    <br>
                    <font face="monospace">AddBenchmark<br>
                       .scalarArrayArray            thrpt    5   6487636
                      ops/s<br>
                       .scalarArrayArrayLongStride  thrpt    5   1001515
                      ops/s<br>
                       .scalarSegmentArray          thrpt    5   1747531
                      ops/s<br>
                       .scalarSegmentSegment        thrpt    5   1154193
                      ops/s<br>
                       .scalarUnsafeArray           thrpt    5   6970073
                      ops/s<br>
                       .scalarUnsafeUnsafe          thrpt    5   1246625
                      ops/s<br>
                       .unrolledArrayArray          thrpt    5   1251824
                      ops/s<br>
                       .unrolledSegmentArray        thrpt    5   1694164
                      ops/s<br>
                       .unrolledUnsafeArray         thrpt    5   5043685
                      ops/s<br>
                       .unrolledUnsafeUnsafe        thrpt    5   1197024
                      ops/s<br>
                       .vectorArrayArray            thrpt    5   7200224
                      ops/s<br>
                       .vectorArraySegment          thrpt    5   7377553
                      ops/s<br>
                       .vectorSegmentArray          thrpt    5   7263505
                      ops/s<br>
                       .vectorSegmentSegment        thrpt    5   7143647
                      ops/s</font><br>
                    <br>
                    <ul>
                      <li>Performance using the vector API is now very
                        consistent and good across arrays and segments.</li>
                      <li>Reading and writing from/to segments still
                        seems to be disrupting auto-vectorization.
                        Reading with Unsafe works well but it's marked
                        for removal.</li>
                      <li>Less important, manual unrolling also seems to
                        be disrupting auto-vectorization.</li>
                    </ul>
                    <br>
                    <br>
                    Best,<br>
                    -Antoine<br>
                  </div>
                  <br>
                  <div class="gmail_quote">
                    <div dir="ltr" class="gmail_attr">On Tue, Mar 26,
                      2024 at 5:40 PM Vladimir Ivanov &lt;<a href="mailto:vladimir.x.ivanov@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">vladimir.x.ivanov@oracle.com</a>&gt;
                      wrote:<br>
                    </div>
                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
                      >> Personally, I prefer to see vectorizer
                      handling "MoveX2Y (LoadX mem)"<br>
                      >> => "VectorReinterpret (LoadVector
                      mem)" well and then introduce rules to<br>
                      >> strength-reduce it to mismatched access.<br>
                      > <br>
                      > Do I understand you right that you're saying
                      the vector node for MoveL2D<br>
                      > (for instance) is VectorReinterpret so we
                      could vectorize the code.<br>
                      > <br>
                      > Are you then suggesting that we can
                      transform:<br>
                      > <br>
                      > (VectorReinterpret (LoadVector mem))<br>
                      > <br>
                      > into:<br>
                      > <br>
                      > (LoadVector mem)<br>
                      > <br>
                      > with that LoadVector a mismatched access?<br>
                      <br>
                      Yes, but thinking more about it, the latter step
                      may be optional. For <br>
                      example, VectorReinterpret implementation on x86
                      is a no-op, so not much <br>
                      gained from folding VectorReinterpret+LoadVector
                      into a mismatched <br>
                      LoadVector.<br>
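                      <p>For illustration, a minimal sketch of a scalar loop with the "MoveX2Y (LoadX mem)" shape discussed above. It assumes that the JIT represents <code>Double.longBitsToDouble</code> as a MoveL2D node; the class and method names are hypothetical. Each iteration loads a long and reinterprets its bits as a double, which is the pattern a vectorizer could turn into a vector load followed by a reinterpret:</p>
<pre>
```java
public class ReinterpretSketch {

    // Load a long, reinterpret its bits as a double, store the double.
    // Assumption: the bit-cast below is what the thread calls MoveL2D.
    static void bitsToDoubles(long[] in, double[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Double.longBitsToDouble(in[i]);
        }
    }

    // Small driver: encode a few doubles as raw bits, then decode them in the loop.
    static double[] demo() {
        long[] bits = {
            Double.doubleToRawLongBits(1.5),
            Double.doubleToRawLongBits(-2.0),
            0L
        };
        double[] out = new double[bits.length];
        bitsToDoubles(bits, out);
        return out;
    }
}
```
</pre>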
                      <br>
                      Best regards,<br>
                      Vladimir Ivanov<br>
                    </blockquote>
                  </div>
                </blockquote>
              </div>
            </blockquote>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </body>
</html>