<!DOCTYPE html><html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body>

    <p>Hi Antoine,<br>

      auto-vectorization of memory segment accesses still fails in some
      cases. This is mostly due to:</p>

    <p><a class="moz-txt-link-freetext" href="https://bugs.openjdk.org/browse/JDK-8324751">https://bugs.openjdk.org/browse/JDK-8324751</a></p>

    <p>That is, when working with a "source" and a "target" segment, if

      the auto-vectorizer cannot prove that the two segments are

      disjoint, no vectorization occurs.</p>
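
    <p>To illustrate why disjointness matters, here is a minimal sketch
      (the class and method names are hypothetical, and heap segments
      are used only to keep it self-contained). It runs the same scalar
      add loop on two disjoint segments and on two overlapping slices
      of one segment. In the overlapping case each iteration reads a
      value written by the previous one, so the loop could not simply
      be executed several elements at a time:</p>

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Arrays;

final class SegmentAddSketch {

    // Scalar element-wise add: dst[i] += src[i].
    // Hypothetical helper for illustration, not a JDK API.
    static void add(MemorySegment src, MemorySegment dst, long count) {
        for (long i = 0; i < count; i++) {
            double a = src.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            double b = dst.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            dst.setAtIndex(ValueLayout.JAVA_DOUBLE, i, a + b);
        }
    }

    // Source and target are backed by different arrays, so they are disjoint.
    static double[] disjointDemo() {
        MemorySegment src = MemorySegment.ofArray(new double[] {1, 1, 1, 1});
        MemorySegment dst = MemorySegment.ofArray(new double[] {1, 1, 1, 1});
        add(src, dst, 4);
        return dst.toArray(ValueLayout.JAVA_DOUBLE);
    }

    // Source and target are slices of the same segment, shifted by one element:
    // dst[i] is the same memory as src[i + 1].
    static double[] overlappingDemo() {
        MemorySegment base = MemorySegment.ofArray(new double[] {1, 1, 1, 1, 1});
        MemorySegment src = base.asSlice(0, 4L * Double.BYTES);
        MemorySegment dst = base.asSlice(Double.BYTES, 4L * Double.BYTES);
        add(src, dst, 4);
        return base.toArray(ValueLayout.JAVA_DOUBLE);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(disjointDemo()));    // [2.0, 2.0, 2.0, 2.0]
        System.out.println(Arrays.toString(overlappingDemo())); // [1.0, 2.0, 3.0, 4.0, 5.0]
    }
}
```

    <p>Loading all four source elements before storing would give
      1, 2, 2, 2, 2 in the overlapping case instead of 1, 2, 3, 4, 5,
      which is why the loop can only be vectorized when the segments
      are known not to overlap.</p>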

    <p>This is an issue for operations like add, or copy, but it's not

      an issue with something like MemorySegment::fill (as that method

      only works on a single segment).</p>
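
    <p>For contrast, a single-segment routine might look like the
      sketch below (again with hypothetical names). Every access goes
      through the same segment, so the question of whether a source
      and a target overlap does not arise. The sketch makes no claim
      about what the JIT actually emits for this loop:</p>

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class SingleSegmentSketch {

    // Hand-written fill over one segment. Hypothetical helper, not a JDK API.
    static void fillDoubles(MemorySegment seg, double value) {
        long count = seg.byteSize() / Double.BYTES;
        for (long i = 0; i < count; i++) {
            seg.setAtIndex(ValueLayout.JAVA_DOUBLE, i, value);
        }
    }

    // Fills a 4-element segment with 3.0 using the loop above.
    static double[] loopFilled() {
        MemorySegment seg = MemorySegment.ofArray(new double[4]);
        fillDoubles(seg, 3.0);
        return seg.toArray(ValueLayout.JAVA_DOUBLE);
    }

    // Same segment shape, but cleared with the built-in MemorySegment::fill,
    // which sets every byte of the segment to the given value.
    static double[] zeroed() {
        MemorySegment seg = MemorySegment.ofArray(new double[] {7, 7, 7, 7});
        seg.fill((byte) 0);
        return seg.toArray(ValueLayout.JAVA_DOUBLE);
    }
}
```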

    <p>We hope to make progress on this issue, as that would let
      third-party routines on memory segments benefit from
      vectorization too, without needing an intrinsic in the JDK.</p>

    <p>Maurizio<br>

    </p>


    <div class="moz-cite-prefix">On 30/09/2024 13:04, Antoine Chambille

      wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:CAJGQDwmGNw=tiEMp6L-hUPCP7G7NDuDwwQQu9sc7XyYportP-A@mail.gmail.com">

      <div dir="ltr">Hello everyone,<br>

        <br>

        I've rebuilt the latest OpenJDK (24) from <a href="https://github.com/openjdk/panama-vector" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/panama-vector</a>

        and rerun the array addition benchmark:<br>

        <br>

        <font face="monospace">AddBenchmark<br>

           .scalarArrayArray            thrpt    5   6487636 ops/s<br>

           .scalarArrayArrayLongStride  thrpt    5   1001515 ops/s<br>

           .scalarSegmentArray          thrpt    5   1747531 ops/s<br>

           .scalarSegmentSegment        thrpt    5   1154193 ops/s<br>

           .scalarUnsafeArray           thrpt    5   6970073 ops/s<br>

           .scalarUnsafeUnsafe          thrpt    5   1246625 ops/s<br>

           .unrolledArrayArray          thrpt    5   1251824 ops/s<br>

           .unrolledSegmentArray        thrpt    5   1694164 ops/s<br>

           .unrolledUnsafeArray         thrpt    5   5043685 ops/s<br>

           .unrolledUnsafeUnsafe        thrpt    5   1197024 ops/s<br>

           .vectorArrayArray            thrpt    5   7200224 ops/s<br>

           .vectorArraySegment          thrpt    5   7377553 ops/s<br>

           .vectorSegmentArray          thrpt    5   7263505 ops/s<br>

           .vectorSegmentSegment        thrpt    5   7143647 ops/s</font><br>

        <br>

        <ul>

          <li>Performance using the vector API is now very consistent

            and good across arrays and segments.</li>

          <li>Reading and writing from/to segments still seems to be

            disrupting auto-vectorization. Reading with Unsafe works

            well but it's marked for removal.</li>

          <li>Less important, manual unrolling also seems to be

            disrupting auto-vectorization.</li>

        </ul>

        <br>

        <br>

        Best,<br>

        -Antoine<br>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Tue, Mar 26, 2024 at
          5:40 PM Vladimir Ivanov &lt;<a href="mailto:vladimir.x.ivanov@oracle.com" moz-do-not-send="true" class="moz-txt-link-freetext">vladimir.x.ivanov@oracle.com</a>&gt;
          wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

          >> Personally, I prefer to see vectorizer handling

          "MoveX2Y (LoadX mem)"<br>

          >> => "VectorReinterpret (LoadVector mem)" well and

          then introduce rules to<br>

          >> strength-reduce it to mismatched access.<br>

          > <br>

          > Do I understand you right that you're saying the vector

          node for MoveL2D<br>

          > (for instance) is VectorReinterpret so we could vectorize

          the code.<br>

          > <br>

          > Are you then suggesting that we can transform:<br>

          > <br>

          > (VectorReinterpret (LoadVector mem))<br>

          > <br>

          > into:<br>

          > <br>

          > (LoadVector mem)<br>

          > <br>

          > with that LoadVector a mismatched access?<br>

          <br>

          Yes, but thinking more about it, the latter step may be

          optional. For <br>

          example, the VectorReinterpret implementation on x86 is a
          no-op, so not much is <br>
          gained from folding VectorReinterpret+LoadVector into a
          mismatched <br>
          LoadVector.<br>

          <br>

          Best regards,<br>

          Vladimir Ivanov<br>

        </blockquote>

      </div>

    </blockquote>

  </body>

</html>