<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body>
    <p>Hi,<br>
      I believe Emmanuel and Vladimir should be able to help with this (I
      see some comments in the PR already).</p>
    <p>Maurizio<br>
    </p>
    <div class="moz-cite-prefix">On 23/01/2025 07:04, Matthias Ernst
      wrote:<br>
    </div>
    <blockquote type="cite" cite="mid:CAKJ3wwEGO5gj7mF+hbY-Bt845XPK+GyMKZo43uRMwof3dBRiXw@mail.gmail.com">
      
      <div dir="ltr">Could someone help me move this fix (<a href="https://github.com/openjdk/jdk/pull/22856" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/pull/22856</a>)
        over the finish line? I don't think we should leave this
        performance on the table for FFM var handles. Thanks!
        <div><br>
        </div>
      </div>
      <br>
      <div class="gmail_quote gmail_quote_container">
        <div dir="ltr" class="gmail_attr">On Sat, Dec 21, 2024 at
          3:17 PM Matthias Ernst <<a href="mailto:matthias@mernst.org" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>>
          wrote:<br>
        </div>
        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
          <div dir="ltr">I see there's a new issue for this: <a href="https://bugs.openjdk.org/browse/JDK-8346664" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://bugs.openjdk.org/browse/JDK-8346664</a>.
            <div>Started working on a fix: <a href="https://github.com/openjdk/jdk/pull/22856" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/pull/22856</a>
              <div><br>
              </div>
            </div>
          </div>
          <br>
          <div class="gmail_quote">
            <div dir="ltr" class="gmail_attr">On Fri, Dec 20, 2024 at
              1:56 PM Matthias Ernst <<a href="mailto:matthias@mernst.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>>
              wrote:<br>
            </div>
            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div dir="ltr">> alignment check for the +1 case
                somehow evades C2 optimization.<br>
                <div><br>
                </div>
                <div>I believe I may understand how this is happening
                  (disclaimer: chemistry dog reading openjdk source,
                  apologies if I'm missing something):</div>
                <div><br>
                </div>
                <div>A dedicated transformation to simplify (base +
                  offset << shift) & mask type expressions was
                  actually introduced for Panama in <a href="https://github.com/openjdk/jdk/pull/6697" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/pull/6697</a>:</div>
                <div><a href="https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L2128" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L2128</a></div>
                <div>It requires the expression to be a variant of
                  AND(ADD(..., SHIFT(offset, shift)), mask).</div>
                <div><br>
                </div>
                <div>This is what turns (offset + (i << 3)) & 7
                  into the loop-invariant offset & 7.</div>
                <div><br>
                </div>
                <div>However, before this pattern is checked, a "shift"
                  node like ((i + 1) << 3) gets expanded into `(i
                  << 3) + 8` here: <a href="https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L961" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L961</a></div>
                <div><br>
                </div>
                <div>Now the node contains nested ADDs and no longer
                  matches the pattern: AND(ADD(..., ADD(SHIFT(offset,
                  shift), constant)), mask).</div>
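                <div><br>
                </div>
                <div>To make the arithmetic concrete, here is a minimal
                  sketch in plain Java (no FFM types). The class and method
                  names are made up for illustration, and the shift of 3
                  with mask 7 assumes the 8-byte elements from the benchmark.
                  It checks that shifting (i + 1) equals shifting i and adding
                  a constant, and that the alignment result for such offsets
                  depends only on the base, never on i:</div>
<pre>
```java
final class ShiftMaskDemo {
    // Alignment test in the shape discussed above: (base + offset) & (alignment - 1) == 0.
    static boolean isAligned(long base, long offset, long byteAlignment) {
        return ((base + offset) & (byteAlignment - 1)) == 0;
    }

    // True if ((i + 1) << shift) == (i << shift) + (1 << shift) for every i in [0, count).
    static boolean expansionHolds(long count, int shift) {
        for (long i = 0; i < count; i++) {
            if (((i + 1) << shift) != ((i << shift) + (1L << shift))) {
                return false;
            }
        }
        return true;
    }

    // True if the alignment result for offsets (i + 1) << 3 is the same for every i,
    // namely the result of testing the base alone.
    static boolean checkIsLoopInvariant(long base, long count) {
        boolean expected = (base & 7) == 0;
        for (long i = 0; i < count; i++) {
            if (isAligned(base, (i + 1) << 3, 8) != expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(expansionHolds(1_000, 3));
        System.out.println(checkIsLoopInvariant(4096, 1_000)); // aligned base
        System.out.println(checkIsLoopInvariant(4097, 1_000)); // misaligned base
    }
}
```
</pre>
                <div>Both forms are mathematically equivalent, so the
                  observed slowdown would come from the optimizer no longer
                  recognizing the shape, not from the check itself being
                  harder.</div>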
                <div><br>
                </div>
                <div>We can defeat the expansion by using a non-constant
                  "1":</div>
                <div>class Alignment {</div>
                <div>  long one = 1;</div>
                <div>&nbsp; handle.get(segment, (i+one) << 3) //
                  magically faster than (i+1)</div>
                <div>}</div>
                <div><br>
                </div>
                <div>For a fix, one could possibly make
                  AndIL_shift_and_mask_is_always_zero recursively
                  descend into the ADD tree.</div>
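                <div><br>
                </div>
                <div>As a rough illustration of what "recursively descend
                  into the ADD tree" could mean, here is a toy sketch in
                  Java. It is not the C2 code: the node classes and method
                  names are hypothetical, and shift amounts are assumed to be
                  in the range 0..63. It computes a lower bound on the number
                  of trailing zero bits of an expression and uses that to
                  decide whether the expression can be dropped under a
                  mask:</div>
<pre>
```java
final class AddTreeDemo {
    // Toy expression nodes for illustration only; these are not HotSpot's C2 node classes.
    interface Node {}
    static final class Var implements Node {}
    static final class Const implements Node {
        final long value;
        Const(long value) { this.value = value; }
    }
    static final class Shl implements Node {
        final Node in;
        final int shift; // assumed to be in 0..63
        Shl(Node in, int shift) { this.in = in; this.shift = shift; }
    }
    static final class Add implements Node {
        final Node left, right;
        Add(Node left, Node right) { this.left = left; this.right = right; }
    }

    // Lower bound on the number of trailing zero bits of n, whatever the variables are.
    static int guaranteedTrailingZeros(Node n) {
        if (n instanceof Const) {
            return Long.numberOfTrailingZeros(((Const) n).value); // 64 for the constant 0
        }
        if (n instanceof Shl) {
            Shl s = (Shl) n;
            return Math.min(64, guaranteedTrailingZeros(s.in) + s.shift);
        }
        if (n instanceof Add) {
            Add a = (Add) n;
            // Recursive descent: a sum keeps the zero low bits shared by both operands.
            return Math.min(guaranteedTrailingZeros(a.left), guaranteedTrailingZeros(a.right));
        }
        return 0; // unknown variable: nothing is guaranteed
    }

    // True if (n & mask) == 0 can be proven without knowing the variables.
    static boolean maskedAlwaysZero(Node n, long mask) {
        int tz = guaranteedTrailingZeros(n);
        return tz >= 64 || (mask >>> tz) == 0;
    }

    // The nested shape produced by the expansion: (i << 3) + 8.
    static Node expandedShift() {
        return new Add(new Shl(new Var(), 3), new Const(8));
    }

    public static void main(String[] args) {
        System.out.println(maskedAlwaysZero(expandedShift(), 7));
        System.out.println(maskedAlwaysZero(expandedShift(), 15));
    }
}
```
</pre>
                <div>Tracking trailing zero bits rather than "masked value
                  is zero" matters for the ADD case: two operands can each be
                  zero under a mask while their sum is not, unless the mask
                  only covers low bits that are zero in both.</div>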
                <div><br>
                </div>
              </div>
              <br>
              <div class="gmail_quote">
                <div dir="ltr" class="gmail_attr">On Thu, Dec 19, 2024
                  at 2:41 PM Matthias Ernst <<a href="mailto:matthias@mernst.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>>
                  wrote:<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                  <div dir="ltr">Thanks a lot for rewriting/reproducing!
                    <div><br>
                    </div>
                    <div>
                      <div>I've in the meantime tried to take some more
                        complexity out:</div>
                      <div>* replaced arrayElementVarHandle (which uses
                        MemoryLayout#scale with exact math) with a
                        "plain" version (`byteSize() * index`, at the
                        risk of silent overflows).</div>
                      <div>* I also eliminated the
                        "MethodHandles#collectCoordinates(h, 1, scale)" in
                        favor of a plain varHandle.get(segment, i *
                        byteSize()), after verifying they have identical
                        performance.</div>
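                      <div><br>
                      </div>
                      <div>For readers unfamiliar with the trade-off in
                        the first bullet, here is a small sketch of the
                        difference between overflow-checked and plain offset
                        scaling. It uses Math.multiplyExact as a stand-in for
                        "exact math"; whether MemoryLayout#scale is
                        implemented exactly this way is an assumption, and
                        the class and method names are made up:</div>
<pre>
```java
final class ScaleDemo {
    // Plain scaling: silently wraps around on overflow.
    static long plainScale(long byteSize, long index) {
        return byteSize * index;
    }

    // Overflow-checked scaling: throws ArithmeticException instead of wrapping.
    static long exactScale(long byteSize, long index) {
        return Math.multiplyExact(byteSize, index);
    }

    // Reports whether the checked variant rejects the given inputs.
    static boolean overflows(long byteSize, long index) {
        try {
            exactScale(byteSize, index);
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(plainScale(8, 3));
        System.out.println(overflows(8, 3));
        System.out.println(overflows(8, Long.MAX_VALUE));
        System.out.println(plainScale(8, Long.MAX_VALUE)); // wrapped, no exception
    }
}
```
</pre>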
                      <div><br>
                      </div>
                      <div>So we're down to a plain
                        "VarHandle.get(segment, i * byteSize)"  in four
                        combinations: ({aligned, unaligned} x {i, i+1}),
                        and the original observation still holds:<br>
                        the combo "aligned read" with "i + 1" somehow
                        trips C2:</div>
                      <div>Alignment.aligned         avgt       0.151  
                               ns/op<br>
                        Alignment.alignedPlus1    avgt       0.298      
                           ns/op<br>
                        Alignment.unaligned       avgt       0.153      
                           ns/op<br>
                      </div>
                      <div>
                        <div>Alignment.unalignedPlus1  avgt       0.153
                                   ns/op<br>
                        </div>
                      </div>
                      <div><br>
                      </div>
                      <div>This leaves us with the assumption that the
                        alignment check for the +1 case somehow evades
                        C2 optimization.</div>
                      <div>And indeed, we can remove all VarHandle
                        business and benchmark only the alignment check,
                        and see that C2 fails to recognize that the check is
                        loop-invariant:</div>
                      <div><br>
                      </div>
                    </div>
                    <div>    // simplified copy of
                      AbstractMemorySegmentImpl#isAlignedForElement</div>
                    <div>    private static boolean
                      isAligned(MemorySegment segment, long offset, long
                      byteAlignment) {<br>
                              return (((segment.address() + offset))
                      & (byteAlignment - 1)) == 0;<br>
                          }<br>
                    </div>
                    <div><br>
                    </div>
                    <div>    @Benchmark<br>
                          public void isAligned() {<br>
                              for (long i = 1; i < COUNT; ++i) {<br>
                                  if (!isAligned(segment, i * 8, 8))
                      throw new IllegalArgumentException();<br>
                              }<br>
                          }<br>
                      <br>
                          @Benchmark<br>
                          public void isAlignedPlus1() {<br>
                              for (long i = 0; i < COUNT - 1; ++i) {<br>
                                  if (!isAligned(segment, (i + 1) * 8,
                      8)) throw new IllegalArgumentException();<br>
                              }<br>
                          }<br>
                      <br>
                      =><br>
                      <br>
                      Alignment.isAligned       thrpt       35160425.491
                               ops/ns<br>
                      Alignment.isAlignedPlus1  thrpt              7.242
                               ops/ns<br>
                      <br>
                    </div>
                    <div>So this seems to be the culprit. Is it an
                      issue? Idk. Using a plain offset multiplication
                      instead of the overflow-protecting
                      MemoryLayout#scale actually seems to have a bigger
                      impact on performance than this.</div>
                    <div><br>
                    </div>
                    <div>> hsdis</div>
                    <div><br>
                    </div>
                    <div>I actually looked at hsdis and tried to
                      use Jorn's precompiled dylib, but it doesn't seem
                      to load for me. Installing the whole toolchain and
                      building my own is probably beyond what I'm trying
                      to do here (esp since I'm not even sure I could
                      read the assembly...)</div>
                    <div><br>
                    </div>
                    <div>Matthias</div>
                    <div><br>
                    </div>
                  </div>
                  <br>
                  <div class="gmail_quote">
                    <div dir="ltr" class="gmail_attr">On Wed, Dec 18,
                      2024 at 3:13 PM Per-Ake Minborg <<a href="mailto:per-ake.minborg@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">per-ake.minborg@oracle.com</a>>
                      wrote:<br>
                    </div>
                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                      <div>
                        <div dir="ltr">
                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
                            Hi Matthias!<br>
                            <br>
                            I've rewritten the benchmarks slightly (just
                            to "normalize" them the way we usually
                            write them), even though your benchmarks work
                            equally well. See attachment. By using the
                            commands <br>
                            <br>
                          </div>
                          <div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">
                            jvmArgsAppend = {</div>
                          <pre><div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">        "-XX:+PrintCompilation",
        "-XX:+UnlockDiagnosticVMOptions",
        "-XX:+PrintInlining" }</div></pre>
                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
                            in a <span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace">
                              @Fork</span> annotation, and observing the
                            output, it appears all the methods are
                            inlined properly. So, even though some
                            methods are more complex, it appears they
                            are treated in the same way when it comes to
                            inlining.<br>
                            <br>
                             By looking at the actual assembly generated
                            for the benchmarks using these commands (for
                            an M1 in my case):<br>
                            <br>
                          </div>
                          <pre><div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">@Fork(value = 1, jvmArgsAppend = {
        "-Xbatch",
        "-XX:-TieredCompilation",
        "-XX:CompileCommand=dontinline,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*",
        "-XX:CompileCommand=PrintAssembly,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*"
})</div></pre>
                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
                            I can see that the C2 compiler is able to
                            unroll the segment access in the <span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace">
                              findAligned</span> method but not in the <span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace">
                              findAlignedNext method. </span>This is
                            one reason<span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace"> findAligned</span> is
                            faster. In order to see real assembly, the
                            "hsdis-aarch64.dylib" must be present and it
                            is recommended to use a "fast-debug" version
                            of the JDK. Read more on Jorn Vernee's blog
                            here: <a href="https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html" id="m_8056903533551950494m_8138333552085303315m_1732376954563610074m_-7185004750415086813m_-2362717009702851623m_4218413662007508028LPlnk828414" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">
https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html</a></div>
                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
                            <br>
                          </div>
                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
                            <br>
                          </div>
                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
                            The question then becomes why that is the
                            case. This drives us into another field of
                            expertise where I am not the right person to
                            provide an answer. Generally, there is no
                            guarantee as to how the C2 compiler works
                            and we are improving it continuously. Maybe
                            someone else can provide additional
                            information.</div>
                          <div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">
                            <br>
                          </div>
                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
                            Best, Per Minborg</div>
                          <div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">
                            <br>
                          </div>
                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
                            <br>
                            <br>
                            <br>
                          </div>
                          <hr style="display:inline-block;width:98%">
                          <div id="m_8056903533551950494m_8138333552085303315m_1732376954563610074m_-7185004750415086813m_-2362717009702851623m_4218413662007508028divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b>
                              panama-dev <<a href="mailto:panama-dev-retn@openjdk.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">panama-dev-retn@openjdk.org</a>>
                              on behalf of Matthias Ernst <<a href="mailto:matthias@mernst.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>><br>
                              <b>Sent:</b> Wednesday, December 18, 2024
                              9:26 AM<br>
                              <b>To:</b> <a href="mailto:panama-dev@openjdk.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">panama-dev@openjdk.org</a>
                              <<a href="mailto:panama-dev@openjdk.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">panama-dev@openjdk.org</a>><br>
                              <b>Subject:</b> performance:
                              arrayElementVarHandle / calculated index /
                              aligned vs unaligned</font>
                            <div> </div>
                          </div>
                          <div>
                            <div dir="ltr">Hi,
                              <div><br>
                              </div>
                              <div>I'm trying to use the foreign memory
                                api to interpret some variable-length
                                encoded data, where an offset vector
                                encodes the start offset of each stride.
                                Accessing element `i` in this case
                                involves reading `offset[i+1]` in
                                addition to `offset[i]`. The offset
                                vector is modeled as a
                                `JAVA_LONG.arrayElementVarHandle()`.</div>
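                              <div><br>
                              </div>
                              <div>To illustrate the access pattern
                                described above, here is a simplified sketch
                                that uses plain Java arrays instead of memory
                                segments. The names are hypothetical, and the
                                offset vector is assumed to have one more
                                entry than there are elements, so that
                                element `i` spans `offsets[i]` up to
                                `offsets[i+1]`:</div>
<pre>
```java
import java.util.Arrays;

final class OffsetVectorDemo {
    // Length of element i: the gap between its start offset and the next start offset.
    static long elementLength(long[] offsets, int i) {
        return offsets[i + 1] - offsets[i];
    }

    // Copies element i out of the packed data.
    static byte[] element(byte[] data, long[] offsets, int i) {
        return Arrays.copyOfRange(data, (int) offsets[i], (int) offsets[i + 1]);
    }

    public static void main(String[] args) {
        byte[] data = "abcdef".getBytes();
        long[] offsets = {0, 2, 2, 6}; // three elements: "ab", "", "cdef"
        for (int i = 0; i + 1 < offsets.length; i++) {
            System.out.println(elementLength(offsets, i) + " " + new String(element(data, offsets, i)));
        }
    }
}
```
</pre>
                              <div>Every element access reads two
                                neighbouring entries of the offset vector,
                                which is why the `[i+1]` read sits on the hot
                                path.</div>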
                              <div><br>
                              </div>
                              <div>Just out of curiosity about bounds
                                and alignment checks I switched the
                                layout to JAVA_LONG_UNALIGNED for
                                reading (data is still aligned) and I
                                saw a large difference in performance
                                where I didn't expect one, and it seems
                                to boil down to the computed-index
                                `offset[i+1]` access, not the
                                `[i]` case. My expectation would have
                                been that all variants exhibit the same
                                performance, since alignment checks
                                would be moved out of the loop.<br>
                                <br>
                              </div>
                              <div>A micro-benchmark (attached) to
                                demonstrate:</div>
                              <div>long-aligned memory segment, looping
                                over the same elements in 6 different
                                ways:</div>
                              <div>{aligned, unaligned} x {segment[i] ,
                                segment[i+1],  segment[i+1] (w/ base
                                offset) } gives very different results
                                for aligned[i+1] (but not for
                                aligned[i]):<br>
                                <br>
                                Benchmark                         Mode
                                 Cnt    Score   Error  Units<br>
                                Alignment.findAligned            thrpt  
                                    217.050          ops/s<br>
                                Alignment.findAlignedPlusOne     thrpt  
                                    110.366          ops/s. <= #####<br>
                                Alignment.findAlignedNext    thrpt      
                                110.377          ops/s. <= #####<br>
                                Alignment.findUnaligned          thrpt  
                                    216.591          ops/s<br>
                                Alignment.findUnalignedPlusOne   thrpt  
                                    215.843          ops/s<br>
                              </div>
                              <div>Alignment.findUnalignedNext  thrpt  
                                    216.483          ops/s</div>
                              <div>
                                <div><br>
                                </div>
                              </div>
                              <div>openjdk version "23.0.1" 2024-10-15<br>
                                OpenJDK Runtime Environment (build
                                23.0.1+11-39)<br>
                                OpenJDK 64-Bit Server VM (build
                                23.0.1+11-39, mixed mode, sharing)</div>
                              <div>Macbook Air M3</div>
                              <div><br>
                              </div>
                              <div>Needless to say, the
                                difference was smaller with more app
                                code in play, but large enough to give
                                me pause. Likely it wouldn't matter at
                                all, but I want to have a better idea of
                                which design choices to pay attention
                                to. With the foreign memory api, I
                                find it a bit difficult to distinguish
                                convenience from performance-relevant
                                options (e.g. using path expressions vs
                                computed offsets vs using a base offset).
                                Besides "make layouts and varhandles
                                static final", what would be other rules
                                of thumb?</div>
                              <div><br>
                              </div>
                              <div>Thx</div>
                              <font color="#888888">
                                <div>Matthias</div>
                                <div><br>
                                </div>
                              </font></div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                  </div>
                </blockquote>
              </div>
            </blockquote>
          </div>
        </blockquote>
      </div>
    </blockquote>
  </body>
</html>