<!DOCTYPE html><html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body>

    <p>Hi,<br>

      I believe Emmanuel and Vladimir should be able to help with this (I

      see some comments in the PR already).</p>

    <p>Maurizio<br>

    </p>

    <div class="moz-cite-prefix">On 23/01/2025 07:04, Matthias Ernst

      wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:CAKJ3wwEGO5gj7mF+hbY-Bt845XPK+GyMKZo43uRMwof3dBRiXw@mail.gmail.com">

      <div dir="ltr">Could someone help me move this fix (<a href="https://github.com/openjdk/jdk/pull/22856" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/pull/22856</a>)

        over the finish line? I don't think we should leave this

        performance on the table for FFM var handles. Thanks!

        <div><br>

        </div>

      </div>

      <br>

      <div class="gmail_quote gmail_quote_container">

        <div dir="ltr" class="gmail_attr">On Sat, Dec 21, 2024 at

          3:17 PM Matthias Ernst <<a href="mailto:matthias@mernst.org" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>>

          wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

          <div dir="ltr">I see there's a new issue for this: <a href="https://bugs.openjdk.org/browse/JDK-8346664" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://bugs.openjdk.org/browse/JDK-8346664</a>.

            <div>Started working on a fix: <a href="https://github.com/openjdk/jdk/pull/22856" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/pull/22856</a>

              <div><br>

              </div>

            </div>

          </div>

          <br>

          <div class="gmail_quote">

            <div dir="ltr" class="gmail_attr">On Fri, Dec 20, 2024 at

              1:56 PM Matthias Ernst <<a href="mailto:matthias@mernst.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>>

              wrote:<br>

            </div>

            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

              <div dir="ltr">> alignment check for the +1 case

                somehow evades C2 optimization.<br>

                <div><br>

                </div>

                <div>I believe I may understand how this is happening

                  (disclaimer: chemistry dog reading openjdk source,

                  apologies if I'm missing something):</div>

                <div><br>

                </div>

                <div>A dedicated transformation to simplify expressions of the form

                  (base + (offset << shift)) & mask was

                  actually introduced for Panama in <a href="https://github.com/openjdk/jdk/pull/6697" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/pull/6697</a>:</div>

                <div><a href="https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L2128" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L2128</a></div>

                <div>It requires the expression to be a variant of

                  AND(ADD(..., SHIFT(offset, shift)), mask).</div>

                <div><br>

                </div>

                <div>This is what turns (offset + (i << 8)) & 7

                  into a loop-invariant.</div>

                <div><br>

                </div>

                <div>However, before this pattern is checked, a "shift"

                  node like ((i+1) << 8) gets expanded into `(i

                  << 8) + (1 << 8)` here: <a href="https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L961" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L961</a></div>

                <div><br>

                </div>

                <div>Now the node contains nested ADDs and no longer

                  matches the pattern: AND(ADD(..., ADD(SHIFT(offset,

                  shift), 1 << shift)), mask).</div>
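                <div><br>

                </div>

                <div>To make the two shapes concrete, here is a minimal sketch in plain Java. It is not HotSpot code: the class `ShiftExpansion`, its method and the sample base address are made up for illustration. It only checks the arithmetic: distributing the shift over the add does not change the value, and as long as the mask only covers bits below the shift amount, the masked result depends on the base alone.</div>

<pre>
```java
// Illustration only; names and values are hypothetical, not taken from HotSpot.
final class ShiftExpansion {
    // Returns true if, for every i in [0, count), the distributed shape equals
    // the original one and the masked address depends on the base alone.
    static boolean maskedValueIsInvariant(long base, int shift, long mask, long count) {
        for (long i = 0; i < count; i++) {
            long direct = (i + 1) << shift;                // shape in the Java source
            long expanded = (i << shift) + (1L << shift);  // shape after distributing the shift
            if (direct != expanded) return false;
            if (((base + expanded) & mask) != (base & mask)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        long base = 0x1000; // assumed 8-byte aligned base address
        System.out.println(maskedValueIsInvariant(base, 3, 7, 1_000));  // true
        System.out.println(maskedValueIsInvariant(base, 8, 7, 1_000));  // true
        System.out.println(maskedValueIsInvariant(base, 3, 15, 1_000)); // false: mask reaches bit 3
    }
}
```
</pre>

                <div>So the nested form is still invariant in principle; the pattern matcher just no longer sees it.</div>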

                <div><br>

                </div>

                <div>We can defeat the expansion by using a non-constant

                  "1":</div>

                <div>class Alignment {</div>

                <div>  long one = 1; // non-constant "1"</div>

                <div>  handle.get(segment, (i+one) << 8); //

                  magically faster than (i+1)</div>

                <div>}</div>

                <div><br>

                </div>

                <div>For a fix, one could possibly make

                  AndIL_shift_and_mask_is_always_zero recursively

                  descend into the ADD tree.</div>
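                <div><br>

                </div>

                <div>A rough idea of what such a recursive check could look like, sketched in plain Java rather than in C2's C++. Everything here is hypothetical: the node classes (`Var`, `Const`, `Shl`, `Add`) are tiny stand-ins for C2's IR, and the method names are mine. The sketch computes a lower bound on the trailing zero bits of an expression, descending through nested ADDs, and then asks whether the mask only touches those zero bits.</div>

<pre>
```java
// Hypothetical mini-IR; a sketch of the idea, not the actual C2 implementation.
final class MaskAnalysis {
    interface Node {}

    static final class Var implements Node {}

    static final class Const implements Node {
        final long value;
        Const(long value) { this.value = value; }
    }

    static final class Shl implements Node {
        final Node operand;
        final int shift; // assumed to be a constant in [0, 63]
        Shl(Node operand, int shift) { this.operand = operand; this.shift = shift; }
    }

    static final class Add implements Node {
        final Node left;
        final Node right;
        Add(Node left, Node right) { this.left = left; this.right = right; }
    }

    // Lower bound on the number of trailing zero bits of the node's value.
    static int trailingZeros(Node n) {
        if (n instanceof Const) {
            return Long.numberOfTrailingZeros(((Const) n).value); // 64 for the constant 0
        }
        if (n instanceof Shl) {
            Shl s = (Shl) n;
            return Math.min(64, trailingZeros(s.operand) + s.shift);
        }
        if (n instanceof Add) {
            Add a = (Add) n; // a sum keeps the zero bits both addends share
            return Math.min(trailingZeros(a.left), trailingZeros(a.right));
        }
        return 0; // unknown value: no guarantee
    }

    // True if (n & mask) is zero for every possible value of the variables in n.
    static boolean andIsAlwaysZero(Node n, long mask) {
        int tz = trailingZeros(n);
        return tz >= 64 || (mask >>> tz) == 0;
    }

    public static void main(String[] args) {
        Node plain = new Shl(new Var(), 3);                       // i << 3
        Node plusOne = new Add(new Shl(new Var(), 3), new Const(8)); // (i << 3) + 8
        Node odd = new Add(new Shl(new Var(), 3), new Const(4));     // (i << 3) + 4
        System.out.println(andIsAlwaysZero(plain, 7));   // true
        System.out.println(andIsAlwaysZero(plusOne, 7)); // true
        System.out.println(andIsAlwaysZero(odd, 7));     // false
    }
}
```
</pre>

                <div>With a check like this, AND(ADD(base, rest), mask) could be reduced to AND(base, mask) whenever `rest` is always zero under the mask, regardless of how deeply the ADDs are nested.</div>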

                <div><br>

                </div>

              </div>

              <br>

              <div class="gmail_quote">

                <div dir="ltr" class="gmail_attr">On Thu, Dec 19, 2024

                  at 2:41 PM Matthias Ernst <<a href="mailto:matthias@mernst.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>>

                  wrote:<br>

                </div>

                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                  <div dir="ltr">Thanks a lot for rewriting/reproducing!

                    <div><br>

                    </div>

                    <div>

                      <div>I've in the meantime tried to take some more

                        complexity out:</div>

                      <div>* replaced arrayElementVarHandle (which uses

                        MemoryLayout#scale with exact math) with a

                        "plain" version (`byteSize() * index`, at the

                        risk of silent overflows).</div>

                      <div>* I also eliminated the

                        "VarHandle#collectCoordinates(h,1, scale)" in

                        favor of a plain varHandle.get(segment, i *

                        byteSize()), after verifying they have identical

                        performance.</div>
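                      <div><br>

                      </div>

                      <div>For readers who want to see the difference between the two offset computations, here is a small model in plain Java. It is only a sketch: `OffsetScaling` and its two methods are hypothetical names, and the "exact" variant merely mimics overflow-checked arithmetic with `Math.multiplyExact` / `Math.addExact`; it is not the JDK's implementation.</div>

<pre>
```java
// Hypothetical model of the two index-to-offset computations; not JDK source.
final class OffsetScaling {
    // Overflow-checked variant: throws ArithmeticException instead of wrapping.
    static long scaleExact(long baseOffset, long byteSize, long index) {
        return Math.addExact(baseOffset, Math.multiplyExact(byteSize, index));
    }

    // "Plain" variant: cheaper, but silently wraps around on overflow.
    static long scalePlain(long baseOffset, long byteSize, long index) {
        return baseOffset + byteSize * index;
    }

    public static void main(String[] args) {
        System.out.println(scaleExact(0, 8, 5)); // 40
        System.out.println(scalePlain(0, 8, 5)); // 40

        long huge = Long.MAX_VALUE / 4;
        System.out.println(scalePlain(0, 8, huge)); // -8: wrapped around silently
        try {
            scaleExact(0, 8, huge);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```
</pre>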

                      <div><br>

                      </div>

                      <div>So we're down to a plain

                        "VarHandle.get(segment, i * byteSize)"  in four

                        combinations: ({aligned, unaligned} x {i, i+1}),

                        and the original observation still holds:<br>

                        the combo "aligned read" with "i + 1" somehow

                        trips C2:</div>

                      <div>Alignment.aligned         avgt       0.151  

                               ns/op<br>

                        Alignment.alignedPlus1    avgt       0.298      

                           ns/op<br>

                        Alignment.unaligned       avgt       0.153      

                           ns/op<br>

                      </div>

                      <div>

                        <div>Alignment.unalignedPlus1  avgt       0.153

                                   ns/op<br>

                        </div>

                      </div>

                      <div><br>

                      </div>

                      <div>This leaves us with the assumption that the

                        alignment check for the +1 case somehow evades

                        C2 optimization.</div>

                      <div>And indeed, we can remove all VarHandle

                        business and only benchmark the alignment check,

                        and see that it fails to recognize that it is

                        loop-invariant:</div>

                      <div><br>

                      </div>

                    </div>

                    <div>    // simplified copy of

                      AbstractMemorySegmentImpl#isAlignedForElement</div>

                    <div>    private static boolean

                      isAligned(MemorySegment segment, long offset, long

                      byteAlignment) {<br>

                              return (((segment.address() + offset))

                      & (byteAlignment - 1)) == 0;<br>

                          }<br>

                    </div>

                    <div><br>

                    </div>

                    <div>    @Benchmark<br>

                          public void isAligned() {<br>

                              for (long i = 1; i < COUNT; ++i) {<br>

                                  if (!isAligned(segment, i * 8, 8))

                      throw new IllegalArgumentException();<br>

                              }<br>

                          }<br>

                      <br>

                          @Benchmark<br>

                          public void isAlignedPlus1() {<br>

                              for (long i = 0; i < COUNT - 1; ++i) {<br>

                                  if (!isAligned(segment, (i + 1) * 8,

                      8)) throw new IllegalArgumentException();<br>

                              }<br>

                          }<br>

                      <br>

                      =><br>

                      <br>

                      Alignment.isAligned       thrpt       35160425.491

                               ops/ns<br>

                      Alignment.isAlignedPlus1  thrpt              7.242

                               ops/ns<br>

                      <br>

                    </div>

                    <div>So this seems to be the culprit. Is it an

                      issue? Idk. Using a plain offset multiplication

                      instead of the overflow-protecting

                      MemoryLayout#scale actually seems to have a bigger

                      impact on performance than this.</div>

                    <div><br>

                    </div>

                    <div>> hsdis</div>

                    <div><br>

                    </div>

                    <div>I actually looked at hsdis and tried to

                      use Jorn's precompiled dylib, but it doesn't seem

                      to load for me. Installing the whole toolchain and

                      building my own is probably beyond what I'm trying

                      to do here (esp since I'm not even sure I could

                      read the assembly...)</div>

                    <div><br>

                    </div>

                    <div>Matthias</div>

                    <div><br>

                    </div>

                  </div>

                  <br>

                  <div class="gmail_quote">

                    <div dir="ltr" class="gmail_attr">On Wed, Dec 18,

                      2024 at 3:13 PM Per-Ake Minborg <<a href="mailto:per-ake.minborg@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">per-ake.minborg@oracle.com</a>>

                      wrote:<br>

                    </div>

                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                      <div>

                        <div dir="ltr">

                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

                            Hi Matthias!<br>

                            <br>

                            I've rewritten the benchmarks slightly (just

                            to make them "normalized" the way we usually

                            write them) even though your benchmarks work

                            equally well. See attachment. By using the

                            commands <br>

                            <br>

                          </div>

                          <div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">

                            jvmArgsAppend = {</div>

                          <pre><div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">        "-XX:+PrintCompilation",

        "-XX:+UnlockDiagnosticVMOptions",

        "-XX:+PrintInlining" }</div></pre>

                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

                            in a <span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace">

                              @Fork</span> annotation, and observing the

                            output, it appears all the methods are

                            inlined properly. So, even though some

                            methods are more complex, it appears they

                            are treated in the same way when it comes to

                            inlining.<br>

                            <br>

                             By looking at the actual assembly generated

                            for the benchmarks using these commands (for

                            an M1 in my case):<br>

                            <br>

                          </div>

                          <pre><div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">@Fork(value = 1, jvmArgsAppend = {

        "-Xbatch",

        "-XX:-TieredCompilation",

        "-XX:CompileCommand=dontinline,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*",

        "-XX:CompileCommand=PrintAssembly,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*"

})</div></pre>

                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

                            I can see that the C2 compiler is able to

                            unroll the segment access in the <span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace">

                              findAligned</span> method but not in the <span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace">

                              findAlignedNext method. </span>This is

                            one reason<span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace"> findAligned</span> is

                            faster. In order to see real assembly, the

                            "hsdis-aarch64.dylib" must be present and it

                            is recommended to use a "fast-debug" version

                            of the JDK. Read more on Jorn Vernee's blog

                            here: <a href="https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html" id="m_8056903533551950494m_8138333552085303315m_1732376954563610074m_-7185004750415086813m_-2362717009702851623m_4218413662007508028LPlnk828414" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">

https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html</a></div>

                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

                            <br>

                          </div>

                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

                            <br>

                          </div>

                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

                            The question then becomes why that is the

                            case. This drives us into another field of

                            expertise where I am not the right person to

                            provide an answer. Generally, there is no

                            guarantee as to how the C2 compiler works

                            and we are improving it continuously. Maybe

                            someone else can provide additional

                            information.</div>

                          <div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">

                            <br>

                          </div>

                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

                            Best, Per Minborg</div>

                          <div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">

                            <br>

                          </div>

                          <div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

                            <br>

                            <br>

                            <br>

                          </div>

                          <hr style="display:inline-block;width:98%">

                          <div id="m_8056903533551950494m_8138333552085303315m_1732376954563610074m_-7185004750415086813m_-2362717009702851623m_4218413662007508028divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b>

                              panama-dev <<a href="mailto:panama-dev-retn@openjdk.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">panama-dev-retn@openjdk.org</a>>

                              on behalf of Matthias Ernst <<a href="mailto:matthias@mernst.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>><br>

                              <b>Sent:</b> Wednesday, December 18, 2024

                              9:26 AM<br>

                              <b>To:</b> <a href="mailto:panama-dev@openjdk.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">panama-dev@openjdk.org</a>

                              <<a href="mailto:panama-dev@openjdk.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">panama-dev@openjdk.org</a>><br>

                              <b>Subject:</b> performance:

                              arrayElementVarHandle / calculated index /

                              aligned vs unaligned</font>

                            <div> </div>

                          </div>

                          <div>

                            <div dir="ltr">Hi,

                              <div><br>

                              </div>

                              <div>I'm trying to use the foreign memory

                                api to interpret some variable-length

                                encoded data, where an offset vector

                                encodes the start offset of each stride.

                                Accessing element `i` in this case

                                involves reading `offset[i+1]` in

                                addition to `offset[i]`. The offset

                                vector is modeled as a

                                `JAVA_LONG.arrayElementVarHandle()`.</div>
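                              <div><br>

                              </div>

                              <div>As a rough illustration of that access pattern, here is a sketch using plain Java arrays as a stand-in for the memory segment and the var handle. The class `OffsetVector`, the sample data and the offsets are all made up; the point is only that stride `i` needs both `offsets[i]` and `offsets[i + 1]`.</div>

<pre>
```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in: plain arrays instead of a MemorySegment and a VarHandle.
final class OffsetVector {
    // Start (inclusive) and end (exclusive) of stride i come from two adjacent slots.
    static byte[] stride(byte[] data, long[] offsets, int i) {
        int start = Math.toIntExact(offsets[i]);
        int end = Math.toIntExact(offsets[i + 1]);
        return Arrays.copyOfRange(data, start, end);
    }

    public static void main(String[] args) {
        byte[] data = "abcdefghij".getBytes(StandardCharsets.US_ASCII);
        long[] offsets = {0, 2, 7, 10}; // three strides: "ab", "cdefg", "hij"
        for (int i = 0; i + 1 < offsets.length; i++) {
            System.out.println(new String(stride(data, offsets, i), StandardCharsets.US_ASCII));
        }
    }
}
```
</pre>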

                              <div><br>

                              </div>

                              <div>Just out of curiosity about bounds

                                and alignment checks I switched the

                                layout to JAVA_LONG_UNALIGNED for

                                reading (data is still aligned) and I

                                saw a large difference in performance

                                where I didn't expect one, and it seems

                                to boil down to the computed-index access

                                `endOffset[i+1]`, not the

                                `[i]` case. My expectation would have

                                been that all variants exhibit the same

                                performance, since alignment checks

                                would be moved out of the loop.<br>

                                <br>

                              </div>

                              <div>A micro-benchmark (attached) to

                                demonstrate:</div>

                              <div>long-aligned memory segment, looping

                                over the same elements in 6 different

                                ways:</div>

                              <div>{aligned, unaligned} x {segment[i] ,

                                segment[i+1],  segment[i+1] (w/ base

                                offset) } gives very different results

                                for aligned[i+1] (but not for

                                aligned[i]):<br>

                                <br>

                                Benchmark                         Mode

                                 Cnt    Score   Error  Units<br>

                                Alignment.findAligned            thrpt  

                                    217.050          ops/s<br>

                                Alignment.findAlignedPlusOne     thrpt  

                                    110.366          ops/s. <= #####<br>

                                Alignment.findAlignedNext    thrpt      

                                110.377          ops/s. <= #####<br>

                                Alignment.findUnaligned          thrpt  

                                    216.591          ops/s<br>

                                Alignment.findUnalignedPlusOne   thrpt  

                                    215.843          ops/s<br>

                              </div>

                              <div>Alignment.findUnalignedNext  thrpt  

                                    216.483          ops/s</div>

                              <div>

                                <div><br>

                                </div>

                              </div>

                              <div>openjdk version "23.0.1" 2024-10-15<br>

                                OpenJDK Runtime Environment (build

                                23.0.1+11-39)<br>

                                OpenJDK 64-Bit Server VM (build

                                23.0.1+11-39, mixed mode, sharing)</div>

                              <div>Macbook Air M3</div>

                              <div><br>

                              </div>

                              <div>Needless to say, the

                                difference was smaller with more app

                                code in play, but large enough to give

                                me pause. Likely it wouldn't matter at

                                all but I want to have a better idea

                                which design choices to pay attention

                                to. With the foreign memory api, I

                                find it a bit difficult to distinguish

                                convenience from performance-relevant

                                options (e.g. using path expressions vs

                                computed offsets vs using a base offset).

                                Besides "make layouts and varhandles

                                static final", what would be other rules

                                of thumb?</div>

                              <div><br>

                              </div>

                              <div>Thx</div>

                              <font color="#888888">

                                <div>Matthias</div>

                                <div><br>

                                </div>

                              </font></div>

                          </div>

                        </div>

                      </div>

                    </blockquote>

                  </div>

                </blockquote>

              </div>

            </blockquote>

          </div>

        </blockquote>

      </div>

    </blockquote>

  </body>

</html>