<div dir="ltr">Could someone help me move this fix (<a href="https://github.com/openjdk/jdk/pull/22856" target="_blank">https://github.com/openjdk/jdk/pull/22856</a>) over the finish line? I don't think we should leave this performance on the table for FFM var handles. Thanks!<div><br></div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Sat, Dec 21, 2024 at 3:17 PM Matthias Ernst <<a href="mailto:matthias@mernst.org">matthias@mernst.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">I see there's a new issue for this: <a href="https://bugs.openjdk.org/browse/JDK-8346664" target="_blank">https://bugs.openjdk.org/browse/JDK-8346664</a>.<div>Started working on a fix: <a href="https://github.com/openjdk/jdk/pull/22856" target="_blank">https://github.com/openjdk/jdk/pull/22856</a><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Dec 20, 2024 at 1:56 PM Matthias Ernst <<a href="mailto:matthias@mernst.org" target="_blank">matthias@mernst.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">> alignment check for the +1 case somehow evades C2 optimization.<br><div><br></div><div>I believe I may understand how this is happening (disclaimer: chemistry dog reading openjdk source, apologies if I'm missing something):</div><div><br></div><div>A dedicated transformation to simplify (base + offset << shift) & mask type expressions was actually introduced for Panama in <a href="https://github.com/openjdk/jdk/pull/6697" target="_blank">https://github.com/openjdk/jdk/pull/6697</a>:</div><div><a href="https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L2128" target="_blank">https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L2128</a></div><div>It requires the expression to be a variant of AND(ADD(..., SHIFT(offset, shift), mask).</div><div><br></div><div>This is what turns (offset + i << 8) & 7 into a loop-invariant.</div><div><br></div><div>However, before this pattern is checked, a "shift" node like ((i+1) << 8) gets expanded into `i << 8 + 8` here: <a href="https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L961" target="_blank">https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L961</a></div><div><br></div><div>Now the node contains nested ADDs and no longer matches the pattern: AND(ADD(..., ADD(SHIFT(offset, shift), shift)), mask) . </div><div><br></div><div>We can defeat the expansion by using a non-constant "1":</div><div>class Aligmnent {</div><div>  long one = 1;</div><div>  handle.get(segment, (i+one) << 8) // magically faster than (i+1)</div><div>}</div><div><br></div><div>For a fix, one could possibly make AndIL_shift_and_mask_is_always_zero recursively descend into the ADD tree.</div><div><br></div><div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Dec 19, 2024 at 2:41 PM Matthias Ernst <<a href="mailto:matthias@mernst.org" target="_blank">matthias@mernst.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Thanks a lot for rewriting/reproducing!<div><br></div><div><div>I've in the meantime tried to take some more complexity out:</div><div>* replaced arrayElementVarHandle (which uses MemoryLayout#scale with exact math) with a "plain" version (`byteSize() * index`, at the risk of silent overflows).</div><div>* I also eliminated the "VarHandle#collectCoordinates(h,1, scale)" in favor of a plain varHandle.get(segment, i * byteSize()), after verifying they have identical performance.</div><div><br></div><div>So we're down to a plain "VarHandle.get(segment, i * byteSize)"  in four combinations: ({aligned, unaligned} x {i, i+1}), and the original observation still holds:<br>the combo "aligned read" with "i + 1" somehow trips C2:</div><div>Alignment.aligned         avgt       0.151          ns/op<br>Alignment.alignedPlus1    avgt       0.298          ns/op<br>Alignment.unaligned       avgt       0.153          ns/op<br></div><div><div>Alignment.unalignedPlus1  avgt       0.153          ns/op<br></div></div><div><br></div><div>This leaves us with the assumption that the alignment check for the +1 case somehow evades C2 optimization.</div><div>And indeed, we can remove all VarHandle business and only benchmark the alignment check, and see that it fails to recognize that it is loop-invariant:</div><div><br></div></div><div>    // simplified copy of AbstractMemorySegmentImpl#isAlignedForElement</div><div>    private static boolean isAligned(MemorySegment segment, long offset, long byteAlignment) {<br>        return (((segment.address() + offset)) & (byteAlignment - 1)) == 0;<br>    }<br></div><div><br></div><div>    @Benchmark<br>    public void isAligned() {<br>        for (long i = 1; i < COUNT; ++i) {<br>            if (!isAligned(segment, i * 8, 8)) throw new IllegalArgumentException();<br>        }<br>    }<br><br>    @Benchmark<br>    public void isAlignedPlus1() {<br>        for (long i = 0; i < COUNT - 1; ++i) {<br>            if (!isAligned(segment, (i + 1) * 8, 8)) throw new IllegalArgumentException();<br>        }<br>    }<br><br>=><br><br>Alignment.isAligned       thrpt       35160425.491          ops/ns<br>Alignment.isAlignedPlus1  thrpt              7.242          ops/ns<br><br></div><div>So this seems to be the culprit. Is it an issue? Idk. Using a plain offset multiplication instead of the overflow-protecting MemoryLayout#scale actually seems to have a bigger impact on performance than this.</div><div><br></div><div>> hsdis</div><div><br></div><div>I actually looked at hsdis and tried to use Jorn's precompiled dylib, but it doesn't seem to load for me. Installing the whole toolchain and building my own is probably beyond what I'm trying to do here (esp since I'm not even sure I could read the assembly...)</div><div><br></div><div>Matthias</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Dec 18, 2024 at 3:13 PM Per-Ake Minborg <<a href="mailto:per-ake.minborg@oracle.com" target="_blank">per-ake.minborg@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>


<div dir="ltr">

<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

Hi Matthias!<br>

<br>

I've rewritten the benchmark slightly (just to make them "normalized" the way we use to write them) even though your benchmarks work equally well. See attachment. By using the commands <br>

<br>

</div>

<div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">

jvmArgsAppend = {</div>

<pre><div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">        "-XX:+PrintCompilation",<br>        "-XX:+UnlockDiagnosticVMOptions",<br>        "-XX:+PrintInlining" }</div></pre>

<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

in a <span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace">

@Fork</span> annotation, and observing the output, it appears all the methods are inlined properly. So, even though some methods are more complex, it appears they are treated in the same way when it comes to inlining.<br>

<br>

 By looking at the actual assembly generated for the benchmarks using these commands (for an M1 in my case):<br>

<br>

</div>

<pre><div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">@Fork(value = 1, jvmArgsAppend = {<br>        "-Xbatch",<br>        "-XX:-TieredCompilation",<br>        "-XX:CompileCommand=dontinline,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*",<br>        "-XX:CompileCommand=PrintAssembly,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*"<br>})</div></pre>

<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

I can see that the C2 compiler is able to unroll the segment access in the <span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace">

findAligned</span> method but not in the <span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace">

findAlignedNext method. </span>This is one reason<span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace"> findAligned</span> is faster. In order to see real assembly, the "hsdis-aarch64.dylib" must be present and it is recommended

 to use a "fast-debug" version of the JDK. Read more on Jorn Vernee's blog here: <a href="https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html" id="m_8056903533551950494m_8138333552085303315m_1732376954563610074m_-7185004750415086813m_-2362717009702851623m_4218413662007508028LPlnk828414" target="_blank">

https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html</a></div>

<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

<br>

</div>

<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

<br>

</div>

<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

The question then becomes why that is the case. This drives us into another field of expertise where I am not the right person to provide an answer. Generally, there is no guarantee as to how the C2 compiler works and we are improving it continuously. Maybe

 someone else can provide additional information.</div>

<div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">

<br>

</div>

<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

Best, Per Minborg</div>

<div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">

<br>

</div>

<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">

<br>

<br>

<br>

</div>

<div id="m_8056903533551950494m_8138333552085303315m_1732376954563610074m_-7185004750415086813m_-2362717009702851623m_4218413662007508028appendonsend"></div>

<hr style="display:inline-block;width:98%">

<div id="m_8056903533551950494m_8138333552085303315m_1732376954563610074m_-7185004750415086813m_-2362717009702851623m_4218413662007508028divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> panama-dev <<a href="mailto:panama-dev-retn@openjdk.org" target="_blank">panama-dev-retn@openjdk.org</a>> on behalf of Matthias Ernst <<a href="mailto:matthias@mernst.org" target="_blank">matthias@mernst.org</a>><br>

<b>Sent:</b> Wednesday, December 18, 2024 9:26 AM<br>

<b>To:</b> <a href="mailto:panama-dev@openjdk.org" target="_blank">panama-dev@openjdk.org</a> <<a href="mailto:panama-dev@openjdk.org" target="_blank">panama-dev@openjdk.org</a>><br>

<b>Subject:</b> performance: arrayElementVarHandle / calculated index / aligned vs unaligned</font>

<div> </div>

</div>

<div>

<div dir="ltr">Hi,

<div><br>

</div>

<div>I'm trying to use the foreign memory api to interpret some variable-length encoded data, where an offset vector encodes the start offset of each stride. Accessing element `i` in this case involves reading `offset[i+1]` in addition to `offset[i]`. The offset

 vector is modeled as a `JAVA_LONG.arrayElementVarHandle()`.</div>

<div><br>

</div>

<div>Just out of curiosity about bounds and alignment checks I switched the layout to JAVA_LONG_UNALIGNED for reading (data is still aligned) and I saw a large difference in performance where I didn't expect one, and it seems to boil down to the computed index

 `endOffset[i+1]` access, not for the `[i]` case. My expectation would have been that all variants exhibit the same performance, since alignment checks would be moved out of the loop.<br>

<br>

</div>

<div>A micro-benchmark (attached) to demonstrate:</div>

<div>long-aligned memory segment, looping over the same elements in 6 different ways:</div>

<div>{aligned, unaligned} x {segment[i] , segment[i+1],  segment[i+1] (w/ base offset) } gives very different results for aligned[i+1] (but not for aligned[i]):<br>

<br>

Benchmark                         Mode  Cnt    Score   Error  Units<br>

Alignment.findAligned            thrpt       217.050          ops/s<br>

Alignment.findAlignedPlusOne     thrpt       110.366          ops/s. <= #####<br>

Alignment.findAlignedNext    thrpt       110.377          ops/s. <= #####<br>

Alignment.findUnaligned          thrpt       216.591          ops/s<br>

Alignment.findUnalignedPlusOne   thrpt       215.843          ops/s<br>

</div>

<div>Alignment.findUnalignedNext  thrpt       216.483          ops/s</div>

<div>

<div><br>

</div>

</div>

<div>openjdk version "23.0.1" 2024-10-15<br>

OpenJDK Runtime Environment (build 23.0.1+11-39)<br>

OpenJDK 64-Bit Server VM (build 23.0.1+11-39, mixed mode, sharing)</div>

<div>Macbook Air M3</div>

<div><br>

</div>

<div>Needless to say that the difference was smaller with more app code in play, but large enough to give me pause. Likely it wouldn't matter at all but I want to have a better idea which design choices to pay attention to. With the foreign memory api, I find it

 a bit difficult to distinguish convenience from performance-relevant options (e.g. using path expressions vs computed offsets vs using a base offset. Besides "make layouts and varhandles static final" what would be other rules of thumb?)</div>

<div><br>

</div>

<div>Thx</div>

<font color="#888888">

<div>Matthias</div>

<div><br>

</div>

</font></div>

</div>

</div>


</div></blockquote></div>

</blockquote></div>

</blockquote></div>

</blockquote></div>