<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>Hi,<br>
I believe Emmanuel and Vladimir should be able to help on this (I
see some comments in the PR already)</p>
<p>Maurizio<br>
</p>
<div class="moz-cite-prefix">On 23/01/2025 07:04, Matthias Ernst
wrote:<br>
</div>
<blockquote type="cite" cite="mid:CAKJ3wwEGO5gj7mF+hbY-Bt845XPK+GyMKZo43uRMwof3dBRiXw@mail.gmail.com">
<div dir="ltr">Could someone help me move this fix (<a href="https://github.com/openjdk/jdk/pull/22856" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/pull/22856</a>)
over the finish line? I don't think we should leave this
performance on the table for FFM var handles. Thanks!
<div><br>
</div>
</div>
<br>
<div class="gmail_quote gmail_quote_container">
<div dir="ltr" class="gmail_attr">On Sat, Dec 21, 2024 at
3:17 PM Matthias Ernst <<a href="mailto:matthias@mernst.org" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">I see there's a new issue for this: <a href="https://bugs.openjdk.org/browse/JDK-8346664" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://bugs.openjdk.org/browse/JDK-8346664</a>.
<div>Started working on a fix: <a href="https://github.com/openjdk/jdk/pull/22856" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/pull/22856</a>
<div><br>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Dec 20, 2024 at
1:56 PM Matthias Ernst <<a href="mailto:matthias@mernst.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">> alignment check for the +1 case
somehow evades C2 optimization.<br>
<div><br>
</div>
<div>I believe I may understand how this is happening
(disclaimer: chemistry dog reading openjdk source,
apologies if I'm missing something):</div>
<div><br>
</div>
<div>A dedicated transformation to simplify (base +
offset << shift) & mask type expressions was
actually introduced for Panama in <a href="https://github.com/openjdk/jdk/pull/6697" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/pull/6697</a>:</div>
<div><a href="https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L2128" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L2128</a></div>
<div>It requires the expression to be a variant of
AND(ADD(..., SHIFT(offset, shift), mask).</div>
<div><br>
</div>
<div>This is what turns (offset + i << 8) & 7
into a loop-invariant.</div>
<div><br>
</div>
<div>However, before this pattern is checked, a "shift"
node like ((i+1) << 8) gets expanded into `i
<< 8 + 8` here: <a href="https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L961" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L961</a></div>
<div><br>
</div>
<div>Now the node contains nested ADDs and no longer
matches the pattern: AND(ADD(..., ADD(SHIFT(offset,
shift), shift)), mask) . </div>
<div><br>
</div>
<div>We can defeat the expansion by using a non-constant
"1":</div>
<div>class Aligmnent {</div>
<div> long one = 1;</div>
<div> handle.get(segment, (i+one) << 8) //
magically faster than (i+1)</div>
<div>}</div>
<div><br>
</div>
<div>For a fix, one could possibly make
AndIL_shift_and_mask_is_always_zero recursively
descend into the ADD tree.</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Dec 19, 2024
at 2:41 PM Matthias Ernst <<a href="mailto:matthias@mernst.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">Thanks a lot for rewriting/reproducing!
<div><br>
</div>
<div>
<div>I've in the meantime tried to take some more
complexity out:</div>
<div>* replaced arrayElementVarHandle (which uses
MemoryLayout#scale with exact math) with a
"plain" version (`byteSize() * index`, at the
risk of silent overflows).</div>
<div>* I also eliminated the
"VarHandle#collectCoordinates(h,1, scale)" in
favor of a plain varHandle.get(segment, i *
byteSize()), after verifying they have identical
performance.</div>
<div><br>
</div>
<div>So we're down to a plain
"VarHandle.get(segment, i * byteSize)" in four
combinations: ({aligned, unaligned} x {i, i+1}),
and the original observation still holds:<br>
the combo "aligned read" with "i + 1" somehow
trips C2:</div>
<div>Alignment.aligned avgt 0.151
ns/op<br>
Alignment.alignedPlus1 avgt 0.298
ns/op<br>
Alignment.unaligned avgt 0.153
ns/op<br>
</div>
<div>
<div>Alignment.unalignedPlus1 avgt 0.153
ns/op<br>
</div>
</div>
<div><br>
</div>
<div>This leaves us with the assumption that the
alignment check for the +1 case somehow evades
C2 optimization.</div>
<div>And indeed, we can remove all VarHandle
business and only benchmark the alignment check,
and see that it fails to recognize that it is
loop-invariant:</div>
<div><br>
</div>
</div>
<div> // simplified copy of
AbstractMemorySegmentImpl#isAlignedForElement</div>
<div> private static boolean
isAligned(MemorySegment segment, long offset, long
byteAlignment) {<br>
return (((segment.address() + offset))
& (byteAlignment - 1)) == 0;<br>
}<br>
</div>
<div><br>
</div>
<div> @Benchmark<br>
public void isAligned() {<br>
for (long i = 1; i < COUNT; ++i) {<br>
if (!isAligned(segment, i * 8, 8))
throw new IllegalArgumentException();<br>
}<br>
}<br>
<br>
@Benchmark<br>
public void isAlignedPlus1() {<br>
for (long i = 0; i < COUNT - 1; ++i) {<br>
if (!isAligned(segment, (i + 1) * 8,
8)) throw new IllegalArgumentException();<br>
}<br>
}<br>
<br>
=><br>
<br>
Alignment.isAligned thrpt 35160425.491
ops/ns<br>
Alignment.isAlignedPlus1 thrpt 7.242
ops/ns<br>
<br>
</div>
<div>So this seems to be the culprit. Is it an
issue? Idk. Using a plain offset multiplication
instead of the overflow-protecting
MemoryLayout#scale actually seems to have a bigger
impact on performance than this.</div>
<div><br>
</div>
<div>> hsdis</div>
<div><br>
</div>
<div>I actually looked at hsdis and tried to
use Jorn's precompiled dylib, but it doesn't seem
to load for me. Installing the whole toolchain and
building my own is probably beyond what I'm trying
to do here (esp since I'm not even sure I could
read the assembly...)</div>
<div><br>
</div>
<div>Matthias</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Dec 18,
2024 at 3:13 PM Per-Ake Minborg <<a href="mailto:per-ake.minborg@oracle.com" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">per-ake.minborg@oracle.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hi Matthias!<br>
<br>
I've rewritten the benchmark slightly (just
to make them "normalized" the way we use to
write them) even though your benchmarks work
equally well. See attachment. By using the
commands <br>
<br>
</div>
<div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">
jvmArgsAppend = {</div>
<pre><div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)"> "-XX:+PrintCompilation",
"-XX:+UnlockDiagnosticVMOptions",
"-XX:+PrintInlining" }</div></pre>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
in a <span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace">
@Fork</span> annotation, and observing the
output, it appears all the methods are
inlined properly. So, even though some
methods are more complex, it appears they
are treated in the same way when it comes to
inlining.<br>
<br>
By looking at the actual assembly generated
for the benchmarks using these commands (for
an M1 in my case):<br>
<br>
</div>
<pre><div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">@Fork(value = 1, jvmArgsAppend = {
"-Xbatch",
"-XX:-TieredCompilation",
"-XX:CompileCommand=dontinline,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*",
"-XX:CompileCommand=PrintAssembly,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*"
})</div></pre>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I can see that the C2 compiler is able to
unroll the segment access in the <span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace">
findAligned</span> method but not in the <span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace">
findAlignedNext method. </span>This is
one reason<span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace"> findAligned</span> is
faster. In order to see real assembly, the
"hsdis-aarch64.dylib" must be present and it
is recommended to use a "fast-debug" version
of the JDK. Read more on Jorn Vernee's blog
here: <a href="https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html" id="m_8056903533551950494m_8138333552085303315m_1732376954563610074m_-7185004750415086813m_-2362717009702851623m_4218413662007508028LPlnk828414" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">
https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html</a></div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
The question then becomes why that is the
case. This drives us into another field of
expertise where I am not the right person to
provide an answer. Generally, there is no
guarantee as to how the C2 compiler works
and we are improving it continuously. Maybe
someone else can provide additional
information.</div>
<div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Best, Per Minborg</div>
<div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
<br>
<br>
</div>
<hr style="display:inline-block;width:98%">
<div id="m_8056903533551950494m_8138333552085303315m_1732376954563610074m_-7185004750415086813m_-2362717009702851623m_4218413662007508028divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b>
panama-dev <<a href="mailto:panama-dev-retn@openjdk.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">panama-dev-retn@openjdk.org</a>>
on behalf of Matthias Ernst <<a href="mailto:matthias@mernst.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">matthias@mernst.org</a>><br>
<b>Sent:</b> Wednesday, December 18, 2024
9:26 AM<br>
<b>To:</b> <a href="mailto:panama-dev@openjdk.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">panama-dev@openjdk.org</a>
<<a href="mailto:panama-dev@openjdk.org" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">panama-dev@openjdk.org</a>><br>
<b>Subject:</b> performance:
arrayElementVarHandle / calculated index /
aligned vs unaligned</font>
<div> </div>
</div>
<div>
<div dir="ltr">Hi,
<div><br>
</div>
<div>I'm trying to use the foreign memory
api to interpret some variable-length
encoded data, where an offset vector
encodes the start offset of each stride.
Accessing element `i` in this case
involves reading `offset[i+1]` in
addition to `offset[i]`. The offset
vector is modeled as a
`JAVA_LONG.arrayElementVarHandle()`.</div>
<div><br>
</div>
<div>Just out of curiosity about bounds
and alignment checks I switched the
layout to JAVA_LONG_UNALIGNED for
reading (data is still aligned) and I
saw a large difference in performance
where I didn't expect one, and it seems
to boil down to the computed index
`endOffset[i+1]` access, not for the
`[i]` case. My expectation would have
been that all variants exhibit the same
performance, since alignment checks
would be moved out of the loop.<br>
<br>
</div>
<div>A micro-benchmark (attached) to
demonstrate:</div>
<div>long-aligned memory segment, looping
over the same elements in 6 different
ways:</div>
<div>{aligned, unaligned} x {segment[i] ,
segment[i+1], segment[i+1] (w/ base
offset) } gives very different results
for aligned[i+1] (but not for
aligned[i]):<br>
<br>
Benchmark Mode
Cnt Score Error Units<br>
Alignment.findAligned thrpt
217.050 ops/s<br>
Alignment.findAlignedPlusOne thrpt
110.366 ops/s. <= #####<br>
Alignment.findAlignedNext thrpt
110.377 ops/s. <= #####<br>
Alignment.findUnaligned thrpt
216.591 ops/s<br>
Alignment.findUnalignedPlusOne thrpt
215.843 ops/s<br>
</div>
<div>Alignment.findUnalignedNext thrpt
216.483 ops/s</div>
<div>
<div><br>
</div>
</div>
<div>openjdk version "23.0.1" 2024-10-15<br>
OpenJDK Runtime Environment (build
23.0.1+11-39)<br>
OpenJDK 64-Bit Server VM (build
23.0.1+11-39, mixed mode, sharing)</div>
<div>Macbook Air M3</div>
<div><br>
</div>
<div>Needless to say that the
difference was smaller with more app
code in play, but large enough to give
me pause. Likely it wouldn't matter at
all but I want to have a better idea
which design choices to pay attention
to. With the foreign memory api, I
find it a bit difficult to distinguish
convenience from performance-relevant
options (e.g. using path expressions vs
computed offsets vs using a base offset.
Besides "make layouts and varhandles
static final" what would be other rules
of thumb?)</div>
<div><br>
</div>
<div>Thx</div>
<font color="#888888">
<div>Matthias</div>
<div><br>
</div>
</font></div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</body>
</html>