<div dir="ltr">Oops, my bad, I was referring to <a href="https://chriswhocodes.com/hsdis/">https://chriswhocodes.com/hsdis/</a> and I did not double-check the author!<div><br></div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Thu, Dec 19, 2024 at 3:29 PM Jorn Vernee <<a href="mailto:jorn.vernee@oracle.com">jorn.vernee@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><u></u>
<div>
<p>> Jorn's precompiled dylib</p>
<p>Sorry, what? As far as I know I don't distribute precompiled
hsdis binaries :D</p>
<p>Jorn<br>
</p>
<div>On 19-12-2024 14:41, Matthias Ernst
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Thanks a lot for rewriting/reproducing!
<div><br>
</div>
<div>
<div>I've in the meantime tried to take some more complexity
out:</div>
<div>* replaced arrayElementVarHandle (which uses
MemoryLayout#scale with exact math) with a "plain" version
(`byteSize() * index`, at the risk of silent overflows).</div>
<div>* I also eliminated the
"VarHandle#collectCoordinates(h,1, scale)" in favor of a
plain varHandle.get(segment, i * byteSize()), after
verifying they have identical performance.</div>
<div><br>
</div>
<div>So we're down to a plain "VarHandle.get(segment, i *
byteSize)" in four combinations: ({aligned, unaligned} x
{i, i+1}), and the original observation still holds:<br>
the combo "aligned read" with "i + 1" somehow trips C2:</div>
<div>Alignment.aligned avgt 0.151 ns/op<br>
Alignment.alignedPlus1 avgt 0.298 ns/op<br>
Alignment.unaligned avgt 0.153 ns/op<br>
</div>
<div>
<div>Alignment.unalignedPlus1 avgt 0.153
ns/op<br>
</div>
</div>
<div><br>
</div>
<div>This leaves us with the assumption that the alignment
check for the +1 case somehow evades C2 optimization.</div>
<div>And indeed, we can remove all VarHandle business and only
benchmark the alignment check, and see that it fails to
recognize that it is loop-invariant:</div>
<div><br>
</div>
</div>
<div> // simplified copy of
AbstractMemorySegmentImpl#isAlignedForElement</div>
<div> private static boolean isAligned(MemorySegment segment,
long offset, long byteAlignment) {<br>
return (((segment.address() + offset)) &
(byteAlignment - 1)) == 0;<br>
}<br>
</div>
<div><br>
</div>
<div> @Benchmark<br>
public void isAligned() {<br>
for (long i = 1; i < COUNT; ++i) {<br>
if (!isAligned(segment, i * 8, 8)) throw new
IllegalArgumentException();<br>
}<br>
}<br>
<br>
@Benchmark<br>
public void isAlignedPlus1() {<br>
for (long i = 0; i < COUNT - 1; ++i) {<br>
if (!isAligned(segment, (i + 1) * 8, 8)) throw new
IllegalArgumentException();<br>
}<br>
}<br>
<br>
=><br>
<br>
Alignment.isAligned thrpt 35160425.491
ops/ns<br>
Alignment.isAlignedPlus1 thrpt 7.242
ops/ns<br>
<br>
</div>
<div>So this seems to be the culprit. Is it an issue? Idk. Using
a plain offset multiplication instead of the
overflow-protecting MemoryLayout#scale actually seems to have
a bigger impact on performance than this.</div>
<div><br>
</div>
<div>> hsdis</div>
<div><br>
</div>
<div>I actually looked at hsdis and tried to use Jorn's
precompiled dylib, but it doesn't seem to load for me.
Installing the whole toolchain and building my own is probably
beyond what I'm trying to do here (esp since I'm not even sure
I could read the assembly...)</div>
<div><br>
</div>
<div>Matthias</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Dec 18, 2024 at
3:13 PM Per-Ake Minborg <<a href="mailto:per-ake.minborg@oracle.com" target="_blank">per-ake.minborg@oracle.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div dir="ltr">
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hi Matthias!<br>
<br>
I've rewritten the benchmark slightly (just to make them
"normalized" the way we use to write them) even though
your benchmarks work equally well. See attachment. By
using the commands <br>
<br>
</div>
<div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">
jvmArgsAppend = {</div>
<pre><div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)"> "-XX:+PrintCompilation",
"-XX:+UnlockDiagnosticVMOptions",
"-XX:+PrintInlining" }</div></pre>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
in a <span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace">
@Fork</span> annotation, and observing the output, it
appears all the methods are inlined properly. So, even
though some methods are more complex, it appears they
are treated in the same way when it comes to inlining.<br>
<br>
By looking at the actual assembly generated for the
benchmarks using these commands (for an M1 in my case):<br>
<br>
</div>
<pre><div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">@Fork(value = 1, jvmArgsAppend = {
"-Xbatch",
"-XX:-TieredCompilation",
"-XX:CompileCommand=dontinline,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*",
"-XX:CompileCommand=PrintAssembly,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*"
})</div></pre>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I can see that the C2 compiler is able to unroll the
segment access in the <span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace">
findAligned</span> method but not in the <span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace">
findAlignedNext method. </span>This is one reason<span style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace"> findAligned</span> is
faster. In order to see real assembly, the
"hsdis-aarch64.dylib" must be present and it is
recommended to use a "fast-debug" version of the JDK.
Read more on Jorn Vernee's blog here: <a href="https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html" id="m_-6374670115747780125m_-2362717009702851623m_4218413662007508028LPlnk828414" target="_blank">
https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html</a></div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
The question then becomes why that is the case. This
drives us into another field of expertise where I am not
the right person to provide an answer. Generally, there
is no guarantee as to how the C2 compiler works and we
are improving it continuously. Maybe someone else can
provide additional information.</div>
<div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Best, Per Minborg</div>
<div style="font-family:"Aptos Mono",Aptos_EmbeddedFont,Aptos_MSFontService,monospace;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
<br>
<br>
</div>
<hr style="display:inline-block;width:98%">
<div id="m_-6374670115747780125m_-2362717009702851623m_4218413662007508028divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> panama-dev <<a href="mailto:panama-dev-retn@openjdk.org" target="_blank">panama-dev-retn@openjdk.org</a>>
on behalf of Matthias Ernst <<a href="mailto:matthias@mernst.org" target="_blank">matthias@mernst.org</a>><br>
<b>Sent:</b> Wednesday, December 18, 2024 9:26 AM<br>
<b>To:</b> <a href="mailto:panama-dev@openjdk.org" target="_blank">panama-dev@openjdk.org</a>
<<a href="mailto:panama-dev@openjdk.org" target="_blank">panama-dev@openjdk.org</a>><br>
<b>Subject:</b> performance: arrayElementVarHandle /
calculated index / aligned vs unaligned</font>
<div> </div>
</div>
<div>
<div dir="ltr">Hi,
<div><br>
</div>
<div>I'm trying to use the foreign memory api to
interpret some variable-length encoded data, where
an offset vector encodes the start offset of each
stride. Accessing element `i` in this case involves
reading `offset[i+1]` in addition to `offset[i]`.
The offset vector is modeled as a
`JAVA_LONG.arrayElementVarHandle()`.</div>
<div><br>
</div>
<div>Just out of curiosity about bounds and alignment
checks I switched the layout to JAVA_LONG_UNALIGNED
for reading (data is still aligned) and I saw a
large difference in performance where I didn't
expect one, and it seems to boil down to the
computed index `endOffset[i+1]` access, not for the
`[i]` case. My expectation would have been that all
variants exhibit the same performance, since
alignment checks would be moved out of the loop.<br>
<br>
</div>
<div>A micro-benchmark (attached) to demonstrate:</div>
<div>long-aligned memory segment, looping over the
same elements in 6 different ways:</div>
<div>{aligned, unaligned} x {segment[i] ,
segment[i+1], segment[i+1] (w/ base offset) } gives
very different results for aligned[i+1] (but not for
aligned[i]):<br>
<br>
Benchmark Mode Cnt Score
Error Units<br>
Alignment.findAligned thrpt 217.050
ops/s<br>
Alignment.findAlignedPlusOne thrpt 110.366
ops/s. <= #####<br>
Alignment.findAlignedNext thrpt 110.377
ops/s. <= #####<br>
Alignment.findUnaligned thrpt 216.591
ops/s<br>
Alignment.findUnalignedPlusOne thrpt 215.843
ops/s<br>
</div>
<div>Alignment.findUnalignedNext thrpt 216.483
ops/s</div>
<div>
<div><br>
</div>
</div>
<div>openjdk version "23.0.1" 2024-10-15<br>
OpenJDK Runtime Environment (build 23.0.1+11-39)<br>
OpenJDK 64-Bit Server VM (build 23.0.1+11-39, mixed
mode, sharing)</div>
<div>Macbook Air M3</div>
<div><br>
</div>
<div>Needless to say that the difference was smaller
with more app code in play, but large enough to give
me pause. Likely it wouldn't matter at all but I
want to have a better idea which design choices to
pay attention to. With the foreign memory api, I
find it a bit difficult to distinguish convenience
from performance-relevant options (e.g. using path
expressions vs computed offsets vs using a base
offset. Besides "make layouts and varhandles static
final" what would be other rules of thumb?)</div>
<div><br>
</div>
<div>Thx</div>
<font color="#888888">
<div>Matthias</div>
<div><br>
</div>
</font></div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote></div>