[External] : Re: performance: arrayElementVarHandle / calculated index / aligned vs unaligned
Per-Ake Minborg
per-ake.minborg at oracle.com
Thu Dec 19 13:58:13 UTC 2024
Hi again Matthias,
I got some help from the team, and we have identified several improvements we can make, including improving the patterns C2 recognizes and generating a VarHandle that checks alignment only on the base offset rather than on each individual array element. Also, there was a bug hiding in this area.
We are tracking these issues in the bug system. So far, we have:
https://bugs.openjdk.org/browse/JDK-8346623
https://bugs.openjdk.org/browse/JDK-8346651
We will look at these in 2025.
Thanks again for identifying the problem Matthias!
Best, Per Minborg
________________________________
From: Matthias Ernst <matthias at mernst.org>
Sent: Thursday, December 19, 2024 2:41 PM
To: Per-Ake Minborg <per-ake.minborg at oracle.com>
Cc: panama-dev at openjdk.org <panama-dev at openjdk.org>; Joe Forkey <joe.forkey at oracle.com>
Subject: [External] : Re: performance: arrayElementVarHandle / calculated index / aligned vs unaligned
Thanks a lot for rewriting/reproducing!
I've in the meantime tried to take some more complexity out:
* replaced arrayElementVarHandle (which uses MemoryLayout#scale with exact math) with a "plain" version (`byteSize() * index`, at the risk of silent overflows).
* I also eliminated the "VarHandle#collectCoordinates(h,1, scale)" in favor of a plain varHandle.get(segment, i * byteSize()), after verifying they have identical performance.
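For reference, the two access styles being compared can be sketched like this (illustrative only, not the attached benchmark; JDK 22+ coordinate types assumed, class and method names are made up):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.invoke.VarHandle;
import static java.lang.foreign.ValueLayout.JAVA_LONG;

public class AccessVariants {
    // arrayElementVarHandle: coordinates are (segment, baseOffset, index);
    // the index is scaled internally with overflow-checked math.
    static final VarHandle ELEMENT = JAVA_LONG.arrayElementVarHandle();

    // plain varHandle: coordinates are (segment, byteOffset); the caller
    // multiplies the index itself, at the risk of silent overflow.
    static final VarHandle PLAIN = JAVA_LONG.varHandle();

    static long viaElementHandle(MemorySegment seg, long i) {
        return (long) ELEMENT.get(seg, 0L, i);
    }

    static long viaPlainHandle(MemorySegment seg, long i) {
        return (long) PLAIN.get(seg, i * JAVA_LONG.byteSize());
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(JAVA_LONG, 4);
            for (long i = 0; i < 4; i++) seg.setAtIndex(JAVA_LONG, i, i * 10);
            if (viaElementHandle(seg, 2) != viaPlainHandle(seg, 2))
                throw new AssertionError();
        }
    }
}
```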
So we're down to a plain "VarHandle.get(segment, i * byteSize)" in four combinations: ({aligned, unaligned} x {i, i+1}), and the original observation still holds:
the combo "aligned read" with "i + 1" somehow trips C2:
Alignment.aligned avgt 0.151 ns/op
Alignment.alignedPlus1 avgt 0.298 ns/op
Alignment.unaligned avgt 0.153 ns/op
Alignment.unalignedPlus1 avgt 0.153 ns/op
This leaves us with the assumption that the alignment check for the +1 case somehow evades C2 optimization.
And indeed, we can remove all the VarHandle business and benchmark only the alignment check, and we see that C2 fails to recognize that it is loop-invariant:
// simplified copy of AbstractMemorySegmentImpl#isAlignedForElement
private static boolean isAligned(MemorySegment segment, long offset, long byteAlignment) {
    return ((segment.address() + offset) & (byteAlignment - 1)) == 0;
}

@Benchmark
public void isAligned() {
    for (long i = 1; i < COUNT; ++i) {
        if (!isAligned(segment, i * 8, 8)) throw new IllegalArgumentException();
    }
}

@Benchmark
public void isAlignedPlus1() {
    for (long i = 0; i < COUNT - 1; ++i) {
        if (!isAligned(segment, (i + 1) * 8, 8)) throw new IllegalArgumentException();
    }
}
=>
Alignment.isAligned thrpt 35160425.491 ops/ns
Alignment.isAlignedPlus1 thrpt 7.242 ops/ns
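The asymmetry suggests the invariant part of the check is hoisted in one form but not the other. Hoisted by hand (illustrative only; names are hypothetical, assuming the alignment divides the 8-byte stride), the check would reduce to:

```java
public class AlignmentHoist {
    static void checkAllAligned(long baseAddress, long count) {
        // One check outside the loop: if the base is 8-aligned, then
        // baseAddress + i * 8 is 8-aligned for every i.
        if ((baseAddress & 7) != 0) throw new IllegalArgumentException();
        for (long i = 1; i < count; ++i) {
            // the per-element check is now provably redundant
            if (((baseAddress + i * 8) & 7) != 0) throw new AssertionError();
        }
    }

    public static void main(String[] args) {
        checkAllAligned(1024, 100);     // aligned base: passes
        try {
            checkAllAligned(1025, 100); // misaligned base: rejected up front
            throw new AssertionError("expected IllegalArgumentException");
        } catch (IllegalArgumentException expected) { }
    }
}
```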
So this seems to be the culprit. Is it an issue? Idk. Using a plain offset multiplication instead of the overflow-protecting MemoryLayout#scale actually seems to have a bigger impact on performance than this.
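For context, the difference between the two: MemoryLayout#scale uses exact math and throws ArithmeticException on overflow, while a plain multiplication wraps silently. A minimal sketch:

```java
import static java.lang.foreign.ValueLayout.JAVA_LONG;

public class ScaleVsPlain {
    public static void main(String[] args) {
        long i = Long.MAX_VALUE / 4;           // a large index
        long plain = i * JAVA_LONG.byteSize(); // wraps around silently
        if (plain >= 0) throw new AssertionError("expected silent overflow");
        try {
            JAVA_LONG.scale(0L, i);            // exact math: throws instead
            throw new AssertionError("expected ArithmeticException");
        } catch (ArithmeticException expected) { }
    }
}
```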
> hsdis
I actually looked at hsdis and tried to use Jorn's precompiled dylib, but it doesn't seem to load for me. Installing the whole toolchain and building my own is probably beyond what I'm trying to do here (especially since I'm not even sure I could read the assembly...)
Matthias
On Wed, Dec 18, 2024 at 3:13 PM Per-Ake Minborg <per-ake.minborg at oracle.com> wrote:
Hi Matthias!
I've rewritten the benchmark slightly (just to "normalize" it the way we usually write them), even though your benchmarks work equally well. See attachment. By using the options
jvmArgsAppend = {
    "-XX:+PrintCompilation",
    "-XX:+UnlockDiagnosticVMOptions",
    "-XX:+PrintInlining" }
in a @Fork annotation and observing the output, it appears that all the methods are inlined properly. So, even though some methods are more complex, they appear to be treated the same way with respect to inlining.
By looking at the actual assembly generated for the benchmarks using these commands (for an M1 in my case):
@Fork(value = 1, jvmArgsAppend = {
    "-Xbatch",
    "-XX:-TieredCompilation",
    "-XX:CompileCommand=dontinline,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*",
    "-XX:CompileCommand=PrintAssembly,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*"
})
I can see that the C2 compiler is able to unroll the segment access in the findAligned method but not in the findAlignedNext method. This is one reason findAligned is faster. In order to see real assembly, the "hsdis-aarch64.dylib" must be present, and it is recommended to use a "fast-debug" build of the JDK. Read more on Jorn Vernee's blog here: https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html
The question then becomes why that is the case. This drives us into another field of expertise where I am not the right person to provide an answer. Generally, there is no guarantee as to how the C2 compiler works and we are improving it continuously. Maybe someone else can provide additional information.
Best, Per Minborg
________________________________
From: panama-dev <panama-dev-retn at openjdk.org> on behalf of Matthias Ernst <matthias at mernst.org>
Sent: Wednesday, December 18, 2024 9:26 AM
To: panama-dev at openjdk.org
Subject: performance: arrayElementVarHandle / calculated index / aligned vs unaligned
Hi,
I'm trying to use the foreign memory api to interpret some variable-length encoded data, where an offset vector encodes the start offset of each stride. Accessing element `i` in this case involves reading `offset[i+1]` in addition to `offset[i]`. The offset vector is modeled as a `JAVA_LONG.arrayElementVarHandle()`.
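The access pattern looks roughly like this (a hypothetical sketch, not my actual code; class and method names are made up):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.invoke.VarHandle;
import static java.lang.foreign.ValueLayout.JAVA_LONG;

public class Strides {
    static final VarHandle OFFSETS = JAVA_LONG.arrayElementVarHandle();

    // Byte length of stride i: offset[i+1] - offset[i].
    static long strideLength(MemorySegment offsets, long i) {
        long start = (long) OFFSETS.get(offsets, 0L, i);
        long end   = (long) OFFSETS.get(offsets, 0L, i + 1); // the "+1" access at issue
        return end - start;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment offsets = arena.allocate(JAVA_LONG, 3);
            offsets.setAtIndex(JAVA_LONG, 0, 0);
            offsets.setAtIndex(JAVA_LONG, 1, 5);
            offsets.setAtIndex(JAVA_LONG, 2, 12);
            if (strideLength(offsets, 1) != 7) throw new AssertionError();
        }
    }
}
```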
Just out of curiosity about bounds and alignment checks, I switched the layout to JAVA_LONG_UNALIGNED for reading (the data is still aligned) and saw a large difference in performance where I didn't expect one. It seems to boil down to the computed-index `endOffset[i+1]` access, not the `[i]` case. My expectation would have been that all variants exhibit the same performance, since alignment checks would be moved out of the loop.
A micro-benchmark (attached) to demonstrate:
A long-aligned memory segment, looping over the same elements in six different ways:
{aligned, unaligned} x {segment[i], segment[i+1], segment[i+1] (w/ base offset)} gives very different results for aligned[i+1] (but not for aligned[i]):
Benchmark Mode Cnt Score Error Units
Alignment.findAligned thrpt 217.050 ops/s
Alignment.findAlignedPlusOne thrpt 110.366 ops/s <= #####
Alignment.findAlignedNext thrpt 110.377 ops/s <= #####
Alignment.findUnaligned thrpt 216.591 ops/s
Alignment.findUnalignedPlusOne thrpt 215.843 ops/s
Alignment.findUnalignedNext thrpt 216.483 ops/s
openjdk version "23.0.1" 2024-10-15
OpenJDK Runtime Environment (build 23.0.1+11-39)
OpenJDK 64-Bit Server VM (build 23.0.1+11-39, mixed mode, sharing)
Macbook Air M3
Needless to say, the difference was smaller with more app code in play, but it was large enough to give me pause. It likely wouldn't matter at all, but I want a better idea of which design choices to pay attention to. With the foreign memory API, I find it a bit difficult to distinguish convenience options from performance-relevant ones (e.g. path expressions vs. computed offsets vs. a base offset). Besides "make layouts and varhandles static final", what would be other rules of thumb?
Thx
Matthias