performance: arrayElementVarHandle / calculated index / aligned vs unaligned
Jorn Vernee
jorn.vernee at oracle.com
Thu Dec 19 14:29:24 UTC 2024
> Jorn's precompiled dylib
Sorry, what? As far as I know I don't distribute precompiled hsdis
binaries :D
Jorn
On 19-12-2024 14:41, Matthias Ernst wrote:
> Thanks a lot for rewriting/reproducing!
>
> I've in the meantime tried to take some more complexity out:
> * replaced arrayElementVarHandle (which uses MemoryLayout#scale with
> exact math) with a "plain" version (`byteSize() * index`, at the risk
> of silent overflows).
> * I also eliminated the "VarHandle#collectCoordinates(h,1, scale)" in
> favor of a plain varHandle.get(segment, i * byteSize()), after
> verifying they have identical performance.
>
> So we're down to a plain "VarHandle.get(segment, i * byteSize)" in
> four combinations: ({aligned, unaligned} x {i, i+1}), and the original
> observation still holds:
> the combo "aligned read" with "i + 1" somehow trips C2:
> Alignment.aligned avgt 0.151 ns/op
> Alignment.alignedPlus1 avgt 0.298 ns/op
> Alignment.unaligned avgt 0.153 ns/op
> Alignment.unalignedPlus1 avgt 0.153 ns/op
>
> This leaves us with the assumption that the alignment check for the +1
> case somehow evades C2 optimization.
> And indeed, we can remove all VarHandle business and only benchmark
> the alignment check, and see that it fails to recognize that it is
> loop-invariant:
>
> // simplified copy of AbstractMemorySegmentImpl#isAlignedForElement
> private static boolean isAligned(MemorySegment segment, long
> offset, long byteAlignment) {
> return (((segment.address() + offset)) & (byteAlignment - 1))
> == 0;
> }
>
> @Benchmark
> public void isAligned() {
> for (long i = 1; i < COUNT; ++i) {
> if (!isAligned(segment, i * 8, 8)) throw new
> IllegalArgumentException();
> }
> }
>
> @Benchmark
> public void isAlignedPlus1() {
> for (long i = 0; i < COUNT - 1; ++i) {
> if (!isAligned(segment, (i + 1) * 8, 8)) throw new
> IllegalArgumentException();
> }
> }
>
> =>
>
> Alignment.isAligned thrpt 35160425.491 ops/ns
> Alignment.isAlignedPlus1 thrpt 7.242 ops/ns
>
> So this seems to be the culprit. Is it an issue? Idk. Using a plain
> offset multiplication instead of the overflow-protecting
> MemoryLayout#scale actually seems to have a bigger impact on
> performance than this.
>
> > hsdis
>
> I actually looked at hsdis and tried to use Jorn's precompiled dylib,
> but it doesn't seem to load for me. Installing the whole toolchain and
> building my own is probably beyond what I'm trying to do here (esp
> since I'm not even sure I could read the assembly...)
>
> Matthias
>
>
> On Wed, Dec 18, 2024 at 3:13 PM Per-Ake Minborg
> <per-ake.minborg at oracle.com> wrote:
>
> Hi Matthias!
>
> I've rewritten the benchmark slightly (just to make them
> "normalized" the way we use to write them) even though your
> benchmarks work equally well. See attachment. By using the commands
>
> jvmArgsAppend = {
>
> "-XX:+PrintCompilation",
> "-XX:+UnlockDiagnosticVMOptions", "-XX:+PrintInlining" }
>
> in a @Fork annotation, and observing the output, it appears all
> the methods are inlined properly. So, even though some methods are
> more complex, it appears they are treated in the same way when it
> comes to inlining.
>
> By looking at the actual assembly generated for the benchmarks
> using these commands (for an M1 in my case):
>
> @Fork(value = 1, jvmArgsAppend = { "-Xbatch",
> "-XX:-TieredCompilation",
> "-XX:CompileCommand=dontinline,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*",
>
> "-XX:CompileCommand=PrintAssembly,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*"
> })
>
> I can see that the C2 compiler is able to unroll the segment
> access in the findAligned method but not in the findAlignedNext
> method. This is one reason findAligned is faster. In order to see
> real assembly, the "hsdis-aarch64.dylib" must be present and it is
> recommended to use a "fast-debug" version of the JDK. Read more on
> Jorn Vernee's blog here:
> https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html
>
>
> The question then becomes why that is the case. This drives us
> into another field of expertise where I am not the right person to
> provide an answer. Generally, there is no guarantee as to how the
> C2 compiler works and we are improving it continuously. Maybe
> someone else can provide additional information.
>
> Best, Per Minborg
>
>
>
>
> ------------------------------------------------------------------------
> *From:* panama-dev <panama-dev-retn at openjdk.org> on behalf of
> Matthias Ernst <matthias at mernst.org>
> *Sent:* Wednesday, December 18, 2024 9:26 AM
> *To:* panama-dev at openjdk.org <panama-dev at openjdk.org>
> *Subject:* performance: arrayElementVarHandle / calculated index /
> aligned vs unaligned
> Hi,
>
> I'm trying to use the foreign memory api to interpret some
> variable-length encoded data, where an offset vector encodes the
> start offset of each stride. Accessing element `i` in this case
> involves reading `offset[i+1]` in addition to `offset[i]`. The
> offset vector is modeled as a `JAVA_LONG.arrayElementVarHandle()`.
>
> Just out of curiosity about bounds and alignment checks I switched
> the layout to JAVA_LONG_UNALIGNED for reading (data is still
> aligned) and I saw a large difference in performance where I
> didn't expect one, and it seems to boil down to the computed index
> `endOffset[i+1]` access, not for the `[i]` case. My expectation
> would have been that all variants exhibit the same
> performance, since alignment checks would be moved out of the loop.
>
> A micro-benchmark (attached) to demonstrate:
> long-aligned memory segment, looping over the same elements in 6
> different ways:
> {aligned, unaligned} x {segment[i] , segment[i+1], segment[i+1]
> (w/ base offset) } gives very different results for aligned[i+1]
> (but not for aligned[i]):
>
> Benchmark Mode Cnt Score Error Units
> Alignment.findAligned thrpt 217.050 ops/s
> Alignment.findAlignedPlusOne thrpt 110.366
> ops/s. <= #####
> Alignment.findAlignedNext thrpt 110.377 ops/s. <= #####
> Alignment.findUnaligned thrpt 216.591 ops/s
> Alignment.findUnalignedPlusOne thrpt 215.843 ops/s
> Alignment.findUnalignedNext thrpt 216.483 ops/s
>
> openjdk version "23.0.1" 2024-10-15
> OpenJDK Runtime Environment (build 23.0.1+11-39)
> OpenJDK 64-Bit Server VM (build 23.0.1+11-39, mixed mode, sharing)
> Macbook Air M3
>
> Needless to say that the difference was smaller with more app code
> in play, but large enough to give me pause. Likely it
> wouldn't matter at all but I want to have a better idea which
> design choices to pay attention to. With the foreign memory api, I
> find it a bit difficult to distinguish convenience from
> performance-relevant options (e.g. using path expressions vs
> computed offsets vs using a base offset. Besides "make layouts and
> varhandles static final" what would be other rules of thumb?)
>
> Thx
> Matthias
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20241219/af800115/attachment-0001.htm>
More information about the panama-dev
mailing list