performance: arrayElementVarHandle / calculated index / aligned vs unaligned
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Thu Jan 23 10:19:38 UTC 2025
Hi,
I believe Emmanuel and Vladimir should be able to help on this (I see
some comments in the PR already).
Maurizio
On 23/01/2025 07:04, Matthias Ernst wrote:
> Could someone help me move this fix
> (https://github.com/openjdk/jdk/pull/22856) over the finish line? I
> don't think we should leave this performance on the table for FFM var
> handles. Thanks!
>
>
> On Sat, Dec 21, 2024 at 3:17 PM Matthias Ernst <matthias at mernst.org>
> wrote:
>
> I see there's a new issue for this:
> https://bugs.openjdk.org/browse/JDK-8346664.
> Started working on a fix: https://github.com/openjdk/jdk/pull/22856
>
>
> On Fri, Dec 20, 2024 at 1:56 PM Matthias Ernst
> <matthias at mernst.org> wrote:
>
> > alignment check for the +1 case somehow evades C2 optimization.
>
> I believe I may understand how this is happening (disclaimer:
> chemistry dog reading openjdk source, apologies if I'm missing
> something):
>
> A dedicated transformation to simplify expressions of the form
> (base + (offset << shift)) & mask was actually introduced for
> Panama in https://github.com/openjdk/jdk/pull/6697:
> https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L2128
> It requires the expression to be a variant of
> AND(ADD(..., SHIFT(offset, shift)), mask).
>
> This is what turns (offset + (i << 8)) & 7 into a loop invariant.
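>
> For illustration, this is the source-level shape of such a strided
> alignment check. A rough sketch (hypothetical name; assuming an
> 8-byte element, i.e. shift 3, and 8-byte alignment, i.e. mask 7):
>
> static boolean alignedAt(long base, long i) {
>     // AND(ADD(base, SHIFT(i, 3)), 7): since (i << 3) & 7 == 0 for every i,
>     // C2 can reduce the whole check to (base & 7) == 0, which is loop-invariant.
>     return ((base + (i << 3)) & 7) == 0;
> }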
>
> However, before this pattern is checked, a "shift" node like
> ((i + 1) << 8) gets expanded into `(i << 8) + (1 << 8)` here:
> https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L961
>
> Now the node contains nested ADDs and no longer matches the
> pattern: it has become AND(ADD(..., ADD(SHIFT(i, shift), constant)), mask).
>
> We can defeat the expansion by using a non-constant "1":
>
> class Alignment {
>     long one = 1;  // not a compile-time constant, so (i + one) << 8 is not reshaped
>     long read(MemorySegment segment, long i) {
>         return (long) handle.get(segment, (i + one) << 8); // magically faster than (i + 1)
>     }
> }
>
> For a fix, one could possibly make
> AndIL_shift_and_mask_is_always_zero recursively descend into
> the ADD tree.
>
>
> On Thu, Dec 19, 2024 at 2:41 PM Matthias Ernst
> <matthias at mernst.org> wrote:
>
> Thanks a lot for rewriting/reproducing!
>
> In the meantime I've tried to take some more complexity out:
> * replaced arrayElementVarHandle (which uses MemoryLayout#scale with
>   exact math) with a "plain" version (`byteSize() * index`, at the
>   risk of silent overflows).
> * eliminated the "VarHandle#collectCoordinates(h, 1, scale)" step in
>   favor of a plain varHandle.get(segment, i * byteSize()), after
>   verifying they have identical performance; the two access shapes
>   are sketched below.
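>
> Roughly, the two shapes compared (hypothetical names; assuming the
> JDK 22+ coordinate scheme in which an array-element var handle takes
> a base offset plus an index, while a plain value-layout var handle
> takes a raw byte offset):
>
> static final VarHandle ELEMENTS = ValueLayout.JAVA_LONG.arrayElementVarHandle();
> static final VarHandle PLAIN    = ValueLayout.JAVA_LONG.varHandle();
>
> long viaArrayElement(MemorySegment segment, long i) {
>     return (long) ELEMENTS.get(segment, 0L, i);   // index scaled with exact (overflow-checked) math
> }
>
> long viaPlainOffset(MemorySegment segment, long i) {
>     return (long) PLAIN.get(segment, i * ValueLayout.JAVA_LONG.byteSize()); // plain multiply
> }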
>
> So we're down to a plain "VarHandle.get(segment, i *
> byteSize)" in four combinations: ({aligned, unaligned} x
> {i, i+1}), and the original observation still holds:
> the combo "aligned read" with "i + 1" somehow trips C2:
> Alignment.aligned          avgt  0.151 ns/op
> Alignment.alignedPlus1     avgt  0.298 ns/op
> Alignment.unaligned        avgt  0.153 ns/op
> Alignment.unalignedPlus1   avgt  0.153 ns/op
>
> This leaves us with the assumption that the alignment
> check for the +1 case somehow evades C2 optimization.
> And indeed, we can remove all the VarHandle business and benchmark
> only the alignment check, and see that C2 fails to recognize that
> the check is loop-invariant:
>
> // simplified copy of AbstractMemorySegmentImpl#isAlignedForElement
> private static boolean isAligned(MemorySegment segment, long offset, long byteAlignment) {
>     return ((segment.address() + offset) & (byteAlignment - 1)) == 0;
> }
>
> @Benchmark
> public void isAligned() {
>     for (long i = 1; i < COUNT; ++i) {
>         if (!isAligned(segment, i * 8, 8)) throw new IllegalArgumentException();
>     }
> }
>
> @Benchmark
> public void isAlignedPlus1() {
>     for (long i = 0; i < COUNT - 1; ++i) {
>         if (!isAligned(segment, (i + 1) * 8, 8)) throw new IllegalArgumentException();
>     }
> }
>
> =>
>
> Alignment.isAligned        thrpt  35160425.491 ops/ns
> Alignment.isAlignedPlus1   thrpt         7.242 ops/ns
>
> So this seems to be the culprit. Is it an issue? Idk.
> Using a plain offset multiplication instead of the
> overflow-protecting MemoryLayout#scale actually seems to
> have a bigger impact on performance than this.
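>
> In sketch form, that is the difference between the following two
> ways of computing the byte offset (assuming MemoryLayout::scale as
> in JDK 22+):
>
> long exact = ValueLayout.JAVA_LONG.scale(0L, i);   // overflow-checked: throws ArithmeticException
> long plain = i * ValueLayout.JAVA_LONG.byteSize(); // plain multiply: wraps silently on overflow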
>
> > hsdis
>
> I actually looked at hsdis and tried to use Jorn's
> precompiled dylib, but it doesn't seem to load for me.
> Installing the whole toolchain and building my own is
> probably beyond what I'm trying to do here (esp since I'm
> not even sure I could read the assembly...)
>
> Matthias
>
>
> On Wed, Dec 18, 2024 at 3:13 PM Per-Ake Minborg
> <per-ake.minborg at oracle.com> wrote:
>
> Hi Matthias!
>
> I've rewritten the benchmarks slightly (just to "normalize" them
> the way we usually write them), even though your benchmarks work
> equally well. See attachment. By using the commands
>
> jvmArgsAppend = {
>         "-XX:+PrintCompilation",
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:+PrintInlining" }
>
> in a @Fork annotation, and observing the output, it
> appears all the methods are inlined properly. So, even
> though some methods are more complex, it appears they
> are treated in the same way when it comes to inlining.
>
> By looking at the actual assembly generated for the
> benchmarks using these commands (for an M1 in my case):
>
> @Fork(value = 1, jvmArgsAppend = {
>         "-Xbatch",
>         "-XX:-TieredCompilation",
>         "-XX:CompileCommand=dontinline,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*",
>         "-XX:CompileCommand=PrintAssembly,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*" })
>
> I can see that the C2 compiler is able to unroll the
> segment access in the findAligned method but not in
> the findAlignedNext method. This is one
> reason findAligned is faster. In order to see real
> assembly, the "hsdis-aarch64.dylib" library must be present,
> and it is recommended to use a "fast-debug" version of
> the JDK. Read more on Jorn Vernee's blog here:
> https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html
>
>
> The question then becomes why that is the case. This
> drives us into another field of expertise where I am
> not the right person to provide an answer. Generally,
> there is no guarantee as to how the C2 compiler works
> and we are improving it continuously. Maybe someone
> else can provide additional information.
>
> Best, Per Minborg
>
>
>
>
> ------------------------------------------------------------------------
> *From:* panama-dev <panama-dev-retn at openjdk.org> on
> behalf of Matthias Ernst <matthias at mernst.org>
> *Sent:* Wednesday, December 18, 2024 9:26 AM
> *To:* panama-dev at openjdk.org <panama-dev at openjdk.org>
> *Subject:* performance: arrayElementVarHandle /
> calculated index / aligned vs unaligned
> Hi,
>
> I'm trying to use the foreign memory API to interpret
> some variable-length encoded data, where an offset
> vector encodes the start offset of each stride.
> Accessing element `i` in this case involves reading
> `offset[i+1]` in addition to `offset[i]`. The offset
> vector is modeled as a
> `JAVA_LONG.arrayElementVarHandle()`.
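>
> As a rough sketch (hypothetical names; assuming the JDK 22+
> coordinate scheme where an array-element var handle takes a base
> offset plus an index), such a lookup looks roughly like this:
>
> static final VarHandle OFFSETS = ValueLayout.JAVA_LONG.arrayElementVarHandle();
>
> // Stride i spans [offsets[i], offsets[i + 1]), so each access reads two adjacent offsets.
> MemorySegment stride(MemorySegment data, MemorySegment offsets, long i) {
>     long start = (long) OFFSETS.get(offsets, 0L, i);
>     long end   = (long) OFFSETS.get(offsets, 0L, i + 1);
>     return data.asSlice(start, end - start);
> }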
>
> Just out of curiosity about bounds and alignment checks, I switched
> the layout to JAVA_LONG_UNALIGNED for reading (the data is still
> aligned) and saw a large difference in performance where I didn't
> expect one. It seems to boil down to the computed-index
> `endOffset[i+1]` access; the `[i]` case is unaffected. My expectation
> would have been that all variants exhibit the same performance, since
> alignment checks would be moved out of the loop.
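>
> (For reference, the only difference between the aligned and
> unaligned variants is the layout used to create the var handle,
> roughly:)
>
> static final VarHandle ALIGNED   = ValueLayout.JAVA_LONG.arrayElementVarHandle();
> static final VarHandle UNALIGNED = ValueLayout.JAVA_LONG_UNALIGNED.arrayElementVarHandle();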
>
> A micro-benchmark (attached) to demonstrate: a long-aligned memory
> segment, looped over the same elements in 6 different ways,
> {aligned, unaligned} x {segment[i], segment[i+1], segment[i+1] (w/
> base offset)}, gives very different results for aligned[i+1] (but
> not for aligned[i]):
>
> Benchmark                        Mode  Cnt    Score   Error  Units
> Alignment.findAligned           thrpt       217.050          ops/s
> Alignment.findAlignedPlusOne    thrpt       110.366          ops/s   <= #####
> Alignment.findAlignedNext       thrpt       110.377          ops/s   <= #####
> Alignment.findUnaligned         thrpt       216.591          ops/s
> Alignment.findUnalignedPlusOne  thrpt       215.843          ops/s
> Alignment.findUnalignedNext     thrpt       216.483          ops/s
>
> openjdk version "23.0.1" 2024-10-15
> OpenJDK Runtime Environment (build 23.0.1+11-39)
> OpenJDK 64-Bit Server VM (build 23.0.1+11-39, mixed
> mode, sharing)
> Macbook Air M3
>
> Needless to say, the difference was smaller with more app code in
> play, but large enough to give me pause. Likely it wouldn't matter
> at all, but I want to have a better idea which design choices to pay
> attention to. With the foreign memory API, I find it a bit difficult
> to distinguish convenience options from performance-relevant ones
> (e.g. using path expressions vs computed offsets vs using a base
> offset). Besides "make layouts and varhandles static final", what
> would be other rules of thumb?
>
> Thx
> Matthias
>