performance: arrayElementVarHandle / calculated index / aligned vs unaligned

Maurizio Cimadamore maurizio.cimadamore at oracle.com
Thu Jan 23 10:19:38 UTC 2025


Hi,
I believe Emmanuel and Vladimir should be able to help on this (I see 
some comments in the PR already)

Maurizio

On 23/01/2025 07:04, Matthias Ernst wrote:
> Could someone help me move this fix 
> (https://github.com/openjdk/jdk/pull/22856) over the finish line? I 
> don't think we should leave this performance on the table for FFM var 
> handles. Thanks!
>
>
> On Sat, Dec 21, 2024 at 3:17 PM Matthias Ernst <matthias at mernst.org> 
> wrote:
>
>     I see there's a new issue for this:
>     https://bugs.openjdk.org/browse/JDK-8346664.
>     Started working on a fix: https://github.com/openjdk/jdk/pull/22856
>
>
>     On Fri, Dec 20, 2024 at 1:56 PM Matthias Ernst
>     <matthias at mernst.org> wrote:
>
>         > alignment check for the +1 case somehow evades C2 optimization.
>
>         I believe I may understand how this is happening (disclaimer:
>         chemistry dog reading openjdk source, apologies if I'm missing
>         something):
>
>         A dedicated transformation to simplify (base + offset <<
>         shift) & mask type expressions was actually introduced for
>         Panama in https://github.com/openjdk/jdk/pull/6697:
>         https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L2128
>         It requires the expression to be a variant of
>         AND(ADD(..., SHIFT(offset, shift)), mask).
>
>         This is what turns (offset + (i << 8)) & 7 into a loop invariant.
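The identity behind this transformation can be checked directly in plain Java (class and method names here are purely illustrative): because the mask 7 only keeps bits below the shift distance 8, the shifted term contributes nothing to the masked result, so the whole expression depends only on the base offset.

```java
public class ShiftMaskDemo {
    // If the shift is at least as wide as the mask (here: shift 8, mask 7),
    // every bit of (i << shift) that the mask keeps is zero, so the masked
    // sum depends only on `offset`, not on `i`.
    static long masked(long offset, long i) {
        return (offset + (i << 8)) & 7;
    }

    public static void main(String[] args) {
        long offset = 0x1003;
        for (long i = 0; i < 100_000; i++) {
            if (masked(offset, i) != (offset & 7)) {
                throw new AssertionError("varies at i=" + i);
            }
        }
        System.out.println("(offset + (i << 8)) & 7 == offset & 7 for all i");
    }
}
```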
>
>         However, before this pattern is checked, a "shift" node like
>         ((i + 1) << 8) gets expanded into `(i << 8) + (1 << 8)` here:
>         https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L961
>
>         Now the node contains nested ADDs and no longer matches the
>         pattern: AND(ADD(..., ADD(SHIFT(offset, shift), constant)), mask).
>
>         We can defeat the expansion by using a non-constant "1":
>         class Alignment {
>           long one = 1;
>           handle.get(segment, (i + one) << 8) // magically faster than (i + 1)
>         }
>
>         For a fix, one could possibly make
>         AndIL_shift_and_mask_is_always_zero recursively descend into
>         the ADD tree.
>
>
>         On Thu, Dec 19, 2024 at 2:41 PM Matthias Ernst
>         <matthias at mernst.org> wrote:
>
>             Thanks a lot for rewriting/reproducing!
>
>             In the meantime, I've tried to take some more complexity out:
>             * replaced arrayElementVarHandle (which uses
>             MemoryLayout#scale with exact math) with a "plain" version
>             (`byteSize() * index`, at the risk of silent overflows).
>             * I also eliminated the "VarHandle#collectCoordinates(h,1,
>             scale)" in favor of a plain varHandle.get(segment, i *
>             byteSize()), after verifying they have identical performance.
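The two access shapes being compared can be sketched like this (a minimal sketch against the JDK 22+ FFM API; class and method names are illustrative, not from the original benchmark):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.VarHandle;

public class PlainVsArrayElement {
    // Overflow-checked index scaling done by the var handle itself.
    static final VarHandle BY_INDEX = ValueLayout.JAVA_LONG.arrayElementVarHandle();
    // Plain var handle; the caller scales the offset, risking silent overflow.
    static final VarHandle BY_OFFSET = ValueLayout.JAVA_LONG.varHandle();

    static boolean variantsAgree() {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(ValueLayout.JAVA_LONG, 16);
            for (long i = 0; i < 16; i++) {
                seg.setAtIndex(ValueLayout.JAVA_LONG, i, i * 10);
            }
            long i = 7;
            long viaIndex = (long) BY_INDEX.get(seg, 0L, i);
            long viaOffset = (long) BY_OFFSET.get(seg, i * ValueLayout.JAVA_LONG.byteSize());
            return viaIndex == viaOffset && viaIndex == 70;
        }
    }

    public static void main(String[] args) {
        if (!variantsAgree()) throw new AssertionError();
        System.out.println("both access variants read the same value");
    }
}
```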
>
>             So we're down to a plain "VarHandle.get(segment, i *
>             byteSize)"  in four combinations: ({aligned, unaligned} x
>             {i, i+1}), and the original observation still holds:
>             the combo "aligned read" with "i + 1" somehow trips C2:
>             Alignment.aligned         avgt  0.151  ns/op
>             Alignment.alignedPlus1    avgt  0.298  ns/op
>             Alignment.unaligned       avgt  0.153  ns/op
>             Alignment.unalignedPlus1  avgt  0.153  ns/op
>
>             This leaves us with the assumption that the alignment
>             check for the +1 case somehow evades C2 optimization.
>             And indeed, we can remove all VarHandle business and only
>             benchmark the alignment check, and see that it fails to
>             recognize that it is loop-invariant:
>
>                 // simplified copy of AbstractMemorySegmentImpl#isAlignedForElement
>                 private static boolean isAligned(MemorySegment segment, long offset, long byteAlignment) {
>                     return ((segment.address() + offset) & (byteAlignment - 1)) == 0;
>                 }
>
>                 @Benchmark
>                 public void isAligned() {
>                     for (long i = 1; i < COUNT; ++i) {
>                         if (!isAligned(segment, i * 8, 8)) throw new IllegalArgumentException();
>                     }
>                 }
>
>                 @Benchmark
>                 public void isAlignedPlus1() {
>                     for (long i = 0; i < COUNT - 1; ++i) {
>                         if (!isAligned(segment, (i + 1) * 8, 8)) throw new IllegalArgumentException();
>                     }
>                 }
>
>             =>
>
>             Alignment.isAligned       thrpt  35160425.491  ops/ns
>             Alignment.isAlignedPlus1  thrpt         7.242  ops/ns
>
>             So this seems to be the culprit. Is it an issue? Idk.
>             Using a plain offset multiplication instead of the
>             overflow-protecting MemoryLayout#scale actually seems to
>             have a bigger impact on performance than this.
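The loop-invariance claim is easy to verify in plain Java (class name is illustrative): since (i + 1) * 8 equals i * 8 + 8 and both terms are multiples of 8, the low three bits of the checked sum depend only on the base address.

```java
public class AlignmentInvariance {
    // Same check as the simplified isAligned above, on a raw address.
    static boolean isAligned(long address, long offset, long byteAlignment) {
        return ((address + offset) & (byteAlignment - 1)) == 0;
    }

    // (i + 1) * 8 == i * 8 + 8, and both terms are multiples of 8, so the
    // check returns the same answer for every i.
    static boolean plusOneIsInvariant(long base, long count) {
        boolean first = isAligned(base, 8, 8);
        for (long i = 0; i < count; i++) {
            if (isAligned(base, (i + 1) * 8, 8) != first) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        if (!plusOneIsInvariant(0x1000, 10_000)) throw new AssertionError();
        if (!plusOneIsInvariant(0x1003, 10_000)) throw new AssertionError();
        System.out.println("(base + (i + 1) * 8) & 7 does not depend on i");
    }
}
```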
>
>             > hsdis
>
>             I actually looked at hsdis and tried to use Jorn's
>             precompiled dylib, but it doesn't seem to load for me.
>             Installing the whole toolchain and building my own is
>             probably beyond what I'm trying to do here (esp since I'm
>             not even sure I could read the assembly...)
>
>             Matthias
>
>
>             On Wed, Dec 18, 2024 at 3:13 PM Per-Ake Minborg
>             <per-ake.minborg at oracle.com> wrote:
>
>                 Hi Matthias!
>
>                 I've rewritten the benchmarks slightly (just to
>                 "normalize" them the way we usually write them), even
>                 though your benchmarks work equally well. See the
>                 attachment. By adding the options
>
>                 jvmArgsAppend = {
>                     "-XX:+PrintCompilation",
>                     "-XX:+UnlockDiagnosticVMOptions",
>                     "-XX:+PrintInlining"
>                 }
>
>                 in a @Fork annotation, and observing the output, it
>                 appears all the methods are inlined properly. So, even
>                 though some methods are more complex, it appears they
>                 are treated in the same way when it comes to inlining.
>
>                 By looking at the actual assembly generated for the
>                 benchmarks using these commands (for an M1 in my case):
>
>                 @Fork(value = 1, jvmArgsAppend = {
>                     "-Xbatch",
>                     "-XX:-TieredCompilation",
>                     "-XX:CompileCommand=dontinline,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*",
>                     "-XX:CompileCommand=PrintAssembly,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*"
>                 })
>
>                 I can see that the C2 compiler is able to unroll the
>                 segment access in the findAligned method but not in
>                 the findAlignedNext method. This is one
>                 reason findAligned is faster. In order to see real
>                 assembly, the "hsdis-aarch64.dylib" must be present
>                 and it is recommended to use a "fast-debug" version of
>                 the JDK. Read more on Jorn Vernee's blog here:
>                 https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html
>
>
>                 The question then becomes why that is the case. This
>                 drives us into another field of expertise where I am
>                 not the right person to provide an answer. Generally,
>                 there is no guarantee as to how the C2 compiler works
>                 and we are improving it continuously. Maybe someone
>                 else can provide additional information.
>
>                 Best, Per Minborg
>
>
>
>
>                 ------------------------------------------------------------------------
>                 *From:* panama-dev <panama-dev-retn at openjdk.org> on
>                 behalf of Matthias Ernst <matthias at mernst.org>
>                 *Sent:* Wednesday, December 18, 2024 9:26 AM
>                 *To:* panama-dev at openjdk.org <panama-dev at openjdk.org>
>                 *Subject:* performance: arrayElementVarHandle /
>                 calculated index / aligned vs unaligned
>                 Hi,
>
>                 I'm trying to use the foreign memory api to interpret
>                 some variable-length encoded data, where an offset
>                 vector encodes the start offset of each stride.
>                 Accessing element `i` in this case involves reading
>                 `offset[i+1]` in addition to `offset[i]`. The offset
>                 vector is modeled as a
>                 `JAVA_LONG.arrayElementVarHandle()`.
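A minimal sketch of that access pattern (JDK 22+ FFM API; class and method names are hypothetical, not from the original code): reading stride i touches two adjacent slots of the offset vector.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class OffsetVector {
    // offsets[i] and offsets[i + 1] delimit stride i.
    static long strideLength(MemorySegment offsets, long i) {
        long start = offsets.getAtIndex(ValueLayout.JAVA_LONG, i);
        long end = offsets.getAtIndex(ValueLayout.JAVA_LONG, i + 1);
        return end - start;
    }

    static long demo() {
        try (Arena arena = Arena.ofConfined()) {
            // Offset vector delimiting three strides of lengths 5, 7 and 8.
            MemorySegment offsets = arena.allocateFrom(ValueLayout.JAVA_LONG, 0L, 5L, 12L, 20L);
            return strideLength(offsets, 1);
        }
    }

    public static void main(String[] args) {
        if (demo() != 7) throw new AssertionError();
        System.out.println("stride 1 has length " + demo());
    }
}
```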
>
>                 Just out of curiosity about bounds and alignment
>                 checks I switched the layout to JAVA_LONG_UNALIGNED
>                 for reading (data is still aligned) and I saw a large
>                 difference in performance where I didn't expect
>                 one, and it seems to boil down to the computed index
>                 `endOffset[i+1]` access, not for the `[i]` case. My
>                 expectation would have been that all variants exhibit
>                 the same performance, since alignment checks would be
>                 moved out of the loop.
>
>                 A micro-benchmark (attached) demonstrates this: a
>                 long-aligned memory segment, looping over the same
>                 elements in six different ways,
>                 {aligned, unaligned} x {segment[i], segment[i+1],
>                 segment[i+1] (w/ base offset)}, gives very different
>                 results for aligned[i+1] (but not for aligned[i]):
>
>                 Benchmark                       Mode     Score  Units
>                 Alignment.findAligned           thrpt  217.050  ops/s
>                 Alignment.findAlignedPlusOne    thrpt  110.366  ops/s  <= #####
>                 Alignment.findAlignedNext       thrpt  110.377  ops/s  <= #####
>                 Alignment.findUnaligned         thrpt  216.591  ops/s
>                 Alignment.findUnalignedPlusOne  thrpt  215.843  ops/s
>                 Alignment.findUnalignedNext     thrpt  216.483  ops/s
>
>                 openjdk version "23.0.1" 2024-10-15
>                 OpenJDK Runtime Environment (build 23.0.1+11-39)
>                 OpenJDK 64-Bit Server VM (build 23.0.1+11-39, mixed
>                 mode, sharing)
>                 Macbook Air M3
>
>                 Needless to say, the difference was smaller with
>                 more app code in play, but large enough to give me
>                 pause. Likely it wouldn't matter at all, but I want to
>                 have a better idea of which design choices to pay
>                 attention to. With the foreign memory API, I find it a
>                 bit difficult to distinguish convenience options from
>                 performance-relevant ones (e.g. path expressions vs.
>                 computed offsets vs. a base offset). Besides "make
>                 layouts and var handles static final", what would be
>                 other rules of thumb?
>
>                 Thx
>                 Matthias
>