performance: arrayElementVarHandle / calculated index / aligned vs unaligned

Thu Dec 19 14:29:24 UTC 2024

 > Jorn's precompiled dylib

Sorry, what? As far as I know I don't distribute precompiled hsdis 
binaries :D

Jorn

On 19-12-2024 14:41, Matthias Ernst wrote:
> Thanks a lot for rewriting/reproducing!
>
> I've in the meantime tried to take some more complexity out:
> * replaced arrayElementVarHandle (which uses MemoryLayout#scale with 
> exact math) with a "plain" version (`byteSize() * index`, at the risk 
> of silent overflows).
> * I also eliminated the "VarHandle#collectCoordinates(h,1, scale)" in 
> favor of a plain varHandle.get(segment, i * byteSize()), after 
> verifying they have identical performance.
>
> So we're down to a plain "VarHandle.get(segment, i * byteSize)"  in 
> four combinations: ({aligned, unaligned} x {i, i+1}), and the original 
> observation still holds:
> the combo "aligned read" with "i + 1" somehow trips C2:
> Alignment.aligned         avgt       0.151          ns/op
> Alignment.alignedPlus1    avgt       0.298          ns/op
> Alignment.unaligned       avgt       0.153          ns/op
> Alignment.unalignedPlus1  avgt       0.153  ns/op
>
> This leaves us with the assumption that the alignment check for the +1 
> case somehow evades C2 optimization.
> And indeed, we can remove all VarHandle business and only benchmark 
> the alignment check, and see that it fails to recognize that it is 
> loop-invariant:
>
>     // simplified copy of AbstractMemorySegmentImpl#isAlignedForElement
>     private static boolean isAligned(MemorySegment segment, long 
> offset, long byteAlignment) {
>         return (((segment.address() + offset)) & (byteAlignment - 1)) 
> == 0;
>     }
>
>     @Benchmark
>     public void isAligned() {
>         for (long i = 1; i < COUNT; ++i) {
>             if (!isAligned(segment, i * 8, 8)) throw new 
> IllegalArgumentException();
>         }
>     }
>
>     @Benchmark
>     public void isAlignedPlus1() {
>         for (long i = 0; i < COUNT - 1; ++i) {
>             if (!isAligned(segment, (i + 1) * 8, 8)) throw new 
> IllegalArgumentException();
>         }
>     }
>
> =>
>
> Alignment.isAligned       thrpt       35160425.491  ops/ns
> Alignment.isAlignedPlus1  thrpt              7.242  ops/ns
>
> So this seems to be the culprit. Is it an issue? Idk. Using a plain 
> offset multiplication instead of the overflow-protecting 
> MemoryLayout#scale actually seems to have a bigger impact on 
> performance than this.
>
> > hsdis
>
> I actually looked at hsdis and tried to use Jorn's precompiled dylib, 
> but it doesn't seem to load for me. Installing the whole toolchain and 
> building my own is probably beyond what I'm trying to do here (esp 
> since I'm not even sure I could read the assembly...)
>
> Matthias
>
>
> On Wed, Dec 18, 2024 at 3:13 PM Per-Ake Minborg 
> <per-ake.minborg at oracle.com> wrote:
>
>     Hi Matthias!
>
>     I've rewritten the benchmark slightly (just to make them
>     "normalized" the way we use to write them) even though your
>     benchmarks work equally well. See attachment. By using the commands
>
>     jvmArgsAppend = {
>
>            "-XX:+PrintCompilation",      
>      "-XX:+UnlockDiagnosticVMOptions",        "-XX:+PrintInlining" }
>
>     in a @Fork annotation, and observing the output, it appears all
>     the methods are inlined properly. So, even though some methods are
>     more complex, it appears they are treated in the same way when it
>     comes to inlining.
>
>      By looking at the actual assembly generated for the benchmarks
>     using these commands (for an M1 in my case):
>
>     @Fork(value = 1, jvmArgsAppend = {        "-Xbatch",      
>      "-XX:-TieredCompilation",      
>      "-XX:CompileCommand=dontinline,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*",
>          
>      "-XX:CompileCommand=PrintAssembly,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*"
>     })
>
>     I can see that the C2 compiler is able to unroll the segment
>     access in the findAligned method but not in the findAlignedNext
>     method. This is one reason findAligned is faster. In order to see
>     real assembly, the "hsdis-aarch64.dylib" must be present and it is
>     recommended to use a "fast-debug" version of the JDK. Read more on
>     Jorn Vernee's blog here:
>     https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html
>
>
>     The question then becomes why that is the case. This drives us
>     into another field of expertise where I am not the right person to
>     provide an answer. Generally, there is no guarantee as to how the
>     C2 compiler works and we are improving it continuously. Maybe
>     someone else can provide additional information.
>
>     Best, Per Minborg
>
>
>
>
>     ------------------------------------------------------------------------
>     *From:* panama-dev <panama-dev-retn at openjdk.org> on behalf of
>     Matthias Ernst <matthias at mernst.org>
>     *Sent:* Wednesday, December 18, 2024 9:26 AM
>     *To:* panama-dev at openjdk.org <panama-dev at openjdk.org>
>     *Subject:* performance: arrayElementVarHandle / calculated index /
>     aligned vs unaligned
>     Hi,
>
>     I'm trying to use the foreign memory api to interpret some
>     variable-length encoded data, where an offset vector encodes the
>     start offset of each stride. Accessing element `i` in this case
>     involves reading `offset[i+1]` in addition to `offset[i]`. The
>     offset vector is modeled as a `JAVA_LONG.arrayElementVarHandle()`.
>
>     Just out of curiosity about bounds and alignment checks I switched
>     the layout to JAVA_LONG_UNALIGNED for reading (data is still
>     aligned) and I saw a large difference in performance where I
>     didn't expect one, and it seems to boil down to the computed index
>     `endOffset[i+1]` access, not for the `[i]` case. My expectation
>     would have been that all variants exhibit the same
>     performance, since alignment checks would be moved out of the loop.
>
>     A micro-benchmark (attached) to demonstrate:
>     long-aligned memory segment, looping over the same elements in 6
>     different ways:
>     {aligned, unaligned} x {segment[i] , segment[i+1],  segment[i+1]
>     (w/ base offset) } gives very different results for aligned[i+1]
>     (but not for aligned[i]):
>
>     Benchmark                         Mode  Cnt    Score   Error  Units
>     Alignment.findAligned            thrpt       217.050          ops/s
>     Alignment.findAlignedPlusOne     thrpt       110.366        
>      ops/s. <= #####
>     Alignment.findAlignedNext    thrpt       110.377      ops/s. <= #####
>     Alignment.findUnaligned          thrpt       216.591          ops/s
>     Alignment.findUnalignedPlusOne   thrpt       215.843          ops/s
>     Alignment.findUnalignedNext  thrpt       216.483          ops/s
>
>     openjdk version "23.0.1" 2024-10-15
>     OpenJDK Runtime Environment (build 23.0.1+11-39)
>     OpenJDK 64-Bit Server VM (build 23.0.1+11-39, mixed mode, sharing)
>     Macbook Air M3
>
>     Needless to say that the difference was smaller with more app code
>     in play, but large enough to give me pause. Likely it
>     wouldn't matter at all but I want to have a better idea which
>     design choices to pay attention to. With the foreign memory api, I
>     find it a bit difficult to distinguish convenience from
>     performance-relevant options (e.g. using path expressions vs
>     computed offsets vs using a base offset. Besides "make layouts and
>     varhandles static final" what would be other rules of thumb?)
>
>     Thx
>     Matthias
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20241219/af800115/attachment-0001.htm>