performance: arrayElementVarHandle / calculated index / aligned vs unaligned

Matthias Ernst matthias at mernst.org
Fri Dec 20 12:56:53 UTC 2024


> alignment check for the +1 case somehow evades C2 optimization.

I believe I may understand how this is happening (disclaimer: chemistry dog
reading openjdk source, apologies if I'm missing something):

A dedicated transformation to simplify (base + (offset << shift)) & mask type
expressions was actually introduced for Panama in
https://github.com/openjdk/jdk/pull/6697:
https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L2128
It requires the expression to be a variant of AND(ADD(..., SHIFT(offset,
shift)), mask).

This is what turns (offset + (i << 3)) & 7 into a loop-invariant.
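To illustrate why the transformation is sound (a standalone sketch of mine, not the HotSpot code): for JAVA_LONG the element offset is i << 3 and the alignment mask is 7, so the shifted term never contributes any of the masked bits, and the check collapses to a loop-invariant `base & 7`:

```java
public class LoopInvariantSketch {
    public static void main(String[] args) {
        long base = 4096; // hypothetical 8-byte-aligned segment address
        for (long i = 0; i < 1_000; i++) {
            // What the alignment check computes on each iteration:
            long check = (base + (i << 3)) & 7;
            // (i << 3) has zeros in bits 0..2, so the mask only sees the base.
            if (check != (base & 7)) throw new AssertionError("not invariant at i=" + i);
        }
        System.out.println("check is loop-invariant");
    }
}
```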

However, before this pattern is checked, a "shift" node like ((i + 1) << 3)
gets expanded into `(i << 3) + 8` here:
https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L961

Now the node contains nested ADDs and no longer matches the
pattern: AND(ADD(..., ADD(SHIFT(offset, shift), const)), mask).
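The rewrite itself is algebraically harmless; it only changes the node shape. A small self-contained check of the identity (my own sketch, not HotSpot code):

```java
public class ShiftExpansionSketch {
    public static void main(String[] args) {
        for (long i = -1_000; i <= 1_000; i++) {
            // C2 pulls the constant addend out of the shift:
            // (i + 1) << 3  ==  (i << 3) + 8 for every long i (even with overflow),
            // but it buries the SHIFT node one ADD deeper than the matcher looks.
            if (((i + 1) << 3) != ((i << 3) + 8)) throw new AssertionError("mismatch at i=" + i);
        }
        System.out.println("expansion identity holds");
    }
}
```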

We can defeat the expansion by using a non-constant "1":
class Alignment {
  long one = 1; // not a compile-time constant, so the shift is not expanded
  long read(long i) {
    return (long) handle.get(segment, (i + one) << 3); // magically faster than (i + 1)
  }
}

For a fix, one could possibly make AndIL_shift_and_mask_is_always_zero
recursively descend into the ADD tree.


On Thu, Dec 19, 2024 at 2:41 PM Matthias Ernst <matthias at mernst.org> wrote:

> Thanks a lot for rewriting/reproducing!
>
> In the meantime, I've tried to take some more complexity out:
> * Replaced arrayElementVarHandle (which uses MemoryLayout#scale with exact
> math) with a "plain" version (`byteSize() * index`, at the risk of silent
> overflows).
> * Also eliminated the "MethodHandles#collectCoordinates(h, 1, scale)" adapter
> in favor of a plain varHandle.get(segment, i * byteSize()), after verifying
> they have identical performance.
>
> So we're down to a plain "VarHandle.get(segment, i * byteSize)"  in four
> combinations: ({aligned, unaligned} x {i, i+1}), and the original
> observation still holds:
> the combo "aligned read" with "i + 1" somehow trips C2:
> Alignment.aligned         avgt       0.151          ns/op
> Alignment.alignedPlus1    avgt       0.298          ns/op
> Alignment.unaligned       avgt       0.153          ns/op
> Alignment.unalignedPlus1  avgt       0.153          ns/op
>
> This leaves us with the assumption that the alignment check for the +1
> case somehow evades C2 optimization.
> And indeed, we can remove all the VarHandle business and benchmark only the
> alignment check, and see that C2 fails to recognize that it is
> loop-invariant:
>
>     // simplified copy of AbstractMemorySegmentImpl#isAlignedForElement
>     private static boolean isAligned(MemorySegment segment, long offset, long byteAlignment) {
>         return ((segment.address() + offset) & (byteAlignment - 1)) == 0;
>     }
>
>     @Benchmark
>     public void isAligned() {
>         for (long i = 1; i < COUNT; ++i) {
>             if (!isAligned(segment, i * 8, 8)) throw new IllegalArgumentException();
>         }
>     }
>
>     @Benchmark
>     public void isAlignedPlus1() {
>         for (long i = 0; i < COUNT - 1; ++i) {
>             if (!isAligned(segment, (i + 1) * 8, 8)) throw new IllegalArgumentException();
>         }
>     }
>
> =>
>
> Alignment.isAligned       thrpt       35160425.491          ops/ns
> Alignment.isAlignedPlus1  thrpt              7.242          ops/ns
>
> So this seems to be the culprit. Is it an issue? Idk. Using a plain offset
> multiplication instead of the overflow-protecting MemoryLayout#scale
> actually seems to have a bigger impact on performance than this.
>
> > hsdis
>
> I actually looked at hsdis and tried to use Jorn's precompiled dylib, but
> it doesn't seem to load for me. Installing the whole toolchain and building
> my own is probably beyond what I'm trying to do here (esp since I'm not
> even sure I could read the assembly...)
>
> Matthias
>
>
> On Wed, Dec 18, 2024 at 3:13 PM Per-Ake Minborg <
> per-ake.minborg at oracle.com> wrote:
>
>> Hi Matthias!
>>
>> I've rewritten the benchmark slightly (just to "normalize" it the way we
>> usually write them), even though your benchmarks work equally well.
>> See attachment. By using the commands
>>
>> jvmArgsAppend = {
>>         "-XX:+PrintCompilation",
>>         "-XX:+UnlockDiagnosticVMOptions",
>>         "-XX:+PrintInlining" }
>>
>> in a @Fork annotation, and observing the output, it appears all the
>> methods are inlined properly. So, even though some methods are more
>> complex, it appears they are treated in the same way when it comes to
>> inlining.
>>
>> By looking at the actual assembly generated for the benchmarks using
>> these commands (for an M1 in my case):
>>
>> @Fork(value = 1, jvmArgsAppend = {
>>         "-Xbatch",
>>         "-XX:-TieredCompilation",
>>         "-XX:CompileCommand=dontinline,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*",
>>         "-XX:CompileCommand=PrintAssembly,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*"
>> })
>>
>> I can see that the C2 compiler is able to unroll the segment access in
>> the findAligned method but not in the findAlignedNext method. This is
>> one reason findAligned is faster. In order to see real assembly, the
>> "hsdis-aarch64.dylib" must be present and it is recommended to use a
>> "fast-debug" version of the JDK. Read more on Jorn Vernee's blog here:
>> https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html
>>
>>
>> The question then becomes why that is the case. This drives us into
>> another field of expertise where I am not the right person to provide an
>> answer. Generally, there is no guarantee as to how the C2 compiler works
>> and we are improving it continuously. Maybe someone else can provide
>> additional information.
>>
>> Best, Per Minborg
>>
>>
>>
>>
>> ------------------------------
>> *From:* panama-dev <panama-dev-retn at openjdk.org> on behalf of Matthias
>> Ernst <matthias at mernst.org>
>> *Sent:* Wednesday, December 18, 2024 9:26 AM
>> *To:* panama-dev at openjdk.org <panama-dev at openjdk.org>
>> *Subject:* performance: arrayElementVarHandle / calculated index /
>> aligned vs unaligned
>>
>> Hi,
>>
>> I'm trying to use the foreign memory api to interpret some
>> variable-length encoded data, where an offset vector encodes the start
>> offset of each stride. Accessing element `i` in this case involves reading
>> `offset[i+1]` in addition to `offset[i]`. The offset vector is modeled as a
>> `JAVA_LONG.arrayElementVarHandle()`.
>>
>> Just out of curiosity about bounds and alignment checks I switched the
>> layout to JAVA_LONG_UNALIGNED for reading (data is still aligned) and I saw
>> a large difference in performance where I didn't expect one, and it seems
>> to boil down to the computed index `endOffset[i+1]` access, not for the
>> `[i]` case. My expectation would have been that all variants exhibit the
>> same performance, since alignment checks would be moved out of the loop.
>>
>> A micro-benchmark (attached) to demonstrate:
>> long-aligned memory segment, looping over the same elements in 6
>> different ways:
>> {aligned, unaligned} x {segment[i] , segment[i+1],  segment[i+1] (w/ base
>> offset) } gives very different results for aligned[i+1] (but not for
>> aligned[i]):
>>
>> Benchmark                        Mode  Cnt    Score   Error  Units
>> Alignment.findAligned           thrpt        217.050          ops/s
>> Alignment.findAlignedPlusOne    thrpt        110.366          ops/s  <= #####
>> Alignment.findAlignedNext       thrpt        110.377          ops/s  <= #####
>> Alignment.findUnaligned         thrpt        216.591          ops/s
>> Alignment.findUnalignedPlusOne  thrpt        215.843          ops/s
>> Alignment.findUnalignedNext     thrpt        216.483          ops/s
>>
>> openjdk version "23.0.1" 2024-10-15
>> OpenJDK Runtime Environment (build 23.0.1+11-39)
>> OpenJDK 64-Bit Server VM (build 23.0.1+11-39, mixed mode, sharing)
>> Macbook Air M3
>>
>> Needless to say, the difference was smaller with more app code in
>> play, but large enough to give me pause. Likely it wouldn't matter at all,
>> but I want a better idea of which design choices to pay attention to.
>> With the foreign memory API, I find it a bit difficult to distinguish
>> convenience from performance-relevant options (e.g. using path expressions
>> vs computed offsets vs using a base offset). Besides "make layouts and
>> varhandles static final", what would be other rules of thumb?
>>
>> Thx
>> Matthias
>>
>>