performance: arrayElementVarHandle / calculated index / aligned vs unaligned

Matthias Ernst matthias at mernst.org
Sat Dec 21 14:17:53 UTC 2024


I see there's a new issue for this:
https://bugs.openjdk.org/browse/JDK-8346664.
Started working on a fix: https://github.com/openjdk/jdk/pull/22856


On Fri, Dec 20, 2024 at 1:56 PM Matthias Ernst <matthias at mernst.org> wrote:

> > alignment check for the +1 case somehow evades C2 optimization.
>
> I believe I may understand how this is happening (disclaimer: chemistry
> dog reading openjdk source, apologies if I'm missing something):
>
> A dedicated transformation to simplify (base + (offset << shift)) & mask
> type expressions was actually introduced for Panama in
> https://github.com/openjdk/jdk/pull/6697:
>
> https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L2128
> It requires the expression to be a variant of AND(ADD(..., SHIFT(offset,
> shift)), mask).
>
> This is what turns (offset + (i << 3)) & 7 into a loop invariant.
>
> However, before this pattern is checked, a "shift" node like ((i + 1) << 3)
> gets expanded into `(i << 3) + 8` here:
> https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L961
>
> Now the node contains nested ADDs and no longer matches the
> pattern: AND(ADD(..., ADD(SHIFT(offset, shift), constant)), mask).
>
> We can defeat the expansion by using a non-constant "1":
>
> class Alignment {
>   long one = 1;
>   handle.get(segment, (i + one) << 3) // magically faster than (i + 1)
> }
>
> For a fix, one could possibly make AndIL_shift_and_mask_is_always_zero
> recursively descend into the ADD tree.
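The identity such a recursive descent would rely on can be checked in plain Java. This is my own sketch, not JDK code: every addend under the AND is a multiple of 8, so the low three bits depend only on the base address, not on the loop variable.

```java
// Sketch (illustrative, not from the JDK): both the form C2's pattern
// currently recognizes and the expanded form of (i + 1) << 3 reduce to
// base & 7, i.e. the alignment check is loop-invariant either way.
public class AlignmentInvariant {
    public static void main(String[] args) {
        long base = 0x1000;
        for (long i = 0; i < 100; i++) {
            long matched  = (base + (i << 3)) & 7;        // AND(ADD(base, SHIFT(i, 3)), 7)
            long expanded = (base + ((i << 3) + 8)) & 7;  // nested ADD after expansion
            if (matched != (base & 7) || expanded != (base & 7)) {
                throw new AssertionError("alignment check is not loop-invariant");
            }
        }
        System.out.println("both forms reduce to base & 7");
    }
}
```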
>
>
> On Thu, Dec 19, 2024 at 2:41 PM Matthias Ernst <matthias at mernst.org>
> wrote:
>
>> Thanks a lot for rewriting/reproducing!
>>
>> In the meantime, I've tried to take some more complexity out:
>> * replaced arrayElementVarHandle (which uses MemoryLayout#scale with
>> exact math) with a "plain" version (`byteSize() * index`, at the risk of
>> silent overflows);
>> * eliminated the `VarHandle#collectCoordinates(h, 1, scale)` in
>> favor of a plain `varHandle.get(segment, i * byteSize())`, after verifying
>> they have identical performance.
>>
>> So we're down to a plain `VarHandle.get(segment, i * byteSize)` in four
>> combinations ({aligned, unaligned} x {i, i+1}), and the original
>> observation still holds: the combo "aligned read" with "i + 1" somehow
>> trips C2:
>> Alignment.aligned         avgt       0.151          ns/op
>> Alignment.alignedPlus1    avgt       0.298          ns/op
>> Alignment.unaligned       avgt       0.153          ns/op
>> Alignment.unalignedPlus1  avgt       0.153          ns/op
>>
>> This leaves us with the assumption that the alignment check for the +1
>> case somehow evades C2 optimization.
>> And indeed, we can remove all VarHandle business and only benchmark the
>> alignment check, and see that it fails to recognize that it is
>> loop-invariant:
>>
>>     // simplified copy of AbstractMemorySegmentImpl#isAlignedForElement
>>     private static boolean isAligned(MemorySegment segment, long offset,
>>                                      long byteAlignment) {
>>         return ((segment.address() + offset) & (byteAlignment - 1)) == 0;
>>     }
>>
>>     @Benchmark
>>     public void isAligned() {
>>         for (long i = 1; i < COUNT; ++i) {
>>             if (!isAligned(segment, i * 8, 8)) throw new IllegalArgumentException();
>>         }
>>     }
>>
>>     @Benchmark
>>     public void isAlignedPlus1() {
>>         for (long i = 0; i < COUNT - 1; ++i) {
>>             if (!isAligned(segment, (i + 1) * 8, 8)) throw new IllegalArgumentException();
>>         }
>>     }
>>
>> =>
>>
>> Alignment.isAligned       thrpt       35160425.491          ops/ns
>> Alignment.isAlignedPlus1  thrpt              7.242          ops/ns
>>
>> So this seems to be the culprit. Is it an issue? I don't know. Using a
>> plain offset multiplication instead of the overflow-protecting
>> MemoryLayout#scale actually seems to have a bigger impact on performance
>> than this.
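To make the trade-off above concrete, here is a hedged side-by-side of the two offset computations (my own illustration, not the benchmark code): `MemoryLayout#scale` uses exact math and throws `ArithmeticException` on overflow, while the plain multiplication silently wraps.

```java
import java.lang.foreign.ValueLayout;

// Compare the overflow-checked MemoryLayout#scale with a plain multiply.
public class OffsetMath {
    public static void main(String[] args) {
        long i = 3;
        long exact = ValueLayout.JAVA_LONG.scale(0, i);     // exact math: 24
        long plain = i * ValueLayout.JAVA_LONG.byteSize();  // unchecked: 24
        System.out.println(exact + " == " + plain);

        long huge = Long.MAX_VALUE / 4;
        long wrapped = huge * ValueLayout.JAVA_LONG.byteSize(); // wraps silently
        System.out.println("plain math wrapped to " + wrapped);
        try {
            ValueLayout.JAVA_LONG.scale(0, huge);               // exact math throws
        } catch (ArithmeticException e) {
            System.out.println("scale() caught the overflow");
        }
    }
}
```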
>>
>> > hsdis
>>
>> I actually looked at hsdis and tried to use Jorn's precompiled dylib, but
>> it doesn't seem to load for me. Installing the whole toolchain and building
>> my own is probably beyond what I'm trying to do here (especially since I'm
>> not even sure I could read the assembly...)
>>
>> Matthias
>>
>>
>> On Wed, Dec 18, 2024 at 3:13 PM Per-Ake Minborg <
>> per-ake.minborg at oracle.com> wrote:
>>
>>> Hi Matthias!
>>>
>>> I've rewritten the benchmarks slightly (just to "normalize" them
>>> the way we usually write them), even though your benchmarks work equally
>>> well. See attachment. By using the options
>>>
>>> jvmArgsAppend = {
>>>
>>>         "-XX:+PrintCompilation",
>>>         "-XX:+UnlockDiagnosticVMOptions",
>>>         "-XX:+PrintInlining" }
>>>
>>> in a @Fork annotation, and observing the output, it appears all the
>>> methods are inlined properly. So, even though some methods are more
>>> complex, it appears they are treated in the same way when it comes to
>>> inlining.
>>>
>>> By looking at the actual assembly generated for the benchmarks using
>>> these commands (for an M1 in my case):
>>>
>>> @Fork(value = 1, jvmArgsAppend = {
>>>         "-Xbatch",
>>>         "-XX:-TieredCompilation",
>>>         "-XX:CompileCommand=dontinline,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*",
>>>         "-XX:CompileCommand=PrintAssembly,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*"
>>> })
>>>
>>> I can see that the C2 compiler is able to unroll the segment access in
>>> the findAligned method but not in the findAlignedNext method. This is
>>> one reason findAligned is faster. In order to see real assembly, the
>>> "hsdis-aarch64.dylib" must be present and it is recommended to use a
>>> "fast-debug" version of the JDK. Read more on Jorn Vernee's blog here:
>>> https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html
>>>
>>>
>>> The question then becomes why that is the case. This drives us into
>>> another field of expertise where I am not the right person to provide an
>>> answer. Generally, there is no guarantee as to how the C2 compiler works
>>> and we are improving it continuously. Maybe someone else can provide
>>> additional information.
>>>
>>> Best, Per Minborg
>>>
>>>
>>>
>>>
>>> ------------------------------
>>> From: panama-dev <panama-dev-retn at openjdk.org> on behalf of Matthias
>>> Ernst <matthias at mernst.org>
>>> Sent: Wednesday, December 18, 2024 9:26 AM
>>> To: panama-dev at openjdk.org <panama-dev at openjdk.org>
>>> Subject: performance: arrayElementVarHandle / calculated index /
>>> aligned vs unaligned
>>>
>>> Hi,
>>>
>>> I'm trying to use the foreign memory api to interpret some
>>> variable-length encoded data, where an offset vector encodes the start
>>> offset of each stride. Accessing element `i` in this case involves reading
>>> `offset[i+1]` in addition to `offset[i]`. The offset vector is modeled as a
>>> `JAVA_LONG.arrayElementVarHandle()`.
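A hedged reconstruction of the access pattern described above (the names are illustrative, not from the original code): stride `i` spans `[offsets[i], offsets[i+1])`, so each element lookup does the `[i]` and `[i+1]` reads through the same var handle.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.VarHandle;

// Offset-vector access sketch: coordinates are (segment, base offset, index).
public class OffsetVector {
    static final VarHandle OFFSETS = ValueLayout.JAVA_LONG.arrayElementVarHandle();

    static long strideLength(MemorySegment offsets, long i) {
        long start = (long) OFFSETS.get(offsets, 0L, i);     // offsets[i]
        long end   = (long) OFFSETS.get(offsets, 0L, i + 1); // offsets[i+1] — the "+1" access at issue
        return end - start;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocateFrom(ValueLayout.JAVA_LONG, 0L, 5L, 12L);
            System.out.println(strideLength(seg, 0)); // 5
            System.out.println(strideLength(seg, 1)); // 7
        }
    }
}
```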
>>>
>>> Just out of curiosity about bounds and alignment checks I switched the
>>> layout to JAVA_LONG_UNALIGNED for reading (data is still aligned) and I saw
>>> a large difference in performance where I didn't expect one, and it seems
>>> to boil down to the computed index `endOffset[i+1]` access, not for the
>>> `[i]` case. My expectation would have been that all variants exhibit the
>>> same performance, since alignment checks would be moved out of the loop.
>>>
>>> A micro-benchmark (attached) demonstrates it: over a long-aligned memory
>>> segment, looping over the same elements in 6 different ways,
>>> {aligned, unaligned} x {segment[i], segment[i+1], segment[i+1] (w/
>>> base offset)}, gives very different results for aligned[i+1] (but not for
>>> aligned[i]):
>>>
>>> Benchmark                        Mode  Cnt    Score   Error  Units
>>> Alignment.findAligned           thrpt       217.050          ops/s
>>> Alignment.findAlignedPlusOne    thrpt       110.366          ops/s  <= #####
>>> Alignment.findAlignedNext       thrpt       110.377          ops/s  <= #####
>>> Alignment.findUnaligned         thrpt       216.591          ops/s
>>> Alignment.findUnalignedPlusOne  thrpt       215.843          ops/s
>>> Alignment.findUnalignedNext     thrpt       216.483          ops/s
>>>
>>> openjdk version "23.0.1" 2024-10-15
>>> OpenJDK Runtime Environment (build 23.0.1+11-39)
>>> OpenJDK 64-Bit Server VM (build 23.0.1+11-39, mixed mode, sharing)
>>> Macbook Air M3
>>>
>>> Needless to say, the difference was smaller with more app code in
>>> play, but large enough to give me pause. Likely it wouldn't matter at all,
>>> but I want to have a better idea of which design choices to pay attention
>>> to. With the foreign memory API, I find it a bit difficult to distinguish
>>> convenience from performance-relevant options (e.g. using path expressions
>>> vs. computed offsets vs. using a base offset). Besides "make layouts and
>>> varhandles static final", what would be other rules of thumb?
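A minimal sketch of that one known rule of thumb (names are illustrative): keeping the layout and var handle in static final fields lets the JIT treat them as constants and fold their internal checks, which is why non-constant handles tend to be slower.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.VarHandle;

// Layout and var handle as static final constants, resolved once per class.
public class Longs {
    private static final ValueLayout.OfLong ELEM = ValueLayout.JAVA_LONG;
    private static final VarHandle HANDLE = ELEM.arrayElementVarHandle();

    static long get(MemorySegment segment, long i) {
        return (long) HANDLE.get(segment, 0L, i); // coordinates: segment, base offset, index
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocateFrom(ELEM, 10L, 20L, 30L);
            System.out.println(get(seg, 2)); // 30
        }
    }
}
```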
>>>
>>> Thx
>>> Matthias
>>>
>>>

