performance: arrayElementVarHandle / calculated index / aligned vs unaligned
Matthias Ernst
matthias at mernst.org
Thu Jan 23 07:04:04 UTC 2025
Could someone help me move this fix (
https://github.com/openjdk/jdk/pull/22856) over the finish line? I don't
think we should leave this performance on the table for FFM var handles.
Thanks!
On Sat, Dec 21, 2024 at 3:17 PM Matthias Ernst <matthias at mernst.org> wrote:
> I see there's a new issue for this:
> https://bugs.openjdk.org/browse/JDK-8346664.
> Started working on a fix: https://github.com/openjdk/jdk/pull/22856
>
>
> On Fri, Dec 20, 2024 at 1:56 PM Matthias Ernst <matthias at mernst.org>
> wrote:
>
>> > alignment check for the +1 case somehow evades C2 optimization.
>>
>> I believe I may understand how this is happening (disclaimer: chemistry
>> dog reading openjdk source, apologies if I'm missing something):
>>
>> A dedicated transformation to simplify (base + offset << shift) & mask
>> type expressions was actually introduced for Panama in
>> https://github.com/openjdk/jdk/pull/6697:
>>
>> https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L2128
>> It requires the expression to be a variant of AND(ADD(..., SHIFT(offset,
>> shift)), mask).
>>
>> This is what turns (offset + i << 8) & 7 into a loop-invariant.
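>>
>> A quick standalone check of that invariance claim (my own demo, not part
>> of the original mail; it uses a shift of 3, i.e. scaling a long index to
>> a byte offset, and mask 7):

```java
// Demo: adding (i << 3), always a multiple of 8, cannot change the low
// three bits of an address, so (base + (i << 3)) & 7 == base & 7 for
// every i. This is exactly why the alignment check is loop-invariant.
public class AlignmentInvariantDemo {
    static long alignmentBits(long base, long i) {
        return (base + (i << 3)) & 7;
    }

    public static void main(String[] args) {
        long base = 0x1003L; // deliberately misaligned base address
        long expected = base & 7;
        for (long i = 0; i < 1_000; i++) {
            if (alignmentBits(base, i) != expected)
                throw new AssertionError("not invariant at i=" + i);
            if (alignmentBits(base, i + 1) != expected) // the "+1" variant
                throw new AssertionError("not invariant at i+1=" + (i + 1));
        }
        System.out.println("alignment bits are loop-invariant: " + expected);
    }
}
```

>> The check depends only on the base address, never on i, which is what C2
>> should (and in the plain case does) exploit.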
>>
>> However, before this pattern is checked, a "shift" node like ((i+1) << 8)
>> gets expanded into `(i << 8) + 8` here:
>> https://github.com/openjdk/jdk/blob/cf28fd4cbc6507eb69fcfeb33622316eb5b6b0c5/src/hotspot/share/opto/mulnode.cpp#L961
>>
>> Now the node contains nested ADDs and no longer matches the
>> pattern: AND(ADD(..., ADD(SHIFT(offset, shift), constant)), mask).
>>
>> We can defeat the expansion by using a non-constant "1":
>>
>>   class Alignment {
>>       long one = 1;
>>       // ...
>>       handle.get(segment, (i + one) << 8)  // magically faster than (i + 1)
>>   }
>>
>> For a fix, one could possibly make AndIL_shift_and_mask_is_always_zero
>> recursively descend into the ADD tree.
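>>
>> To illustrate the idea (a toy model in plain Java, not actual HotSpot
>> code; the node names and the maskedToZero helper are my own): a recursive
>> match can prove that every term of the expanded ADD tree vanishes under
>> the mask, so the AND stays provably zero even after the expansion:

```java
// Toy expression tree modeling the C2 nodes involved. A shift by enough
// bits zeroes all bits covered by the mask, and a constant that is a
// multiple of (mask + 1) does too; recursion handles nested ADDs.
public class AndMaskSketch {
    sealed interface Node permits Add, Shl, Con, Var {}
    record Add(Node l, Node r) implements Node {}
    record Shl(Node v, int bits) implements Node {}
    record Con(long value) implements Node {}
    record Var(String name) implements Node {}

    // True if (n & mask) is provably zero, descending through nested ADDs.
    static boolean maskedToZero(Node n, long mask) {
        return switch (n) {
            case Add a -> maskedToZero(a.l(), mask) && maskedToZero(a.r(), mask);
            case Shl s -> (mask >>> s.bits()) == 0; // low bits all zero
            case Con c -> (c.value() & mask) == 0;
            case Var v -> false;                    // unknown value
        };
    }

    public static void main(String[] args) {
        long mask = 7;
        // (i << 3): matches a single-level pattern
        Node simple = new Shl(new Var("i"), 3);
        // ((i + 1) << 3) after expansion: (i << 3) + 8
        Node expanded = new Add(new Shl(new Var("i"), 3), new Con(8));
        System.out.println(maskedToZero(simple, mask));   // true
        System.out.println(maskedToZero(expanded, mask)); // true, via recursion
    }
}
```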
>>
>>
>> On Thu, Dec 19, 2024 at 2:41 PM Matthias Ernst <matthias at mernst.org>
>> wrote:
>>
>>> Thanks a lot for rewriting/reproducing!
>>>
>>> In the meantime, I've tried to take some more complexity out:
>>> * I replaced arrayElementVarHandle (which uses MemoryLayout#scale with
>>> exact math) with a "plain" version (`byteSize() * index`, at the risk of
>>> silent overflows).
>>> * I also eliminated the "VarHandle#collectCoordinates(h, 1, scale)" in
>>> favor of a plain varHandle.get(segment, i * byteSize()), after verifying
>>> they have identical performance.
>>>
>>> So we're down to a plain "VarHandle.get(segment, i * byteSize)" in four
>>> combinations: ({aligned, unaligned} x {i, i+1}), and the original
>>> observation still holds:
>>> the combo "aligned read" with "i + 1" somehow trips C2:
>>> Alignment.aligned         avgt  0.151 ns/op
>>> Alignment.alignedPlus1    avgt  0.298 ns/op
>>> Alignment.unaligned       avgt  0.153 ns/op
>>> Alignment.unalignedPlus1  avgt  0.153 ns/op
>>>
>>> This leaves us with the assumption that the alignment check for the +1
>>> case somehow evades C2 optimization.
>>> And indeed, we can remove all VarHandle business and only benchmark the
>>> alignment check, and see that it fails to recognize that it is
>>> loop-invariant:
>>>
>>> // simplified copy of AbstractMemorySegmentImpl#isAlignedForElement
>>> private static boolean isAligned(MemorySegment segment, long offset,
>>>                                  long byteAlignment) {
>>>     return ((segment.address() + offset) & (byteAlignment - 1)) == 0;
>>> }
>>>
>>> @Benchmark
>>> public void isAligned() {
>>>     for (long i = 1; i < COUNT; ++i) {
>>>         if (!isAligned(segment, i * 8, 8)) throw new IllegalArgumentException();
>>>     }
>>> }
>>>
>>> @Benchmark
>>> public void isAlignedPlus1() {
>>>     for (long i = 0; i < COUNT - 1; ++i) {
>>>         if (!isAligned(segment, (i + 1) * 8, 8)) throw new IllegalArgumentException();
>>>     }
>>> }
>>>
>>> =>
>>>
>>> Alignment.isAligned       thrpt  35160425.491 ops/ns
>>> Alignment.isAlignedPlus1  thrpt         7.242 ops/ns
>>>
>>> So this seems to be the culprit. Is it an issue? Idk. Using a plain
>>> offset multiplication instead of the overflow-protecting MemoryLayout#scale
>>> actually seems to have a bigger impact on performance than this.
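>>>
>>> For reference, a small sketch of the two offset computations compared
>>> above (assumes the finalized FFM API in Java 22+; the class name is
>>> mine):

```java
import java.lang.foreign.ValueLayout;

// MemoryLayout#scale vs a plain multiplication. Both yield the same byte
// offset in the common case; scale() uses exact arithmetic and throws
// ArithmeticException on overflow, while the plain multiply wraps silently.
public class ScaleVsMultiply {
    public static void main(String[] args) {
        long i = 3;
        long viaScale = ValueLayout.JAVA_LONG.scale(0, i);       // exact math
        long viaMultiply = i * ValueLayout.JAVA_LONG.byteSize(); // may wrap
        System.out.println(viaScale + " == " + viaMultiply);

        try {
            ValueLayout.JAVA_LONG.scale(0, Long.MAX_VALUE); // overflows
            throw new AssertionError("expected overflow");
        } catch (ArithmeticException expected) {
            // a plain multiply would have wrapped to a bogus offset instead
        }
    }
}
```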
>>>
>>> > hsdis
>>>
>>> I actually looked at hsdis and tried to use Jorn's precompiled dylib,
>>> but it doesn't seem to load for me. Installing the whole toolchain and
>>> building my own is probably beyond what I'm trying to do here (especially
>>> since I'm not even sure I could read the assembly...)
>>>
>>> Matthias
>>>
>>>
>>> On Wed, Dec 18, 2024 at 3:13 PM Per-Ake Minborg <
>>> per-ake.minborg at oracle.com> wrote:
>>>
>>>> Hi Matthias!
>>>>
>>>> I've rewritten the benchmark slightly (just to make them "normalized"
>>>> the way we usually write them), even though your benchmarks work equally
>>>> well. See attachment. By using the options
>>>>
>>>> jvmArgsAppend = {
>>>>     "-XX:+PrintCompilation",
>>>>     "-XX:+UnlockDiagnosticVMOptions",
>>>>     "-XX:+PrintInlining"
>>>> }
>>>>
>>>> in a @Fork annotation, and observing the output, it appears all the
>>>> methods are inlined properly. So, even though some methods are more
>>>> complex, it appears they are treated in the same way when it comes to
>>>> inlining.
>>>>
>>>> By looking at the actual assembly generated for the benchmarks using
>>>> these commands (for an M1 in my case):
>>>>
>>>> @Fork(value = 1, jvmArgsAppend = {
>>>>     "-Xbatch",
>>>>     "-XX:-TieredCompilation",
>>>>     "-XX:CompileCommand=dontinline,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*",
>>>>     "-XX:CompileCommand=PrintAssembly,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*"
>>>> })
>>>>
>>>> I can see that the C2 compiler is able to unroll the segment access in
>>>> the findAligned method but not in the findAlignedNext method. This is
>>>> one reason findAligned is faster. In order to see real assembly, the
>>>> "hsdis-aarch64.dylib" must be present and it is recommended to use a
>>>> "fast-debug" version of the JDK. Read more on Jorn Vernee's blog here:
>>>> https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html
>>>>
>>>>
>>>> The question then becomes why that is the case. This drives us into
>>>> another field of expertise where I am not the right person to provide an
>>>> answer. Generally, there is no guarantee as to how the C2 compiler works
>>>> and we are improving it continuously. Maybe someone else can provide
>>>> additional information.
>>>>
>>>> Best, Per Minborg
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------
>>>> *From:* panama-dev <panama-dev-retn at openjdk.org> on behalf of Matthias
>>>> Ernst <matthias at mernst.org>
>>>> *Sent:* Wednesday, December 18, 2024 9:26 AM
>>>> *To:* panama-dev at openjdk.org <panama-dev at openjdk.org>
>>>> *Subject:* performance: arrayElementVarHandle / calculated index /
>>>> aligned vs unaligned
>>>>
>>>> Hi,
>>>>
>>>> I'm trying to use the foreign memory api to interpret some
>>>> variable-length encoded data, where an offset vector encodes the start
>>>> offset of each stride. Accessing element `i` in this case involves reading
>>>> `offset[i+1]` in addition to `offset[i]`. The offset vector is modeled as a
>>>> `JAVA_LONG.arrayElementVarHandle()`.
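>>>>
>>>> A minimal sketch of that access pattern (names and values are
>>>> illustrative, not from my real code; it uses MemorySegment#getAtIndex
>>>> instead of the var handle to stay self-contained):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_LONG;

// Sketch (assumes Java 22+ FFM): an offset vector where element i's
// stride spans [offsets[i], offsets[i+1]). Reading element i therefore
// touches both offsets[i] and offsets[i + 1].
public class StrideOffsets {
    static long[] stride(MemorySegment offsets, long i) {
        return new long[] {
            offsets.getAtIndex(JAVA_LONG, i),     // offsets[i]
            offsets.getAtIndex(JAVA_LONG, i + 1)  // offsets[i + 1]
        };
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // offsets for 3 variable-length strides: [0, 5), [5, 12), [12, 20)
            MemorySegment offsets = arena.allocateFrom(JAVA_LONG, 0L, 5L, 12L, 20L);
            long[] s = stride(offsets, 1);
            System.out.println("stride 1: [" + s[0] + ", " + s[1] + ")");
        }
    }
}
```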
>>>>
>>>> Just out of curiosity about bounds and alignment checks, I switched the
>>>> layout to JAVA_LONG_UNALIGNED for reading (the data is still aligned) and
>>>> saw a large difference in performance where I didn't expect one. It seems
>>>> to boil down to the computed-index `offset[i+1]` access; the `offset[i]`
>>>> case is unaffected. My expectation would have been that all variants
>>>> exhibit the same performance, since alignment checks would be moved out
>>>> of the loop.
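>>>>
>>>> One way to hoist the check by hand (my own workaround sketch, not
>>>> something measured here): validate the base address once, then read
>>>> through the unaligned layout so no per-access alignment check remains:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_LONG;
import static java.lang.foreign.ValueLayout.JAVA_LONG_UNALIGNED;

// Sketch (assumes Java 22+ FFM): check alignment of the base address once
// up front; JAVA_LONG_UNALIGNED has byteAlignment 1, so the per-access
// alignment check inside the loop is trivially satisfied.
public class HoistedAlignmentCheck {
    static long sum(MemorySegment seg, long count) {
        if ((seg.address() & (JAVA_LONG.byteAlignment() - 1)) != 0)
            throw new IllegalArgumentException("segment not 8-byte aligned");
        long sum = 0;
        for (long i = 0; i < count - 1; i++) {
            sum += seg.get(JAVA_LONG_UNALIGNED, (i + 1) * 8); // the i+1 access
        }
        return sum;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocateFrom(JAVA_LONG, 1L, 2L, 3L, 4L);
            System.out.println(sum(seg, 4)); // 2 + 3 + 4 = 9
        }
    }
}
```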
>>>>
>>>> A micro-benchmark (attached) demonstrates this: over a long-aligned
>>>> memory segment, looping over the same elements in 6 different ways,
>>>> {aligned, unaligned} x {segment[i], segment[i+1], segment[i+1] (w/
>>>> base offset)}, gives very different results for aligned[i+1] (but not
>>>> for aligned[i]):
>>>>
>>>> Benchmark                        Mode  Cnt    Score  Error  Units
>>>> Alignment.findAligned           thrpt       217.050         ops/s
>>>> Alignment.findAlignedPlusOne    thrpt       110.366         ops/s  <= #####
>>>> Alignment.findAlignedNext       thrpt       110.377         ops/s  <= #####
>>>> Alignment.findUnaligned         thrpt       216.591         ops/s
>>>> Alignment.findUnalignedPlusOne  thrpt       215.843         ops/s
>>>> Alignment.findUnalignedNext     thrpt       216.483         ops/s
>>>>
>>>> openjdk version "23.0.1" 2024-10-15
>>>> OpenJDK Runtime Environment (build 23.0.1+11-39)
>>>> OpenJDK 64-Bit Server VM (build 23.0.1+11-39, mixed mode, sharing)
>>>> Macbook Air M3
>>>>
>>>> Needless to say, the difference was smaller with more app code in
>>>> play, but large enough to give me pause. Likely it wouldn't matter at
>>>> all, but I want to have a better idea of which design choices to pay
>>>> attention to. With the foreign memory api, I find it a bit difficult to
>>>> distinguish convenience from performance-relevant options (e.g. using
>>>> path expressions vs computed offsets vs using a base offset). Besides
>>>> "make layouts and varhandles static final", what would be other rules
>>>> of thumb?
>>>>
>>>> Thx
>>>> Matthias
>>>>
>>>>