performance: arrayElementVarHandle / calculated index / aligned vs unaligned
Matthias Ernst
matthias at mernst.org
Thu Dec 19 13:56:31 UTC 2024
FWIW, when I expand the alignment check like this, C2 starts optimizing it
out of the loop for i+1 as well:
private static boolean isAligned(MemorySegment segment, long offset,
                                 long byteAlignment) {
    long mask = byteAlignment - 1;
    return (((segment.address() & mask) == 0) && ((offset & mask) == 0))
            || (((segment.address() + offset) & mask) == 0);
}
Benchmark                 Mode  Score   Units
Alignment.isAligned       avgt  ≈ 10⁻⁸  ns/op
Alignment.isAlignedPlus1  avgt  ≈ 10⁻⁸  ns/op
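As a sanity check that the expanded predicate is semantically the same as the single-addition check (for a power-of-two alignment, base and offset each being aligned implies their sum is aligned, so the extra clause changes nothing), here is a standalone version with plain longs in place of segment.address(); the class name is made up:

```java
import java.util.Random;

public class AlignmentEquivalence {
    // the original single-addition check
    static boolean original(long base, long offset, long byteAlignment) {
        return ((base + offset) & (byteAlignment - 1)) == 0;
    }

    // the expanded check: the first clause is loop-invariant in base
    static boolean expanded(long base, long offset, long byteAlignment) {
        long mask = byteAlignment - 1;
        return ((base & mask) == 0 && (offset & mask) == 0)
                || (((base + offset) & mask) == 0);
    }

    public static void main(String[] args) {
        Random r = new Random(42);
        for (int i = 0; i < 1_000_000; i++) {
            long base = r.nextLong(), offset = r.nextLong();
            for (long a = 1; a <= 64; a <<= 1) { // power-of-two alignments
                if (original(base, offset, a) != expanded(base, offset, a))
                    throw new AssertionError("mismatch");
            }
        }
        System.out.println("equivalent");
    }
}
```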
On Thu, Dec 19, 2024 at 2:41 PM Matthias Ernst <matthias at mernst.org> wrote:
> Thanks a lot for rewriting/reproducing!
>
> I've in the meantime tried to take some more complexity out:
> * replaced arrayElementVarHandle (which uses MemoryLayout#scale with exact
> math) with a "plain" version (`byteSize() * index`, at the risk of silent
> overflows).
> * I also eliminated the "VarHandle#collectCoordinates(h,1, scale)" in
> favor of a plain varHandle.get(segment, i * byteSize()), after verifying
> they have identical performance.
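For reference, the two spellings can be sketched like this. A minimal sketch assuming JDK 22+ var handle coordinates (a layout var handle takes a (MemorySegment, long byteOffset) pair, and collectCoordinates replaces the offset coordinate with an index); class and helper names are invented:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.VarHandle;

public class PlainVsCollected {
    public static void main(String[] args) throws Throwable {
        // plain handle: (MemorySegment, long byteOffset) -> long
        VarHandle plain = ValueLayout.JAVA_LONG.varHandle();
        // adapt coordinate 1 from byteOffset to index via a scale filter:
        var scale = MethodHandles.lookup().findStatic(PlainVsCollected.class,
                "scale", MethodType.methodType(long.class, long.class));
        VarHandle collected = MethodHandles.collectCoordinates(plain, 1, scale);

        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(ValueLayout.JAVA_LONG, 4);
            for (long i = 0; i < 4; i++) plain.set(seg, i * 8, i * 10);
            // the two spellings read the same element
            long a = (long) plain.get(seg, 2 * ValueLayout.JAVA_LONG.byteSize());
            long b = (long) collected.get(seg, 2L);
            System.out.println(a + " " + b);
        }
    }

    static long scale(long index) { return index * Long.BYTES; }
}
```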
>
> So we're down to a plain "VarHandle.get(segment, i * byteSize)" in four
> combinations: ({aligned, unaligned} x {i, i+1}), and the original
> observation still holds:
> the combo "aligned read" with "i + 1" somehow trips C2:
> Alignment.aligned avgt 0.151 ns/op
> Alignment.alignedPlus1 avgt 0.298 ns/op
> Alignment.unaligned avgt 0.153 ns/op
> Alignment.unalignedPlus1 avgt 0.153 ns/op
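The four combinations boil down to which layout constant the var handle is derived from. A minimal sketch (JDK 22+ API, invented class name) of the aligned vs. unaligned handles doing the same [i + 1] read:

```java
import java.lang.foreign.*;
import java.lang.invoke.VarHandle;

public class AlignedVsUnaligned {
    // JAVA_LONG checks 8-byte alignment on every access;
    // JAVA_LONG_UNALIGNED skips the alignment check.
    static final VarHandle ALIGNED   = ValueLayout.JAVA_LONG.varHandle();
    static final VarHandle UNALIGNED = ValueLayout.JAVA_LONG_UNALIGNED.varHandle();

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(ValueLayout.JAVA_LONG, 8);
            for (long i = 0; i < 8; i++) ALIGNED.set(seg, i * 8, i);
            long sumA = 0, sumU = 0;
            for (long i = 0; i < 7; i++) {
                sumA += (long) ALIGNED.get(seg, (i + 1) * 8);   // the slow combo
                sumU += (long) UNALIGNED.get(seg, (i + 1) * 8); // same result, no check
            }
            System.out.println(sumA + " " + sumU);
        }
    }
}
```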
>
> This leaves us with the assumption that the alignment check for the +1
> case somehow evades C2 optimization.
> And indeed, we can remove all VarHandle business and only benchmark the
> alignment check, and see that it fails to recognize that it is
> loop-invariant:
>
> // simplified copy of AbstractMemorySegmentImpl#isAlignedForElement
> private static boolean isAligned(MemorySegment segment, long offset,
>                                  long byteAlignment) {
>     return ((segment.address() + offset) & (byteAlignment - 1)) == 0;
> }
>
> @Benchmark
> public void isAligned() {
>     for (long i = 1; i < COUNT; ++i) {
>         if (!isAligned(segment, i * 8, 8)) throw new IllegalArgumentException();
>     }
> }
>
> @Benchmark
> public void isAlignedPlus1() {
>     for (long i = 0; i < COUNT - 1; ++i) {
>         if (!isAligned(segment, (i + 1) * 8, 8)) throw new IllegalArgumentException();
>     }
> }
>
> =>
>
> Alignment.isAligned thrpt 35160425.491 ops/ns
> Alignment.isAlignedPlus1 thrpt 7.242 ops/ns
>
> So this seems to be the culprit. Is it an issue? Idk. Using a plain offset
> multiplication instead of the overflow-protecting MemoryLayout#scale
> actually seems to have a bigger impact on performance than this.
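To illustrate that trade-off: MemoryLayout#scale is the overflow-checked form of the plain multiplication. A small sketch, assuming JDK 22+ where scale(offset, index) throws ArithmeticException on overflow (class name invented):

```java
import java.lang.foreign.ValueLayout;

public class ScaleVsMultiply {
    public static void main(String[] args) {
        long i = 3;
        // overflow-checked: offset + byteSize() * index with exact math
        long checked = ValueLayout.JAVA_LONG.scale(0, i);
        // plain multiplication: silently wraps on overflow
        long plain = i * ValueLayout.JAVA_LONG.byteSize();
        System.out.println(checked + " " + plain);
        try {
            ValueLayout.JAVA_LONG.scale(0, Long.MAX_VALUE); // 8 * MAX overflows
        } catch (ArithmeticException e) {
            System.out.println("overflow caught");
        }
    }
}
```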
>
> > hsdis
>
> I actually looked at hsdis and tried to use Jorn's precompiled dylib, but
> it doesn't seem to load for me. Installing the whole toolchain and building
> my own is probably beyond what I'm trying to do here (esp since I'm not
> even sure I could read the assembly...)
>
> Matthias
>
>
> On Wed, Dec 18, 2024 at 3:13 PM Per-Ake Minborg <
> per-ake.minborg at oracle.com> wrote:
>
>> Hi Matthias!
>>
>> I've rewritten the benchmark slightly (just to "normalize" it to the
>> way we usually write them), even though your benchmarks work equally well.
>> See attachment. By using the commands
>>
>> jvmArgsAppend = {
>>     "-XX:+PrintCompilation",
>>     "-XX:+UnlockDiagnosticVMOptions",
>>     "-XX:+PrintInlining" }
>>
>> in a @Fork annotation, and observing the output, it appears all the
>> methods are inlined properly. So, even though some methods are more
>> complex, it appears they are treated in the same way when it comes to
>> inlining.
>>
>> By looking at the actual assembly generated for the benchmarks using
>> these commands (for an M1 in my case):
>>
>> @Fork(value = 1, jvmArgsAppend = {
>>     "-Xbatch",
>>     "-XX:-TieredCompilation",
>>     "-XX:CompileCommand=dontinline,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*",
>>     "-XX:CompileCommand=PrintAssembly,org.openjdk.bench.java.lang.foreign.Alignment::findAligned*"
>> })
>>
>> I can see that the C2 compiler is able to unroll the segment access in
>> the findAligned method but not in the findAlignedNext method. This is
>> one reason findAligned is faster. In order to see real assembly, the
>> "hsdis-aarch64.dylib" must be present and it is recommended to use a
>> "fast-debug" version of the JDK. Read more on Jorn Vernee's blog here:
>> https://jornvernee.github.io/hsdis/2022/04/30/hsdis.html
>>
>>
>> The question then becomes why that is the case. This drives us into
>> another field of expertise where I am not the right person to provide an
>> answer. Generally, there is no guarantee as to how the C2 compiler works
>> and we are improving it continuously. Maybe someone else can provide
>> additional information.
>>
>> Best, Per Minborg
>>
>>
>>
>>
>> ------------------------------
>> *From:* panama-dev <panama-dev-retn at openjdk.org> on behalf of Matthias
>> Ernst <matthias at mernst.org>
>> *Sent:* Wednesday, December 18, 2024 9:26 AM
>> *To:* panama-dev at openjdk.org <panama-dev at openjdk.org>
>> *Subject:* performance: arrayElementVarHandle / calculated index /
>> aligned vs unaligned
>>
>> Hi,
>>
>> I'm trying to use the foreign memory api to interpret some
>> variable-length encoded data, where an offset vector encodes the start
>> offset of each stride. Accessing element `i` in this case involves reading
>> `offset[i+1]` in addition to `offset[i]`. The offset vector is modeled as a
>> `JAVA_LONG.arrayElementVarHandle()`.
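A minimal sketch of the access pattern described above, assuming JDK 22+ coordinates for arrayElementVarHandle, i.e. (MemorySegment, long baseOffset, long index); the stride helper and all names are invented for illustration:

```java
import java.lang.foreign.*;
import java.lang.invoke.VarHandle;

public class Strides {
    static final VarHandle OFFSETS = ValueLayout.JAVA_LONG.arrayElementVarHandle();

    // Returns the i-th record as the slice [offset[i], offset[i+1]) of data.
    static MemorySegment stride(MemorySegment offsets, MemorySegment data, long i) {
        long start = (long) OFFSETS.get(offsets, 0L, i);
        long end   = (long) OFFSETS.get(offsets, 0L, i + 1); // the [i+1] access
        return data.asSlice(start, end - start);
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            // offset vector for three variable-length records of 2, 5 and 1 bytes
            MemorySegment offsets =
                    arena.allocateFrom(ValueLayout.JAVA_LONG, 0L, 2L, 7L, 8L);
            MemorySegment data = arena.allocate(8);
            System.out.println(stride(offsets, data, 1).byteSize());
        }
    }
}
```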
>>
>> Just out of curiosity about bounds and alignment checks I switched the
>> layout to JAVA_LONG_UNALIGNED for reading (data is still aligned) and I saw
>> a large difference in performance where I didn't expect one, and it seems
>> to boil down to the computed index `endOffset[i+1]` access, not for the
>> `[i]` case. My expectation would have been that all variants exhibit the
>> same performance, since alignment checks would be moved out of the loop.
>>
>> A micro-benchmark (attached) to demonstrate:
>> long-aligned memory segment, looping over the same elements in 6
>> different ways:
>> {aligned, unaligned} x {segment[i] , segment[i+1], segment[i+1] (w/ base
>> offset) } gives very different results for aligned[i+1] (but not for
>> aligned[i]):
>>
>> Benchmark                       Mode   Score    Units
>> Alignment.findAligned           thrpt  217.050  ops/s
>> Alignment.findAlignedPlusOne    thrpt  110.366  ops/s  <= #####
>> Alignment.findAlignedNext       thrpt  110.377  ops/s  <= #####
>> Alignment.findUnaligned         thrpt  216.591  ops/s
>> Alignment.findUnalignedPlusOne  thrpt  215.843  ops/s
>> Alignment.findUnalignedNext     thrpt  216.483  ops/s
>>
>> openjdk version "23.0.1" 2024-10-15
>> OpenJDK Runtime Environment (build 23.0.1+11-39)
>> OpenJDK 64-Bit Server VM (build 23.0.1+11-39, mixed mode, sharing)
>> Macbook Air M3
>>
>> Needless to say, the difference was smaller with more app code in play,
>> but large enough to give me pause. Likely it wouldn't matter at all, but
>> I want to have a better idea of which design choices to pay attention
>> to. With the foreign memory API, I find it a bit difficult to
>> distinguish convenience from performance-relevant options (e.g. using
>> path expressions vs. computed offsets vs. a base offset). Besides "make
>> layouts and varhandles static final", what would be other rules of thumb?
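As one illustration of the convenience-vs-performance question: a path-expression var handle held in a static final, next to the equivalent computed-offset access. The struct layout and class name are hypothetical, assuming the JDK 22+ API:

```java
import java.lang.foreign.*;
import java.lang.invoke.VarHandle;

public class PathVsOffset {
    // struct { int x; int y; } -- a made-up layout for illustration
    static final StructLayout POINT = MemoryLayout.structLayout(
            ValueLayout.JAVA_INT.withName("x"),
            ValueLayout.JAVA_INT.withName("y"));

    // path expression, resolved once and held in a static final var handle
    static final VarHandle Y =
            POINT.varHandle(MemoryLayout.PathElement.groupElement("y"));

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment p = arena.allocate(POINT);
            Y.set(p, 0L, 7);
            // computed offset: same element via a manual offset of 4 bytes
            int y = p.get(ValueLayout.JAVA_INT, 4);
            System.out.println(y);
        }
    }
}
```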
>>
>> Thx
>> Matthias
>>
>>