Array addition and array sum Panama benchmarks
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Wed Mar 20 11:46:36 UTC 2024
Hi,
I did some more analysis of your benchmark. Most of the non-unrolled,
non-vectorized benchmarks perform similarly. Looking at the generated
assembly, it seems that bound checks are correctly hoisted outside the
loop. So the suggestion I made earlier is not going to make any
difference - in fact the scalarSegmentSegment version is already
competitive with scalarUnsafeUnsafe.
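
For reference, here is a minimal sketch of the loop shapes the earlier suggestion was about, assuming JDK 22 or later (FFM API finalized). The class and method names are hypothetical, and the exact casts used in UnrolledAccess are not reproduced here. It only contrasts a loop whose variable and exit test mix `int` and `long` with one that uses `long` throughout:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical names; not copied from AddBenchmark or UnrolledAccess.
final class SegmentLoopShapes {

    // Mixed types: an int loop variable tested against a long bound.
    static void addMixedTypes(MemorySegment a, MemorySegment b, MemorySegment out, long count) {
        for (int i = 0; i < count; i++) {
            double sum = a.getAtIndex(ValueLayout.JAVA_DOUBLE, i)
                       + b.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            out.setAtIndex(ValueLayout.JAVA_DOUBLE, i, sum);
        }
    }

    // Single type: the loop variable and the exit test both use long.
    static void addSingleType(MemorySegment a, MemorySegment b, MemorySegment out, long count) {
        for (long i = 0; i < count; i++) {
            double sum = a.getAtIndex(ValueLayout.JAVA_DOUBLE, i)
                       + b.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            out.setAtIndex(ValueLayout.JAVA_DOUBLE, i, sum);
        }
    }

    // Fills two off-heap inputs with i and 2*i, adds them, and copies the result to the heap.
    static double[] run(boolean singleType, int count) {
        try (Arena arena = Arena.ofConfined()) {
            long bytes = count * ValueLayout.JAVA_DOUBLE.byteSize();
            long align = ValueLayout.JAVA_DOUBLE.byteAlignment();
            MemorySegment a = arena.allocate(bytes, align);
            MemorySegment b = arena.allocate(bytes, align);
            MemorySegment out = arena.allocate(bytes, align);
            for (long i = 0; i < count; i++) {
                a.setAtIndex(ValueLayout.JAVA_DOUBLE, i, (double) i);
                b.setAtIndex(ValueLayout.JAVA_DOUBLE, i, 2.0 * i);
            }
            if (singleType) {
                addSingleType(a, b, out, count);
            } else {
                addMixedTypes(a, b, out, count);
            }
            return out.toArray(ValueLayout.JAVA_DOUBLE);
        }
    }

    public static void main(String[] args) {
        double[] mixed = run(false, 1024);
        double[] single = run(true, 1024);
        System.out.println(java.util.Arrays.equals(mixed, single));
    }
}
```

Both methods compute the same result; the sketch makes no claim about which shape C2 compiles better, which is something only the generated assembly can show.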
But there is indeed a big difference between scalarUnsafeArray and
everything else. The assembly indicates that this version is the only
one that uses vectorized instructions (e.g. addpd instead of addsd).
This probably indicates that /some/ optimization is failing (across the
board, except in that specific case), but I’m not a C2 expert, so I
can’t point to the underlying cause.
I’m adding Vlad and Roland, as they might have a better idea of what’s
going on.
To make things easier to test, I’ve put together a jdk branch which
includes a bunch of relevant benchmarks, including a subset of your
AddBenchmark:
https://github.com/openjdk/jdk/compare/master...mcimadamore:jdk:AddBenchmark?expand=1
I think a good place to start would be to explain the difference between
scalarUnsafeArray and scalarUnsafeUnsafe. The other benchmarks can be
looked at later, as (after looking at the assembly) I don’t think this
is an issue that is specific to FFM.
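
To make the comparison concrete, here is a rough sketch of the two loop bodies in question. It assumes (based only on the benchmark names and Antoine's description, "read with Unsafe, write with array") that one variant stores into a heap `double[]` and the other stores back off-heap through Unsafe. All names are hypothetical and this is not the benchmark source. Note that `sun.misc.Unsafe` is an unsupported API, and recent JDKs emit deprecation warnings for its memory-access methods:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical names; a sketch of the two access patterns, not the AddBenchmark code.
final class UnsafeLoopShapes {
    static final Unsafe U = loadUnsafe();

    static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // "UnsafeArray" shape: read both inputs off-heap, write into a heap array.
    static void addUnsafeToArray(long aAddr, long bAddr, double[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = U.getDouble(aAddr + 8L * i) + U.getDouble(bAddr + 8L * i);
        }
    }

    // "UnsafeUnsafe" shape: read both inputs off-heap, write off-heap as well.
    static void addUnsafeToUnsafe(long aAddr, long bAddr, long outAddr, int n) {
        for (int i = 0; i < n; i++) {
            U.putDouble(outAddr + 8L * i,
                        U.getDouble(aAddr + 8L * i) + U.getDouble(bAddr + 8L * i));
        }
    }

    // Runs both shapes on inputs i and 2*i and checks that they agree.
    static double[] demo(int n) {
        long bytes = 8L * n;
        long a = U.allocateMemory(bytes);
        long b = U.allocateMemory(bytes);
        long out = U.allocateMemory(bytes);
        try {
            for (int i = 0; i < n; i++) {
                U.putDouble(a + 8L * i, (double) i);
                U.putDouble(b + 8L * i, 2.0 * i);
            }
            double[] viaArray = new double[n];
            addUnsafeToArray(a, b, viaArray);
            addUnsafeToUnsafe(a, b, out, n);
            for (int i = 0; i < n; i++) {
                if (viaArray[i] != U.getDouble(out + 8L * i)) {
                    throw new IllegalStateException("mismatch at index " + i);
                }
            }
            return viaArray;
        } finally {
            U.freeMemory(a);
            U.freeMemory(b);
            U.freeMemory(out);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(1024).length);
    }
}
```

The only difference between the two methods is the store, which is why this pair looks like the smallest reproducer for whatever optimization is being missed.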
For the record, here are the results I get:
Benchmark                                    Mode  Cnt    Score   Error  Units
AddBenchmark.scalarArrayArray                avgt   30   94.475 ± 0.554  ns/op
AddBenchmark.scalarArrayArrayLongStride      avgt   30  481.030 ± 4.477  ns/op
AddBenchmark.scalarBufferArray               avgt   30  339.244 ± 4.546  ns/op
AddBenchmark.scalarBufferBuffer              avgt   30  329.813 ± 2.504  ns/op
AddBenchmark.scalarSegmentArray              avgt   30  376.254 ± 5.192  ns/op
AddBenchmark.scalarSegmentSegment            avgt   30  302.793 ± 4.767  ns/op
AddBenchmark.scalarSegmentSegmentLongStride  avgt   30  305.078 ± 4.252  ns/op
AddBenchmark.scalarUnsafeArray               avgt   30   95.765 ± 1.295  ns/op
AddBenchmark.scalarUnsafeUnsafe              avgt   30  358.060 ± 4.868  ns/op
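
These scores are average times (ns/op, lower is better), whereas the numbers in Antoine's message are throughputs (ops/s, higher is better). A small sketch of the conversion and of the slowdown ratio between two scores follows. The helper names are hypothetical, and since the two sets of numbers come from different machines, the converted values are only good for comparing shapes, not absolute figures:

```java
// Hypothetical helper for reading the two result sets side by side.
final class ScoreMath {

    // Converts an average time in ns/op to a throughput in ops/s.
    static double opsPerSecond(double nsPerOp) {
        return 1e9 / nsPerOp;
    }

    // How many times slower the first score is than the second (both in ns/op).
    static double slowdown(double slowNsPerOp, double fastNsPerOp) {
        return slowNsPerOp / fastNsPerOp;
    }

    public static void main(String[] args) {
        // scalarUnsafeArray expressed as throughput
        System.out.printf("%.0f ops/s%n", opsPerSecond(95.765));
        // scalarUnsafeUnsafe relative to scalarUnsafeArray
        System.out.printf("%.2fx slower%n", slowdown(358.060, 95.765));
    }
}
```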
Cheers
Maurizio
On 20/03/2024 10:26, Maurizio Cimadamore wrote:
> Hi Antoine,
> thanks for the benchmark. From the numbers you are getting in the
> AddBenchmark, my gut feeling is that, for memory segments, bound
> checks are not being hoisted outside the loop. That would cause the
> kind of degradation you are seeing here. I'm also surprised to see,
> for that benchmark, that Unsafe is more than 2x faster than plain
> arrays: after all, the size of the array is a loop invariant, and no
> check should occur there. Off the top of my head, I recall a similar
> issue with a benchmark in our repository [1] (you will probably recognize
> the shape there, as it's very similar to yours). In that case, to get
> to optimal performance, some extra casts to `long` needed to be added
> as C2 cannot yet optimize loops with that particular shape. Note that
> all the bound check analysis on memory segments is built on longs
> (unlike arrays and byte buffers) and we rely on C2 to optimize common
> cases where the accessed offset is clearly a "small long". In some cases
> this check doesn't work (yet), and some "manual help" is needed. From
> my notes on a conversation with Roland (who did most of the
> optimization work here):
>
>> The expectation is that the loop variable and the exit test operate
>> on a single type
> At the time, we had bigger fish to fry, but if this turns out to be
> the reason behind the numbers you are seeing, then it might be time to
> look again and try to fix this.
>
> Cheers
> Maurizio
>
> [1] -
> https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/lang/foreign/UnrolledAccess.java
>
> On 20/03/2024 09:00, Antoine Chambille wrote:
>> Hi everyone,
>>
>> I'm looking at two array arithmetic benchmarks with Panama.
>> https://github.com/chamb/panama-benchmarks
>>
>> AddBenchmark: benchmark the element-wise addition of two arrays of
>> numbers. We test over standard Java arrays and (off-heap) native
>> memory, via array access, Unsafe and MemorySegment. Using and not
>> using the vector API.
>>
>> SumBenchmark: Sum all the elements in an array of numbers. We
>> benchmark over standard Java arrays and (off-heap) native memory, via
>> array access, Unsafe and MemorySegment. Using and not using the
>> vector API.
>>
>> I'm building openjdk from the master at
>> https://github.com/openjdk/panama-vector
>> Windows laptop with Intel core i9-11950H.
>>
>> Impressive to perform SIMD on native memory in pure Java! And I hope
>> it's possible to optimize it further.
>>
>> AddBenchmark
>> .scalarArrayArray 4741341.171 ops/s
>> .scalarArrayArrayLongStride 973926.689 ops/s
>> .scalarSegmentArray 1809480.000 ops/s
>> .scalarSegmentSegment 1231606.029 ops/s
>> .scalarUnsafeArray 10972240.434 ops/s
>> .scalarUnsafeUnsafe 1246565.503 ops/s
>> .unrolledArrayArray 1236491.068 ops/s
>> .unrolledSegmentArray 1787171.351 ops/s
>> .unrolledUnsafeArray 5700087.751 ops/s
>> .unrolledUnsafeUnsafe 1236456.434 ops/s
>> .vectorArrayArray 7252565.080 ops/s
>> .vectorArraySegment 6938948.826 ops/s
>> .vectorSegmentArray 4953042.042 ops/s
>> .vectorSegmentSegment 4606278.152 ops/s
>>
>> Loops over arrays seem automatically optimized, but not when the loop
>> has a 'long' stride.
>> Reading from Segment seems to defeat loop optimisations and/or add
>> overhead. It gets worse when writing to Segment.
>> Manual unrolling makes things worse in all cases.
>> 'scalarUnsafeArray' (read with Unsafe, write with array) is twice as
>> fast as almost anything else.
>> The vector API is fast and consistent, but maybe not at its full
>> potential, and the use of Segment degrades performance.
>>
>>
>> SumBenchmark
>> .scalarArray 671030.727 ops/s
>> .scalarUnsafe 669296.228 ops/s
>> .unrolledArray 2600591.019 ops/s
>> .unrolledUnsafe 2448826.428 ops/s
>> .vectorArrayV1 7313657.874 ops/s
>> .vectorArrayV2 2239302.424 ops/s
>> .vectorSegmentV1 7470192.252 ops/s
>> .vectorSegmentV2 2183291.818 ops/s
>>
>> This is more in line. Manual unrolling seems to enable some level of
>> optimization, and then the vector API gives the best performance.
>>
>>
>> Best,
>> -Antoine