Array addition and array sum Panama benchmarks

Wed Mar 20 11:46:36 UTC 2024

Hi,
I did some more analysis of your benchmark. Most of the non-unrolled, 
non-vectorized benchmarks perform similarly. Looking at the generated 
assembly, it seems that bound checks are correctly hoisted outside the 
loop. So the suggestion I made earlier is not going to make any 
difference - in fact the scalarSegmentSegment version is already 
competitive with scalarUnsafeUnsafe.

But there is indeed a big difference between scalarUnsafeArray and 
everything else. The assembly indicates that this version is the only 
one that uses vectorized instructions (e.g. addpd instead of addsd). 
This probably indicates that /some/ optimization is failing (across the 
board, except in that specific case), but I’m not a C2 expert, so I 
can’t point to the underlying cause.

I’m adding Vlad and Roland, as they might have a better idea of what’s 
going on.

To make things easier to test, I’ve put together a jdk branch which 
includes a bunch of relevant benchmarks, including a subset of your 
AddBenchmark:

https://github.com/openjdk/jdk/compare/master...mcimadamore:jdk:AddBenchmark?expand=1

I think a good place to start would be to explain the difference between 
scalarUnsafeArray and scalarUnsafeUnsafe. The other benchmarks can be 
looked at later, as (after looking at the assembly) I don’t think this 
is an issue that is specific to FFM.
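
For readers who have not opened the benchmark, here is a rough sketch of 
the two loop shapes being compared. It is a reconstruction from the 
description in this thread (scalarUnsafeArray reads with Unsafe and 
writes to an array), not the benchmark's actual source. The class name 
UnsafeAddShapes, the allocate helper and the use of double elements are 
assumptions made for illustration. Note also that the sun.misc.Unsafe 
memory-access methods are deprecated in recent JDKs.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical reconstruction of the two loop shapes; not the benchmark's code.
final class UnsafeAddShapes {
    static final Unsafe U = loadUnsafe();

    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Illustration-only helper: copies a Java array into freshly allocated native memory.
    static long allocate(double[] values) {
        long addr = U.allocateMemory((long) values.length * Double.BYTES);
        for (int i = 0; i < values.length; i++) {
            U.putDouble(addr + (long) i * Double.BYTES, values[i]);
        }
        return addr;
    }

    // Reads both inputs through Unsafe, stores the result into a Java array.
    static void scalarUnsafeArray(long aAddr, long bAddr, double[] out) {
        for (int i = 0; i < out.length; i++) {
            long off = (long) i * Double.BYTES;
            out[i] = U.getDouble(aAddr + off) + U.getDouble(bAddr + off);
        }
    }

    // Reads both inputs and stores the result through Unsafe.
    static void scalarUnsafeUnsafe(long aAddr, long bAddr, long outAddr, int n) {
        for (int i = 0; i < n; i++) {
            long off = (long) i * Double.BYTES;
            U.putDouble(outAddr + off, U.getDouble(aAddr + off) + U.getDouble(bAddr + off));
        }
    }
}
```

In this sketch the only difference between the two methods is where the 
result is stored, which is why the pair looks like a small test case for 
the C2 investigation.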

For the record, here are the results I get:

Benchmark                                    Mode  Cnt    Score   Error  Units
AddBenchmark.scalarArrayArray                avgt   30   94.475 ± 0.554  ns/op
AddBenchmark.scalarArrayArrayLongStride      avgt   30  481.030 ± 4.477  ns/op
AddBenchmark.scalarBufferArray               avgt   30  339.244 ± 4.546  ns/op
AddBenchmark.scalarBufferBuffer              avgt   30  329.813 ± 2.504  ns/op
AddBenchmark.scalarSegmentArray              avgt   30  376.254 ± 5.192  ns/op
AddBenchmark.scalarSegmentSegment            avgt   30  302.793 ± 4.767  ns/op
AddBenchmark.scalarSegmentSegmentLongStride  avgt   30  305.078 ± 4.252  ns/op
AddBenchmark.scalarUnsafeArray               avgt   30   95.765 ± 1.295  ns/op
AddBenchmark.scalarUnsafeUnsafe              avgt   30  358.060 ± 4.868  ns/op
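
To make the gap between scalarArrayArray and scalarArrayArrayLongStride 
in these results more concrete, here is a minimal sketch of what the two 
loops might look like. I am assuming that "long stride" means the loop 
variable is a long rather than an int. The class name ArrayAddShapes and 
the double element type are illustrative and not taken from the 
benchmark.

```java
// Hypothetical sketch of the two array loop shapes; not the benchmark's code.
final class ArrayAddShapes {

    // int loop variable: the usual counted-loop shape.
    static void scalarArrayArray(double[] a, double[] b, double[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    // long loop variable (assumed meaning of "long stride"), narrowed to int
    // for each array access.
    static void scalarArrayArrayLongStride(double[] a, double[] b, double[] out) {
        for (long i = 0; i < out.length; i++) {
            int idx = (int) i;
            out[idx] = a[idx] + b[idx];
        }
    }
}
```

Both methods compute the same result; only the type of the loop 
variable differs, so any timing difference between them would come from 
how the JIT handles that loop shape.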

Cheers
Maurizio

On 20/03/2024 10:26, Maurizio Cimadamore wrote:

> Hi Antoine,
> thanks for the benchmark. From the numbers you are getting in the 
> AddBenchmark, my gut feeling is that, for memory segments, bound 
> checks are not being hoisted outside the loop. That would cause the 
> kind of degradation you are seeing here. I'm also surprised to see, 
> for that benchmark, that Unsafe is more than 2x faster than using plain 
> arrays; after all, the size of the array is a loop invariant, and no 
> check should occur there. Off the top of my head, I recall a similar issue 
> with a benchmark in our repository [1] (you will probably recognize 
> the shape there, as it's very similar to yours). In that case, to get 
> to optimal performance, some extra casts to `long` needed to be added 
> as C2 cannot yet optimize loops with that particular shape. Note that 
> all the bound check analysis on memory segments is built on longs 
> (unlike arrays and byte buffers) and we rely on C2 to optimize common 
> cases where accessed offset is clearly a "small long". In some cases 
> this check doesn't work (yet), and some "manual help" is needed. From 
> my notes on a conversation with Roland (who did most of the 
> optimization work here):
>
>> The expectation is that the loop variable and the exit test operate 
>> on a single type
> At the time, we had bigger fish to fry, but if this turns out to be 
> the reason behind the numbers you are seeing, then it might be time to 
> look again and try to fix this.
>
> Cheers
> Maurizio
>
> [1] - 
> https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/lang/foreign/UnrolledAccess.java
>
> On 20/03/2024 09:00, Antoine Chambille wrote:
>> Hi everyone,
>>
>> I'm looking at two array arithmetic benchmarks with Panama.
>> https://github.com/chamb/panama-benchmarks
>>
>> AddBenchmark: benchmark the element-wise addition of two arrays of 
>> numbers. We test over standard Java arrays and (off-heap) native 
>> memory, via array access, Unsafe and MemorySegment. Using and not 
>> using the vector API.
>>
>> SumBenchmark: Sum all the elements in an array of numbers. We 
>> benchmark over standard Java arrays and (off-heap) native memory, via 
>> array access, Unsafe and MemorySegment. Using and not using the 
>> vector API.
>>
>> I'm building openjdk from the master at 
>> https://github.com/openjdk/panama-vector
>> Windows laptop with Intel core i9-11950H.
>>
>> Impressive to perform SIMD on native memory in pure Java! And I hope 
>> it's possible to optimize it further.
>>
>> AddBenchmark
>>  .scalarArrayArray             4741341.171 ops/s
>>  .scalarArrayArrayLongStride    973926.689 ops/s
>>  .scalarSegmentArray           1809480.000 ops/s
>>  .scalarSegmentSegment         1231606.029 ops/s
>>  .scalarUnsafeArray           10972240.434 ops/s
>>  .scalarUnsafeUnsafe           1246565.503 ops/s
>>  .unrolledArrayArray           1236491.068 ops/s
>>  .unrolledSegmentArray         1787171.351 ops/s
>>  .unrolledUnsafeArray          5700087.751 ops/s
>>  .unrolledUnsafeUnsafe         1236456.434 ops/s
>>  .vectorArrayArray             7252565.080 ops/s
>>  .vectorArraySegment           6938948.826 ops/s
>>  .vectorSegmentArray           4953042.042 ops/s
>>  .vectorSegmentSegment         4606278.152 ops/s
>>
>> Loops over arrays seem automatically optimized, but not when the loop 
>> has a 'long' stride.
>> Reading from Segment seems to defeat loop optimisations and/or add 
>> overhead. It gets worse when writing to Segment.
>> Manual unrolling makes things worse in all cases.
>> 'scalarUnsafeArray' (read with Unsafe, write with array) is about 
>> twice as fast as almost anything else.
>> The vector API is fast and consistent, but maybe not at its full 
>> potential, and the use of Segment degrades performance.
>>
>>
>> SumBenchmark
>>  .scalarArray                   671030.727 ops/s
>>  .scalarUnsafe                  669296.228 ops/s
>>  .unrolledArray                2600591.019 ops/s
>>  .unrolledUnsafe               2448826.428 ops/s
>>  .vectorArrayV1                7313657.874 ops/s
>>  .vectorArrayV2                2239302.424 ops/s
>>  .vectorSegmentV1              7470192.252 ops/s
>>  .vectorSegmentV2              2183291.818 ops/s
>>
>> This is more in line. Manual unrolling seems to enable some level of 
>> optimization, and then the vector API gives the best performance.
>>
>>
>> Best,
>> -Antoine
