Array addition and array sum Panama benchmarks

Wed Mar 20 10:26:19 UTC 2024

Hi Antoine,
thanks for the benchmark. From the numbers you are getting in 
AddBenchmark, my gut feeling is that, for memory segments, bounds checks 
are not being hoisted outside the loop. That would cause the kind of 
degradation you are seeing here. I'm also surprised to see, for that 
benchmark, that Unsafe is > 2x faster than plain arrays. After all, 
the size of the array is a loop invariant, and no check should occur 
there. Off the top of my head, I recall a similar issue with a benchmark in 
our repository [1] (you will probably recognize the shape there, as it's 
very similar to yours). In that case, to get to optimal performance, 
some extra casts to `long` needed to be added, as C2 cannot yet optimize 
loops with that particular shape. Note that all the bounds check analysis 
on memory segments is built on longs (unlike arrays and byte buffers) 
and we rely on C2 to optimize the common cases where the accessed offset is 
clearly a "small long". In some cases this check doesn't work (yet), and 
some "manual help" is needed. From my notes on a conversation with 
Roland (who did most of the optimization work here):

> The expectation is that the loop variable and the exit test operate on a single type

At the time, we had bigger fish to fry, but if this turns out to be 
the reason behind the numbers you are seeing, then it might be time to 
look again and try to fix this.
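
To make the "loop shape" point concrete, here is a minimal sketch of two element-wise add loops over memory segments in which the loop variable and the exit test share one type: an `int` loop whose index is explicitly widened to `long` at the access site, and a loop written entirely in `long`. The class and method names (`SegmentAddSketch`, `addIntLoop`, `addLongLoop`, `run`) are hypothetical and are not taken from UnrolledAccess or from your repository. It assumes JDK 22 or later, where the FFM API is final. It only illustrates the shapes; which one C2 optimizes best on a given build would have to be measured, e.g. with JMH.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.util.Arrays;

public class SegmentAddSketch {

    // Shape 1: int loop variable and int exit test; the index is widened
    // to long explicitly where the segment is accessed.
    static void addIntLoop(MemorySegment a, MemorySegment b, MemorySegment out, int size) {
        for (int i = 0; i < size; i++) {
            long idx = (long) i;
            double sum = a.getAtIndex(ValueLayout.JAVA_DOUBLE, idx)
                       + b.getAtIndex(ValueLayout.JAVA_DOUBLE, idx);
            out.setAtIndex(ValueLayout.JAVA_DOUBLE, idx, sum);
        }
    }

    // Shape 2: long loop variable and long exit test; no casts needed.
    static void addLongLoop(MemorySegment a, MemorySegment b, MemorySegment out, long size) {
        for (long i = 0; i < size; i++) {
            double sum = a.getAtIndex(ValueLayout.JAVA_DOUBLE, i)
                       + b.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            out.setAtIndex(ValueLayout.JAVA_DOUBLE, i, sum);
        }
    }

    // Allocates three off-heap segments, fills the inputs, runs one of the
    // two loops and copies the result to a heap array before the arena closes.
    static double[] run(int size, boolean longLoop) {
        try (Arena arena = Arena.ofConfined()) {
            long bytes = (long) size * Double.BYTES;
            MemorySegment a = arena.allocate(bytes, Double.BYTES);
            MemorySegment b = arena.allocate(bytes, Double.BYTES);
            MemorySegment out = arena.allocate(bytes, Double.BYTES);
            for (int i = 0; i < size; i++) {
                a.setAtIndex(ValueLayout.JAVA_DOUBLE, (long) i, (double) i);
                b.setAtIndex(ValueLayout.JAVA_DOUBLE, (long) i, 2.0 * i);
            }
            if (longLoop) {
                addLongLoop(a, b, out, size);
            } else {
                addIntLoop(a, b, out, size);
            }
            return out.toArray(ValueLayout.JAVA_DOUBLE);
        }
    }

    public static void main(String[] args) {
        double[] viaInt = run(1024, false);
        double[] viaLong = run(1024, true);
        System.out.println("results match: " + Arrays.equals(viaInt, viaLong));
    }
}
```

Both loops compute the same result; the only difference is the type used for the induction variable and the exit test, which is the part the bounds check analysis is sensitive to.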

Cheers
Maurizio

[1] - https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/lang/foreign/UnrolledAccess.java

On 20/03/2024 09:00, Antoine Chambille wrote:
> Hi everyone,
>
> I'm looking at two array arithmetic benchmarks with Panama.
> https://github.com/chamb/panama-benchmarks
>
> AddBenchmark: benchmark the element-wise addition of two arrays of 
> numbers. We test over standard Java arrays and (off-heap) native 
> memory, via array access, Unsafe and MemorySegment. Using and not 
> using the vector API.
>
> SumBenchmark: Sum all the elements in an array of numbers. We 
> benchmark over standard Java arrays and (off-heap) native memory, via 
> array access, Unsafe and MemorySegment. Using and not using the vector 
> API.
>
> I'm building openjdk from the master at 
> https://github.com/openjdk/panama-vector
> Windows laptop with Intel core i9-11950H.
>
> Impressive to perform SIMD on native memory in pure Java! And I hope 
> it's possible to optimize it further.
>
> AddBenchmark
>  .scalarArrayArray             4741341.171 ops/s
>  .scalarArrayArrayLongStride    973926.689 ops/s
>  .scalarSegmentArray           1809480.000 ops/s
>  .scalarSegmentSegment         1231606.029 ops/s
>  .scalarUnsafeArray           10972240.434 ops/s
>  .scalarUnsafeUnsafe           1246565.503 ops/s
>  .unrolledArrayArray           1236491.068 ops/s
>  .unrolledSegmentArray         1787171.351 ops/s
>  .unrolledUnsafeArray          5700087.751 ops/s
>  .unrolledUnsafeUnsafe         1236456.434 ops/s
>  .vectorArrayArray             7252565.080 ops/s
>  .vectorArraySegment           6938948.826 ops/s
>  .vectorSegmentArray           4953042.042 ops/s
>  .vectorSegmentSegment         4606278.152 ops/s
>
> Loops over arrays seem automatically optimized, but not when the loop 
> has a 'long' stride.
> Reading from Segment seems to defeat loop optimisations and/or add 
> overhead. It gets worse when writing to Segment.
> Manual unrolling makes things worse in all cases.
> 'scalarUnsafeArray' (read with Unsafe, write with 
> array) is twice as fast as almost anything else.
> The vector API is fast and consistent, but maybe not at its full 
> potential, and the use of Segment degrades performance.
>
>
> SumBenchmark
>  .scalarArray                   671030.727 ops/s
>  .scalarUnsafe                  669296.228 ops/s
>  .unrolledArray                2600591.019 ops/s
>  .unrolledUnsafe               2448826.428 ops/s
>  .vectorArrayV1                7313657.874 ops/s
>  .vectorArrayV2                2239302.424 ops/s
>  .vectorSegmentV1              7470192.252 ops/s
>  .vectorSegmentV2              2183291.818 ops/s
>
> This is more in line. Manual unrolling seems to enable some level of 
> optimization, and then the vector API gives the best performance.
>
>
> Best,
> -Antoine