Array addition and array sum Panama benchmarks

Wed Mar 20 09:00:17 UTC 2024

Hi everyone,

I'm looking at two array arithmetic benchmarks with Panama.
https://github.com/chamb/panama-benchmarks

AddBenchmark: benchmarks the element-wise addition of two arrays of numbers.
We test over standard Java arrays and (off-heap) native memory, accessed via
plain array indexing, Unsafe and MemorySegment, both with and without the
Vector API.
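
For illustration, here is a minimal sketch of what a vectorized element-wise
addition over plain Java arrays can look like. It is not copied from the
benchmark repository: the class name VectorAddSketch, the method name add and
the use of double elements are assumptions. The Vector API is an incubator
module, so the code needs --add-modules jdk.incubator.vector at both compile
time and run time.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical illustration, not the code from the benchmark repository.
class VectorAddSketch {
    // Widest vector shape the current platform prefers for double lanes.
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    /** Writes a[i] + b[i] into out[i]; the three arrays must have equal length. */
    static void add(double[] a, double[] b, double[] out) {
        int i = 0;
        // Largest multiple of the lane count that fits in the array.
        int upper = SPECIES.loopBound(a.length);
        for (; i < upper; i += SPECIES.length()) {
            DoubleVector va = DoubleVector.fromArray(SPECIES, a, i);
            DoubleVector vb = DoubleVector.fromArray(SPECIES, b, i);
            va.add(vb).intoArray(out, i);
        }
        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5};
        double[] b = {10, 20, 30, 40, 50};
        double[] out = new double[a.length];
        add(a, b, out);
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

The native memory variants would load and store through a MemorySegment
instead of an array, which this sketch leaves out.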

SumBenchmark: sums all the elements of an array of numbers. We benchmark
over standard Java arrays and (off-heap) native memory, accessed via plain
array indexing, Unsafe and MemorySegment, both with and without the Vector
API.

I'm building OpenJDK from the master branch of
https://github.com/openjdk/panama-vector
The machine is a Windows laptop with an Intel Core i9-11950H.

It's impressive to be able to perform SIMD on native memory in pure Java!
I hope it's possible to optimize it further.

AddBenchmark
 .scalarArrayArray             4741341.171 ops/s
 .scalarArrayArrayLongStride    973926.689 ops/s
 .scalarSegmentArray           1809480.000 ops/s
 .scalarSegmentSegment         1231606.029 ops/s
 .scalarUnsafeArray           10972240.434 ops/s
 .scalarUnsafeUnsafe           1246565.503 ops/s
 .unrolledArrayArray           1236491.068 ops/s
 .unrolledSegmentArray         1787171.351 ops/s
 .unrolledUnsafeArray          5700087.751 ops/s
 .unrolledUnsafeUnsafe         1236456.434 ops/s
 .vectorArrayArray             7252565.080 ops/s
 .vectorArraySegment           6938948.826 ops/s
 .vectorSegmentArray           4953042.042 ops/s
 .vectorSegmentSegment         4606278.152 ops/s

Loops over arrays seem to be optimized automatically, but not when the loop
has a 'long' stride.
Reading from a Segment seems to defeat loop optimisations and/or add
overhead, and it gets worse when also writing to a Segment.
Manual unrolling never helps: it is much slower for ArrayArray and
UnsafeArray, and makes no real difference for SegmentArray and UnsafeUnsafe.
'scalarUnsafeArray' (read with Unsafe, write to an array) is the fastest
variant by a wide margin, at about 1.5x the best vector variant and more
than 2x the plain array loop.
The Vector API is fast and consistent, but perhaps not yet at its full
potential, and using a Segment degrades its performance.
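
To make the 'long' stride observation concrete, here is a minimal sketch of
the two loop shapes being compared, assuming that 'long stride' means the
loop counter is declared as long rather than int. The class and method names
are hypothetical and the element type (double) is an assumption. As far as I
understand, HotSpot's JIT has historically applied more of its loop
optimizations to int-indexed counted loops, which would be consistent with
the gap measured above, but that is offered as a possible explanation and
not as a diagnosis.

```java
// Hypothetical illustration of the two loop shapes, not the benchmark code.
class StrideSketch {
    /** Element-wise addition with an int loop counter. */
    static void addIntIndex(double[] a, double[] b, double[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    /** Same computation, but the loop counter is a long. */
    static void addLongIndex(double[] a, double[] b, double[] out) {
        for (long i = 0; i < out.length; i++) {
            int idx = (int) i; // Java arrays can only be indexed with int
            out[idx] = a[idx] + b[idx];
        }
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3};
        double[] b = {4, 5, 6};
        double[] viaInt = new double[3];
        double[] viaLong = new double[3];
        addIntIndex(a, b, viaInt);
        addLongIndex(a, b, viaLong);
        // Both loops compute the same values; only the counter type differs.
        System.out.println(java.util.Arrays.equals(viaInt, viaLong));
    }
}
```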

SumBenchmark
 .scalarArray                   671030.727 ops/s
 .scalarUnsafe                  669296.228 ops/s
 .unrolledArray                2600591.019 ops/s
 .unrolledUnsafe               2448826.428 ops/s
 .vectorArrayV1                7313657.874 ops/s
 .vectorArrayV2                2239302.424 ops/s
 .vectorSegmentV1              7470192.252 ops/s
 .vectorSegmentV2              2183291.818 ops/s

These results are more in line with expectations. Manual unrolling seems to
enable some level of optimization, and the Vector API (in its V1 variants)
gives the best performance.
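
As a rough sketch of why manual unrolling can help a sum, compare a single
accumulator with four independent ones. This is an illustration under
assumptions, not the benchmark code: the names are hypothetical and the
element type is assumed to be double. With one accumulator every addition
depends on the previous one, and floating-point additions cannot be reordered
without possibly changing the result, so the four-accumulator version gives
the CPU independent chains to work on in parallel.

```java
// Hypothetical illustration, not the code from the benchmark repository.
class SumSketch {
    /** Straightforward sum: one accumulator, one long dependency chain. */
    static double sumScalar(double[] data) {
        double sum = 0.0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    /** Sum unrolled by four, with four independent accumulators. */
    static double sumUnrolled4(double[] data) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i = 0;
        int upper = data.length - (data.length % 4);
        for (; i < upper; i += 4) {
            s0 += data[i];
            s1 += data[i + 1];
            s2 += data[i + 2];
            s3 += data[i + 3];
        }
        double sum = (s0 + s1) + (s2 + s3);
        // Scalar tail for the last zero to three elements.
        for (; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(sumScalar(data));
        System.out.println(sumUnrolled4(data));
    }
}
```

The unrolled version adds the elements in a different order, so on general
floating-point data its result can differ from the scalar sum in the last
bits.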

Best,
-Antoine