Array addition and array sum Panama benchmarks
Antoine Chambille
ach at activeviam.com
Wed Mar 20 09:00:17 UTC 2024
Hi everyone,
I'm looking at two array arithmetic benchmarks with Panama.
https://github.com/chamb/panama-benchmarks
AddBenchmark: benchmarks the element-wise addition of two arrays of numbers.
We test over standard Java arrays and (off-heap) native memory, via array
access, Unsafe and MemorySegment, both with and without the vector API.
SumBenchmark: sums all the elements in an array of numbers. We benchmark
over standard Java arrays and (off-heap) native memory, via array access,
Unsafe and MemorySegment, both with and without the vector API.
I'm building OpenJDK from the master branch of
https://github.com/openjdk/panama-vector
The benchmarks run on a Windows laptop with an Intel Core i9-11950H.
Impressive to perform SIMD on native memory in pure Java! And I hope it's
possible to optimize it further.
AddBenchmark
.scalarArrayArray 4741341.171 ops/s
.scalarArrayArrayLongStride 973926.689 ops/s
.scalarSegmentArray 1809480.000 ops/s
.scalarSegmentSegment 1231606.029 ops/s
.scalarUnsafeArray 10972240.434 ops/s
.scalarUnsafeUnsafe 1246565.503 ops/s
.unrolledArrayArray 1236491.068 ops/s
.unrolledSegmentArray 1787171.351 ops/s
.unrolledUnsafeArray 5700087.751 ops/s
.unrolledUnsafeUnsafe 1236456.434 ops/s
.vectorArrayArray 7252565.080 ops/s
.vectorArraySegment 6938948.826 ops/s
.vectorSegmentArray 4953042.042 ops/s
.vectorSegmentSegment 4606278.152 ops/s
Loops over arrays seem automatically optimized, but not when the loop has a
'long' stride.
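
To make the int versus 'long' stride distinction concrete, here is a minimal sketch of the two loop shapes. It is not copied from the benchmark repository: the class and method names are hypothetical, the element type (double) is an assumption, and it assumes that "long stride" means a loop whose index variable is a long.

```java
import java.util.Arrays;

// Hypothetical sketch; names and element type are assumptions,
// not taken from the benchmark repository.
final class StrideSketch {

    // Element-wise addition with an int loop index.
    static void addIntIndex(double[] a, double[] b, double[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    // Same computation with a long loop index, narrowed to int on each access.
    static void addLongIndex(double[] a, double[] b, double[] out) {
        for (long i = 0; i < out.length; i++) {
            out[(int) i] = a[(int) i] + b[(int) i];
        }
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 3.0};
        double[] b = {10.0, 20.0, 30.0};
        double[] viaInt = new double[a.length];
        double[] viaLong = new double[a.length];
        addIntIndex(a, b, viaInt);
        addLongIndex(a, b, viaLong);
        // Both loops compute the same values; only the index type differs.
        System.out.println(Arrays.equals(viaInt, viaLong));
    }
}
```

Both methods produce identical results, so any throughput gap between them would come from how the JIT treats the loop rather than from the arithmetic.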
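Reading from a Segment seems to defeat loop optimizations and/or add
overhead. It gets worse when also writing to a Segment.
Manual unrolling does not help in any case: it is much slower for
ArrayArray and UnsafeArray, and marginally slower for SegmentArray and
UnsafeUnsafe.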
'scalarUnsafeArray' (read with Unsafe, write with array) is about twice
as fast as almost anything else.
The vector API is fast and consistent, but maybe not at its full potential,
and the use of Segment degrades performance.
SumBenchmark
.scalarArray 671030.727 ops/s
.scalarUnsafe 669296.228 ops/s
.unrolledArray 2600591.019 ops/s
.unrolledUnsafe 2448826.428 ops/s
.vectorArrayV1 7313657.874 ops/s
.vectorArrayV2 2239302.424 ops/s
.vectorSegmentV1 7470192.252 ops/s
.vectorSegmentV2 2183291.818 ops/s
These results are more in line with expectations. Manual unrolling seems
to enable some level of optimization, and then the vector API gives the
best performance.
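
As an illustration of the scalar and manually unrolled sums, here is a minimal sketch. It is not the code from the benchmark repository: the names are hypothetical, and both the element type (double) and the unroll factor (4) are assumptions. One plausible reason unrolling helps here is that several independent accumulators break the single dependency chain of the plain loop, but that is a guess, not something measured.

```java
// Hypothetical sketch; names, element type and unroll factor are assumptions,
// not taken from the benchmark repository.
final class SumSketch {

    // Plain loop: every addition depends on the previous one.
    static double scalarSum(double[] a) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Manually unrolled by 4 with independent accumulators.
    static double unrolledSum(double[] a) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int limit = a.length - (a.length % 4); // largest multiple of 4 <= length
        int i = 0;
        for (; i < limit; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        double sum = (s0 + s1) + (s2 + s3);
        // Tail loop for the remaining 0 to 3 elements.
        for (; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(scalarSum(a));
        System.out.println(unrolledSum(a));
    }
}
```

Note that the unrolled version adds the elements in a different order, so for general floating-point input its result can differ from the plain loop in the last bits.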
Best,
-Antoine
More information about the panama-dev mailing list