<div dir="ltr">Hi everyone,<br><br>I'm looking at two array arithmetic benchmarks with Panama:<br><a href="https://github.com/chamb/panama-benchmarks">https://github.com/chamb/panama-benchmarks</a><br><br>AddBenchmark: element-wise addition of two arrays of numbers. We test over standard Java arrays and (off-heap) native memory, accessed via array indexing, Unsafe and MemorySegment, with and without the Vector API.<br><br>SumBenchmark: sum of all the elements in an array of numbers. We test over standard Java arrays and (off-heap) native memory, accessed via array indexing, Unsafe and MemorySegment, with and without the Vector API.<br><br>I'm building OpenJDK from master at <a href="https://github.com/openjdk/panama-vector">https://github.com/openjdk/panama-vector</a><br>The machine is a Windows laptop with an Intel Core i9-11950H.<br><br>It is impressive to be able to perform SIMD on native memory in pure Java! I hope it's possible to optimize it further.<br><br><font face="monospace">AddBenchmark<br> .scalarArrayArray             4741341.171 ops/s<br> .scalarArrayArrayLongStride    973926.689 ops/s<br> .scalarSegmentArray           1809480.000 ops/s<br> .scalarSegmentSegment         1231606.029 ops/s<br> .scalarUnsafeArray           10972240.434 ops/s<br> .scalarUnsafeUnsafe           1246565.503 ops/s<br> .unrolledArrayArray           1236491.068 ops/s<br> .unrolledSegmentArray         1787171.351 ops/s<br> .unrolledUnsafeArray          5700087.751 ops/s<br> .unrolledUnsafeUnsafe         1236456.434 ops/s<br> .vectorArrayArray             7252565.080 ops/s<br> .vectorArraySegment           6938948.826 ops/s<br> .vectorSegmentArray           4953042.042 ops/s<br> .vectorSegmentSegment         4606278.152 ops/s</font><br><br>Loops over arrays seem to be optimized automatically, but not when the loop has a 'long' stride (almost 5 times slower).<br>Reading from a Segment seems to defeat loop optimisations and/or add overhead. 
It gets worse when also writing to a Segment.<br>Manual unrolling never helps: it is clearly slower for ArrayArray and UnsafeArray, and makes no real difference for SegmentArray and UnsafeUnsafe.<br>'scalarUnsafeArray' (read with Unsafe, write to an array) is the fastest variant: more than twice as fast as 'scalarArrayArray' and about 1.5 times as fast as 'vectorArrayArray'.<br>The Vector API is fast and consistent, but maybe not at its full potential, and the use of Segment degrades performance.<br><br><font face="monospace"><br>SumBenchmark<br> .scalarArray                   671030.727 ops/s<br> .scalarUnsafe                  669296.228 ops/s<br> .unrolledArray                2600591.019 ops/s<br> .unrolledUnsafe               2448826.428 ops/s<br> .vectorArrayV1                7313657.874 ops/s<br> .vectorArrayV2                2239302.424 ops/s<br> .vectorSegmentV1              7470192.252 ops/s<br> .vectorSegmentV2              2183291.818 ops/s</font><br><br>These results are more in line with expectations. Manual unrolling seems to enable some level of optimization (almost 4 times faster than the scalar loops), and the V1 variants of the Vector API give the best performance, close to 3 times as fast as the unrolled loops. The V2 variants are no faster than manual unrolling.<br><div><br></div><div><br></div><div>Best,</div><div>-Antoine</div></div>