Array addition and array sum Panama benchmarks

Antoine Chambille ach at activeviam.com
Wed Mar 20 12:38:16 UTC 2024


Thank you, Maurizio, for delving into the benchmark. I will follow this
thread closely!

Best,
-Antoine

On Wed, Mar 20, 2024 at 12:47 PM Maurizio Cimadamore <
maurizio.cimadamore at oracle.com> wrote:

> Hi,
> I did some more analysis of your benchmark. Most of the non-unrolled,
> non-vectorized benchmarks perform similarly. Looking at the generated
> assembly, it seems that bounds checks are correctly hoisted outside the
> loop. So the suggestion I made earlier is not going to make any difference
> - in fact, the scalarSegmentSegment version is already competitive with
> scalarUnsafeUnsafe.
>
> But there is indeed a big difference between scalarUnsafeUnsafe and
> everything else. The assembly indicates that this version is the only one
> that uses vectorized instructions (e.g. addpd instead of addsd). This
> probably indicates that *some* optimization is failing (across the board,
> except in that specific case), but I’m not a C2 expert, so I can’t point to
> the underlying cause.
>
> I’m adding Vlad and Roland, as they might have a better idea of what’s
> going on.
>
> To make things easier to test, I’ve put together a jdk branch which
> includes a bunch of relevant benchmarks, including a subset of your
> AddBenchmark:
>
>
> https://github.com/openjdk/jdk/compare/master...mcimadamore:jdk:AddBenchmark?expand=1
>
> I think a good place to start would be to explain the difference between
> scalarUnsafeArray and scalarUnsafeUnsafe. The other benchmark can be looked
> at later, as (after looking at the assembly) I don’t think this is an issue
> that is specific to FFM.
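>
> To frame the comparison, the two access patterns are roughly of this kind
> (a simplified sketch only; the class and method names are invented for
> illustration and this is not the code from the branch above):
>
> import java.lang.reflect.Field;
> import sun.misc.Unsafe;
>
> class UnsafeShapes {
>     static final Unsafe U = unsafe();
>
>     // scalarUnsafeArray-like shape:
>     // read off-heap via Unsafe, write into a Java array
>     static void addIntoArray(long srcAddr, double[] dst) {
>         for (int i = 0; i < dst.length; i++) {
>             dst[i] += U.getDouble(srcAddr + (long) i * Double.BYTES);
>         }
>     }
>
>     // scalarUnsafeUnsafe-like shape:
>     // both the reads and the write go through Unsafe
>     static void addIntoNative(long srcAddr, long dstAddr, int count) {
>         for (int i = 0; i < count; i++) {
>             long off = (long) i * Double.BYTES;
>             U.putDouble(dstAddr + off,
>                         U.getDouble(dstAddr + off) + U.getDouble(srcAddr + off));
>         }
>     }
>
>     // standard reflective access to sun.misc.Unsafe
>     private static Unsafe unsafe() {
>         try {
>             Field f = Unsafe.class.getDeclaredField("theUnsafe");
>             f.setAccessible(true);
>             return (Unsafe) f.get(null);
>         } catch (ReflectiveOperationException e) {
>             throw new ExceptionInInitializerError(e);
>         }
>     }
> }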
>
> For the record, here are the results I get:
>
> Benchmark                                    Mode  Cnt    Score   Error  Units
> AddBenchmark.scalarArrayArray                avgt   30   94.475 ± 0.554  ns/op
> AddBenchmark.scalarArrayArrayLongStride      avgt   30  481.030 ± 4.477  ns/op
> AddBenchmark.scalarBufferArray               avgt   30  339.244 ± 4.546  ns/op
> AddBenchmark.scalarBufferBuffer              avgt   30  329.813 ± 2.504  ns/op
> AddBenchmark.scalarSegmentArray              avgt   30  376.254 ± 5.192  ns/op
> AddBenchmark.scalarSegmentSegment            avgt   30  302.793 ± 4.767  ns/op
> AddBenchmark.scalarSegmentSegmentLongStride  avgt   30  305.078 ± 4.252  ns/op
> AddBenchmark.scalarUnsafeArray               avgt   30   95.765 ± 1.295  ns/op
> AddBenchmark.scalarUnsafeUnsafe              avgt   30  358.060 ± 4.868  ns/op
>
> Cheers
> Maurizio
>
> On 20/03/2024 10:26, Maurizio Cimadamore wrote:
>
> Hi Antoine,
> thanks for the benchmark. From the numbers you are getting in the
> AddBenchmark, my gut feeling is that, for memory segments, bounds checks are
> not being hoisted outside the loop. That would cause the kind of
> degradation you are seeing here. I'm also surprised to see, for that
> benchmark, that Unsafe is > 2x faster than using plain arrays; after all,
> the size of the array is a loop invariant, and no check should occur there.
> Off the top of my head, I recall a similar issue with a benchmark in our
> repository [1] (you will probably recognize the shape there, as it's very
> similar to yours). In that case, to get optimal performance, some extra
> casts to `long` needed to be added, as C2 cannot yet optimize loops with
> that particular shape. Note that all the bounds check analysis on memory
> segments is built on longs (unlike arrays and byte buffers), and we rely on
> C2 to optimize the common cases where the accessed offset is clearly a
> "small long". In some cases this optimization doesn't work (yet), and some
> "manual help" is needed. From my notes of a conversation with Roland (who
> did most of the optimization work here):
>
> The expectation is that the loop variable and the exit test operate on a
> single type
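>
> One shape that satisfies that expectation is to keep the induction
> variable, the exit test and the offset arithmetic all in long, e.g. as in
> the sketch below (illustrative only, not the benchmark code, and no
> guarantee that this alone is enough for every loop):
>
> import java.lang.foreign.Arena;
> import java.lang.foreign.MemorySegment;
> import java.lang.foreign.ValueLayout;
>
> class LongStrideSketch {
>     // induction variable, exit test and offsets all stay in long
>     static void add(MemorySegment dst, MemorySegment a, MemorySegment b,
>                     long count) {
>         for (long i = 0; i < count; i++) {
>             long off = i * Double.BYTES;
>             double sum = a.get(ValueLayout.JAVA_DOUBLE, off)
>                        + b.get(ValueLayout.JAVA_DOUBLE, off);
>             dst.set(ValueLayout.JAVA_DOUBLE, off, sum);
>         }
>     }
>
>     public static void main(String[] args) {
>         try (Arena arena = Arena.ofConfined()) {
>             long n = 1024;
>             MemorySegment a = arena.allocate(ValueLayout.JAVA_DOUBLE, n);
>             MemorySegment b = arena.allocate(ValueLayout.JAVA_DOUBLE, n);
>             MemorySegment dst = arena.allocate(ValueLayout.JAVA_DOUBLE, n);
>             add(dst, a, b, n);
>         }
>     }
> }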
>
> At the time, we had bigger fish to fry, but if this turns out to be the
> reason behind the numbers you are seeing, then it might be time to look
> again and try to fix this.
>
> Cheers
> Maurizio
>
> [1] -
> https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/lang/foreign/UnrolledAccess.java
>
> On 20/03/2024 09:00, Antoine Chambille wrote:
>
> Hi everyone,
>
> I'm looking at two array arithmetic benchmarks with Panama.
> https://github.com/chamb/panama-benchmarks
>
> AddBenchmark: benchmarks the element-wise addition of two arrays of
> numbers. We test over standard Java arrays and (off-heap) native memory,
> accessed via array indexing, Unsafe and MemorySegment, both with and
> without the Vector API.
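>
> Roughly, the scalar and vector flavors over plain Java arrays have the
> following shapes (a simplified sketch with invented names, not the exact
> benchmark code):
>
> import jdk.incubator.vector.DoubleVector;
> import jdk.incubator.vector.VectorSpecies;
>
> class AddShapes {
>     static final VectorSpecies<Double> SPECIES =
>             DoubleVector.SPECIES_PREFERRED;
>
>     // plain scalar loop: result[i] = a[i] + b[i]
>     static void scalarAdd(double[] a, double[] b, double[] result) {
>         for (int i = 0; i < result.length; i++) {
>             result[i] = a[i] + b[i];
>         }
>     }
>
>     // Vector API loop over the same arrays, plus a scalar tail
>     static void vectorAdd(double[] a, double[] b, double[] result) {
>         int i = 0;
>         int bound = SPECIES.loopBound(result.length);
>         for (; i < bound; i += SPECIES.length()) {
>             DoubleVector va = DoubleVector.fromArray(SPECIES, a, i);
>             DoubleVector vb = DoubleVector.fromArray(SPECIES, b, i);
>             va.add(vb).intoArray(result, i);
>         }
>         for (; i < result.length; i++) {
>             result[i] = a[i] + b[i];
>         }
>     }
> }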
>
> SumBenchmark: sums all the elements of an array of numbers. We benchmark
> over standard Java arrays and (off-heap) native memory, accessed via array
> indexing, Unsafe and MemorySegment, both with and without the Vector API.
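>
> The vector sum variants typically accumulate lane-wise and do a single
> cross-lane reduction at the end; a minimal sketch of that idea follows
> (illustrative only, not necessarily the V1/V2 shapes in the benchmark, and
> note the summation order differs from a plain scalar loop):
>
> import jdk.incubator.vector.DoubleVector;
> import jdk.incubator.vector.VectorOperators;
> import jdk.incubator.vector.VectorSpecies;
>
> class SumShape {
>     static final VectorSpecies<Double> SPECIES =
>             DoubleVector.SPECIES_PREFERRED;
>
>     static double vectorSum(double[] a) {
>         DoubleVector acc = DoubleVector.zero(SPECIES);
>         int i = 0;
>         int bound = SPECIES.loopBound(a.length);
>         for (; i < bound; i += SPECIES.length()) {
>             // accumulate element-wise into a vector of partial sums
>             acc = acc.add(DoubleVector.fromArray(SPECIES, a, i));
>         }
>         // one cross-lane reduction, then the scalar tail
>         double sum = acc.reduceLanes(VectorOperators.ADD);
>         for (; i < a.length; i++) {
>             sum += a[i];
>         }
>         return sum;
>     }
> }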
>
> I'm building OpenJDK from the master branch at
> https://github.com/openjdk/panama-vector
> and running on a Windows laptop with an Intel Core i9-11950H.
>
> Impressive to perform SIMD on native memory in pure Java! And I hope it's
> possible to optimize it further.
>
> AddBenchmark
>  .scalarArrayArray             4741341.171 ops/s
>  .scalarArrayArrayLongStride    973926.689 ops/s
>  .scalarSegmentArray           1809480.000 ops/s
>  .scalarSegmentSegment         1231606.029 ops/s
>  .scalarUnsafeArray           10972240.434 ops/s
>  .scalarUnsafeUnsafe           1246565.503 ops/s
>  .unrolledArrayArray           1236491.068 ops/s
>  .unrolledSegmentArray         1787171.351 ops/s
>  .unrolledUnsafeArray          5700087.751 ops/s
>  .unrolledUnsafeUnsafe         1236456.434 ops/s
>  .vectorArrayArray             7252565.080 ops/s
>  .vectorArraySegment           6938948.826 ops/s
>  .vectorSegmentArray           4953042.042 ops/s
>  .vectorSegmentSegment         4606278.152 ops/s
>
> Loops over arrays seem to be automatically optimized, but not when the
> loop has a 'long' stride.
> Reading from a Segment seems to defeat loop optimizations and/or add
> overhead. It gets worse when writing to a Segment.
> Manual unrolling makes things worse in all cases.
> 'scalarUnsafeArray' (read with Unsafe, write with an array) is roughly
> twice as fast as almost anything else.
> The Vector API is fast and consistent, though maybe not at its full
> potential, and using a Segment degrades performance.
>
>
> SumBenchmark
>  .scalarArray                   671030.727 ops/s
>  .scalarUnsafe                  669296.228 ops/s
>  .unrolledArray                2600591.019 ops/s
>  .unrolledUnsafe               2448826.428 ops/s
>  .vectorArrayV1                7313657.874 ops/s
>  .vectorArrayV2                2239302.424 ops/s
>  .vectorSegmentV1              7470192.252 ops/s
>  .vectorSegmentV2              2183291.818 ops/s
>
> These results are more in line with expectations. Manual unrolling seems
> to enable some level of optimization, and the Vector API then gives the
> best performance.
>
>
> Best,
> -Antoine
>

