Array addition and array sum Panama benchmarks

Mon Sep 30 12:10:55 UTC 2024

Hi Antoine,
Auto-vectorization on memory segments doesn't work in some cases. This 
is mostly due to:

https://bugs.openjdk.org/browse/JDK-8324751

That is, when working with a "source" and a "target" segment, if the 
auto-vectorizer cannot prove that the two segments are disjoint, no 
vectorization occurs.

This is an issue for operations like add or copy, but not for something 
like MemorySegment::fill (as that method only works on a single 
segment).
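
To make the two-segment case concrete, here is a minimal sketch. It is 
not taken from the benchmark: the class and method names are 
hypothetical, and the double element type is an assumption. It contrasts 
a scalar loop over arrays with the same loop over a "source" and a 
"target" segment, which is the shape that may fail to vectorize while 
the issue above is open.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration class; not part of the JDK or of the benchmark.
final class SegmentAddSketch {

    // Scalar loop over heap arrays (the "array to array" shape).
    static void addArrays(double[] a, double[] b, double[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    // Same loop over memory segments. The JIT only sees three MemorySegment
    // references here; nothing in the signature says they are disjoint.
    static void addSegments(MemorySegment a, MemorySegment b, MemorySegment out, int n) {
        for (int i = 0; i < n; i++) {
            double x = a.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            double y = b.getAtIndex(ValueLayout.JAVA_DOUBLE, i);
            out.setAtIndex(ValueLayout.JAVA_DOUBLE, i, x + y);
        }
    }

    public static void main(String[] args) {
        int n = 1024;
        try (Arena arena = Arena.ofConfined()) {
            long bytes = (long) n * Double.BYTES;
            MemorySegment a = arena.allocate(bytes, Double.BYTES);
            MemorySegment b = arena.allocate(bytes, Double.BYTES);
            MemorySegment out = arena.allocate(bytes, Double.BYTES);
            for (int i = 0; i < n; i++) {
                a.setAtIndex(ValueLayout.JAVA_DOUBLE, i, (double) i);
                b.setAtIndex(ValueLayout.JAVA_DOUBLE, i, 10.0);
            }
            addSegments(a, b, out, n);
            System.out.println(out.getAtIndex(ValueLayout.JAVA_DOUBLE, 3));
            // The caller can check for overlap at run time, but the compiled
            // loop in addSegments cannot assume the result of such a check.
            System.out.println(a.asOverlappingSlice(out).isEmpty());
        }
    }
}
```

Both methods compute the same result; the difference is only in what 
the compiler can prove about aliasing between the source and the target.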

We hope to be able to make some progress on this issue, as that will 
allow third-party routines on memory segments to enjoy vectorization 
too, without the need for an intrinsic in the JDK.

Maurizio

On 30/09/2024 13:04, Antoine Chambille wrote:
> Hello everyone,
>
> I've rebuilt the latest OpenJDK (24) from 
> https://github.com/openjdk/panama-vector and run the arrays addition 
> benchmark another time:
>
> AddBenchmark
>  .scalarArrayArray            thrpt    5   6487636 ops/s
>  .scalarArrayArrayLongStride  thrpt    5   1001515 ops/s
>  .scalarSegmentArray          thrpt    5   1747531 ops/s
>  .scalarSegmentSegment        thrpt    5   1154193 ops/s
>  .scalarUnsafeArray           thrpt    5   6970073 ops/s
>  .scalarUnsafeUnsafe          thrpt    5   1246625 ops/s
>  .unrolledArrayArray          thrpt    5   1251824 ops/s
>  .unrolledSegmentArray        thrpt    5   1694164 ops/s
>  .unrolledUnsafeArray         thrpt    5   5043685 ops/s
>  .unrolledUnsafeUnsafe        thrpt    5   1197024 ops/s
>  .vectorArrayArray            thrpt    5   7200224 ops/s
>  .vectorArraySegment          thrpt    5   7377553 ops/s
>  .vectorSegmentArray          thrpt    5   7263505 ops/s
>  .vectorSegmentSegment        thrpt    5   7143647 ops/s
>
>   * Performance using the vector API is now very consistent and good
>     across arrays and segments.
>   * Reading and writing from/to segments still seems to be disrupting
>     auto-vectorization. Reading with Unsafe works well but it's marked
>     for removal.
>   * Less important, manual unrolling also seems to be disrupting
>     auto-vectorization.
>
>
>
> Best,
> -Antoine
>
> On Tue, Mar 26, 2024 at 5:40 PM Vladimir Ivanov 
> <vladimir.x.ivanov at oracle.com> wrote:
>
>
>     >> Personally, I prefer to see vectorizer handling
>     >> "MoveX2Y (LoadX mem)" => "VectorReinterpret (LoadVector mem)"
>     >> well and then introduce rules to strength-reduce it to
>     >> mismatched access.
>     >
>     > Do I understand you right that you're saying the vector node for
>     > MoveL2D (for instance) is VectorReinterpret so we could vectorize
>     > the code.
>     >
>     > Are you then suggesting that we can transform:
>     >
>     > (VectorReinterpret (LoadVector mem))
>     >
>     > into:
>     >
>     > (LoadVector mem)
>     >
>     > with that LoadVector a mismatched access?
>
>     Yes, but thinking more about it, the latter step may be optional. For
>     example, VectorReinterpret implementation on x86 is a no-op, so not
>     much gained from folding VectorReinterpret+LoadVector into a
>     mismatched LoadVector.
>
>     Best regards,
>     Vladimir Ivanov
>