RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v2]

Fri Jun 13 15:20:59 UTC 2025

On Tue, 3 Jun 2025 15:17:07 GMT, Bhavana Kilambi <bkilambi at openjdk.org> wrote:

>> Oh, I forgot that we have the `blend + rearrange` pattern if this op is not supported directly. Since `VectorRearrange` for 2D has been implemented now, did you check the final codegen of the default pattern? I think we can first revisit the codegen with the default pattern (i.e. `VectorBlend + VectorRearrange + VectorRearrange`) and see whether there is any further room for improvement. If so, we can implement the `SelectFromTwoVector` op directly, based on that improvement point. Otherwise, just continuing to use the default pattern is fine with me.
>
> Hi @XiaohongGong, thanks for the idea. I did check the codegen and I saw that the iota vectors were being loaded twice for both the source vectors, which I felt could be eliminated. So I created a separate implementation for `SelectFromTwoVector` that contains the code for both the `VectorRearrange` and the `VectorBlend`, as shown below:
> 
> 
>     lea(rscratch1,
>         ExternalAddress(StubRoutines::aarch64::vector_iota_indices() + 48));
>     ldrq(tmp1, rscratch1);
>     mov(tmp2, T2D, 0x01);
>     andr(tmp3, size1, index, tmp2);
>     cm(EQ, tmp3, size2, tmp1, tmp3);
>     orr(tmp1, T16B, tmp3, tmp3);
>     ext(tmp4, size1, src1, src1, 8);
>     ext(tmp5, size1, src2, src2, 8);
> 
>     cm(GE, dst, size2, tmp2, index);
>     bsl(tmp3, size1, src1, tmp4);
>  
>     bsl(tmp1, size1, src2, tmp5);
> 
>     bsl(dst, size1, tmp3, tmp1);
> 
> 
> 
> I have rearranged the instructions and used `tmp5` (I could have reused `tmp4` in the second `ext`) to allow for more ILP.
> 
> This implementation is certainly better than my previous implementation, by ~23% for `double` and 31% for `long`, but its performance is not much different from the default implementation (`VectorRearrange` + `VectorBlend`). For `double`, the performance is exactly the same, and for `long` it is 0.97x. I collected some perf numbers for the cases with and without this patch. My implementation certainly executes fewer instructions than the default implementation, but there is more ILP in the default implementation, so the default's performance is either better than or the same as that of my implementation. I feel we can use the default implementation for `double` and `long`. WDYT?

That's fine with me. Thanks for your testing! Using the mid-end IR pattern looks better, since it may open up other mid-end optimization opportunities in some cases.
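
For anyone following the thread, here is a rough scalar model of why the default pattern and a direct two-vector select should agree for the 2-lane (`long`/`double`) case. It is only a sketch, not HotSpot code: the names `rearrange`, `blend`, `select_default` and `select_direct` are hypothetical, and it assumes indices are wrapped to `[0, 2 * lanes)`, with the lower half selecting from the first source and the upper half from the second.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of a 128-bit vector holding two 64-bit lanes (the 2D / 2L case).
constexpr std::size_t kLanes = 2;
using Vec  = std::array<std::int64_t, kLanes>;
using Idx  = std::array<std::uint64_t, kLanes>;
using Mask = std::array<bool, kLanes>;

// Single-source rearrange: out[i] = src[index[i] mod kLanes].
Vec rearrange(const Vec& src, const Idx& index) {
    Vec out{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        out[i] = src[index[i] & (kLanes - 1)];
    }
    return out;
}

// Blend: take b[i] where the mask is set, otherwise a[i].
Vec blend(const Vec& a, const Vec& b, const Mask& mask) {
    Vec out{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        out[i] = mask[i] ? b[i] : a[i];
    }
    return out;
}

// Default pattern (assumed shape): rearrange both sources with the same
// index, then blend on "wrapped index points into the second source".
Vec select_default(const Vec& src1, const Vec& src2, const Idx& index) {
    const Vec r1 = rearrange(src1, index);
    const Vec r2 = rearrange(src2, index);
    Mask from_second{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        from_second[i] = (index[i] & (2 * kLanes - 1)) >= kLanes;
    }
    return blend(r1, r2, from_second);
}

// Direct two-vector select: one lookup into the concatenation src1:src2.
Vec select_direct(const Vec& src1, const Vec& src2, const Idx& index) {
    Vec out{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        const std::uint64_t w = index[i] & (2 * kLanes - 1);
        out[i] = (w < kLanes) ? src1[w] : src2[w - kLanes];
    }
    return out;
}
```

With `src1 = {10, 11}`, `src2 = {20, 21}` and `index = {3, 0}`, both functions produce `{21, 10}`. The model says nothing about instruction counts or ILP; it only illustrates that the two formulations compute the same result.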

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2125241593