RFR: 8348868: AArch64: Add backend support for SelectFromTwoVector [v2]

Fri Jun 13 15:20:59 UTC 2025

On Tue, 3 Jun 2025 09:09:21 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> Hi @XiaohongGong , I just got back to working on this PR again!
>> I have been trying to implement this operation for Doubles/Longs but the performance is 0.8x that of the default implementation (with two vector rearranges and a vector blend). The implementation using `bsl` that I used is given below - 
>> 
>> 
>>     dup(tmp1, T2D, src1, 0);
>>     dup(tmp2, T2D, src1, 1);
>> 
>>     mov(tmp3, T2D, 0x01);
>>     andr(tmp4, T16B, index, tmp3);
>>     negr(tmp4, T2D, tmp4);
>>     orr(tmp5, T16B, tmp4, tmp4);
>> 
>>     bsl(tmp4, T16B, tmp2, tmp1);
>> 
>>     dup(tmp1, T2D, src2, 0);
>>     dup(tmp2, T2D, src2, 1);
>> 
>>     bsl(tmp5, T16B, tmp2, tmp1);
>> 
>>     sshr(dst, T2D, index, 1);
>>     andr(dst, T16B, dst, tmp3);
>>     negr(dst, T2D, dst);
>> 
>>     bsl(dst, T16B, tmp5, tmp4);
>> 
>> 
>> 
>> This relies on the fact that the index vector can only contain the values 0 to 3. Bit 0 of each index selects the first (0) or second (1) double/long, and bit 1 selects the source (0 for src1, 1 for src2):
>>     index = 00 -> choose first double/long of src1
>>             01 -> choose second double/long of src1
>>             10 -> choose first double/long of src2
>>             11 -> choose second double/long of src2
>>
>> I am not able to avoid duplicating the source elements. 
>> Would it be ok if I do not support SelectFromTwoVector for doubles/longs or do you have any suggestion on how I can improve my implementation?
>
> Oh, I forgot that we have the `blend + rearrange` pattern if this op is not supported directly. Since `VectorRearrange` for 2D has been implemented now, did you check the final codegen of the default pattern? I think we can first revisit the codegen of the default pattern (i.e. `VectorBlend + VectorRearrange + VectorRearrange`) and see whether there is any further opportunity for improvement. If so, we can implement the `SelectFromTwoVector` op directly based on that improvement. Otherwise, just keeping the default pattern is fine with me.

Hi @XiaohongGong , thanks for the idea. I did check the codegen and saw that the iota vector was being loaded twice, once for each source vector, which I felt could be eliminated. So I created a separate implementation for `SelectFromTwoVector` containing the code for both `VectorRearrange` and `VectorBlend`, as shown below:

    lea(rscratch1,
        ExternalAddress(StubRoutines::aarch64::vector_iota_indices() + 48));
    ldrq(tmp1, rscratch1);
    mov(tmp2, T2D, 0x01);
    andr(tmp3, size1, index, tmp2);
    cm(EQ, tmp3, size2, tmp1, tmp3);
    orr(tmp1, T16B, tmp3, tmp3);
    ext(tmp4, size1, src1, src1, 8);
    ext(tmp5, size1, src2, src2, 8);

    cm(GE, dst, size2, tmp2, index);
    bsl(tmp3, size1, src1, tmp4);

    bsl(tmp1, size1, src2, tmp5);

    bsl(dst, size1, tmp3, tmp1);

I have rearranged the instructions and used `tmp5` (I could have reused `tmp4` in the second `ext`) to allow for more ILP.
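
For reference, here is a scalar C++ model of the sequence above, as a minimal sketch rather than HotSpot code. It assumes two 64-bit lanes, index values in the range 0 to 3, and that the table at `vector_iota_indices() + 48` holds the lane numbers {0, 1}. The names `Vec2`, `select_reference`, `select_model` and `models_agree` are hypothetical and exist only for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec2 = std::array<std::int64_t, 2>;

// Expected semantics: lane i of the result is element index[i] of the
// 4-element concatenation [src1[0], src1[1], src2[0], src2[1]].
Vec2 select_reference(const Vec2& index, const Vec2& src1, const Vec2& src2) {
    Vec2 dst{};
    for (std::size_t i = 0; i < 2; ++i) {
        const Vec2& src = (index[i] & 2) ? src2 : src1;        // bit 1 picks the source
        dst[i] = src[static_cast<std::size_t>(index[i] & 1)];  // bit 0 picks the element
    }
    return dst;
}

// Lane-wise model of the cm/ext/bsl sequence (assumed iota table = {0, 1}).
Vec2 select_model(const Vec2& index, const Vec2& src1, const Vec2& src2) {
    Vec2 dst{};
    for (std::size_t i = 0; i < 2; ++i) {
        const std::int64_t iota = static_cast<std::int64_t>(i);
        // cm(EQ, iota, index & 1): true if this lane keeps its own element.
        const bool keep = (iota == (index[i] & 1));
        // bsl(mask, src, ext(src, src, 8)): own element or the swapped one.
        const std::int64_t r1 = keep ? src1[i] : src1[1 - i];
        const std::int64_t r2 = keep ? src2[i] : src2[1 - i];
        // cm(GE, 1, index): true if the index refers to src1.
        const bool from_src1 = (1 >= index[i]);
        // Final bsl: blend the two rearranged sources.
        dst[i] = from_src1 ? r1 : r2;
    }
    return dst;
}

// Compares both models over all 16 combinations of in-range indices.
bool models_agree() {
    const Vec2 src1{10, 11};
    const Vec2 src2{20, 21};
    for (std::int64_t a = 0; a < 4; ++a) {
        for (std::int64_t b = 0; b < 4; ++b) {
            const Vec2 index{a, b};
            if (select_reference(index, src1, src2) != select_model(index, src1, src2)) {
                return false;
            }
        }
    }
    return true;
}
```

Calling `models_agree()` checks that the model matches the expected lane selection for every in-range index pair. It says nothing about out-of-range indices or about performance.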

This implementation is certainly better than my previous one, by ~23% for `double` and 31% for `long`, but its performance is not much different from the default implementation (`VectorRearrange` + `VectorBlend`): for `double` it is exactly the same, and for `long` it is 0.97x. I collected some perf numbers with and without this patch. My implementation executes fewer instructions than the default implementation, but the default implementation has more ILP, so its performance is either better than or the same as mine. I feel we can use the default implementation for `double` and `long`. WDYT?

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23570#discussion_r2124204144