RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation
Xiaohong Gong
xgong at openjdk.org
Wed Jul 16 06:46:45 UTC 2025
On Wed, 16 Jul 2025 05:54:13 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>>> Hi @XiaohongGong , thank you for doing this. As for testing, we can currently only test on 256-bit SVE machines (we no longer have any 512bit machines). We will get back to you with the results soon.
>>
>> Testing on 256-bit SVE machines are fine to me. Thanks so much for your help!
>
>> @XiaohongGong thanks for your work! Tier1 - tier3 passed on `256-bit sve` machine without new failures.
>
> Good! Thanks so much for your help!
> @XiaohongGong Please correct me if I’m missing something or got anything wrong.
>
> Taking `short` on `512-bit` machine as an example, these instructions would be generated:
>
> ```
> // vgather
> sve_dup vtmp, 0
> sve_load_0 => [0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a]
> sve_uzp1 with vtmp => [00 00 00 00 00 00 00 00 aa aa aa aa aa aa aa aa]
>
> // vgather1
> sve_dup vtmp, 0
> sve_load_1 => [0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b]
> sve_uzp1 with vtmp => [00 00 00 00 00 00 00 00 bb bb bb bb bb bb bb bb]
>
> // Slice vgather1, vgather1
> ext => [bb bb bb bb bb bb bb bb 00 00 00 00 00 00 00 00]
>
> // Or vgather, vslice
> sve_orr => [bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa]
> ```
>
> Actually, we can get the target result directly by `uzp1` the output from `sve_load_0` and `sve_load_1`, like
>
> ```
> [0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a]
> [0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b 0b]
> uzp1 =>
> [bb bb bb bb bb bb bb bb aa aa aa aa aa aa aa aa]
> ```
>
> If so, the current design of `LoadVectorGather` may not be sufficiently low-level to suit `AArch64`. WDYT?
Yes, you are right! This works for narrowing and merging two gather-load results. But we have to consider the other scenarios as well: 1) no merging is needed, and 2) four gather loads plus merging are needed. Additionally, `LoadVectorGatherNode` has to make sense for all scenarios and across different architectures.
To keep the IR itself simple and unify the inputs for all types across architectures, I chose to pass a single `index` to it for now, and defined one `LoadVectorGatherNode` as performing exactly one gather load with that `index`. The element type of the result should be the subword type, so a following type narrowing is needed anyway. I think this makes sense for a single gather-load operation on subword types, right?
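Just for context, here is a minimal Java-level sketch of the operation being compiled. The `fromArray` overload with an index map is the Vector API gather entry point; the species choice and the class/field names are only illustrative.

```
// Minimal sketch of a subword gather at the Java level (illustrative only;
// compile and run with --add-modules jdk.incubator.vector).
import jdk.incubator.vector.ShortVector;
import jdk.incubator.vector.VectorSpecies;

public class SubwordGatherSketch {
    static final VectorSpecies<Short> S = ShortVector.SPECIES_PREFERRED;

    // Loads S.length() shorts, element i coming from a[offset + indexMap[i]].
    // This is the operation discussed above: the hardware gather fills
    // word-sized lanes, so a narrowing (and possibly a merge of several
    // gathers) has to follow.
    static ShortVector gather(short[] a, int offset, int[] indexMap) {
        return ShortVector.fromArray(S, a, offset, indexMap, 0);
    }
}
```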
For cases that need more than one gather, I chose to generate multiple `LoadVectorGatherNode`s and merge the results at the end. I agree this may be less efficient than implementing all the different scenarios with a single `LoadVectorGatherNode`: writing the backend assembly for every scenario could generate better code, but it makes the backend implementation more complex. In addition to the four normal gather cases, we would have to consider the corresponding masked versions and partial cases. Besides, the number of `index` vectors passed to `LoadVectorGatherNode` would differ (e.g. 1, 2 or 4), which makes the IR itself harder to maintain.
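To make the 1/2/4 counts concrete, here is a rough rule of thumb. This is my assumption, based on each hardware gather filling 32-bit lanes as in the 512-bit short example quoted above; `gatherCount` is a made-up helper, not code from this PR.

```
// Illustration only: how many gather loads a subword gather would need,
// assuming each hardware gather fills 32-bit lanes (as in the quoted
// 512-bit short example). Not code from the PR.
class GatherCountSketch {
    static int gatherCount(int resultLanes, int hwVectorBits) {
        int lanesPerGather = hwVectorBits / 32;                       // word lanes one gather fills
        return (resultLanes + lanesPerGather - 1) / lanesPerGather;   // ceiling division
    }
    // On a 512-bit machine (16 lanes per gather):
    //   gatherCount(64, 512) == 4   // byte  -> case-3
    //   gatherCount(32, 512) == 2   // short -> case-2
    //   gatherCount(16, 512) == 1   // no merging -> case-1
}
```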
Regarding the refinement based on your suggestion:
- case-1: no merging
  - Not an issue (the current version is fine).
- case-2: two gathers plus a merge
  - Can be refined, but `LoadVectorGatherNode` would have to be changed to accept 2 `index` vectors.
- case-3: four gathers plus a merge (only for byte)
  - Can be refined. We could implement it roughly as follows (see also the scalar sketch after this list):
    - step-1: `v1 = gather1 + gather2 + 2 * uzp1` // merge the first and second gather loads
    - step-2: `v2 = gather3 + gather4 + 2 * uzp1` // merge the third and fourth gather loads
    - step-3: `v3 = slice(v2, v2)`, `v = or(v1, v3)` // do the final merge
  - We would have to change `LoadVectorGatherNode` as well, at least making it accept 2 `index` vectors.
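Here is the scalar sketch mentioned above: what the case-3 merge has to produce in the end, assuming gather-i covers result lanes 16*i .. 16*i+15 as in the quoted 512-bit example. This is only a semantic reference, not the generated backend code.

```
// Scalar reference for the case-3 result (byte, 512-bit vector): four gathers
// of 16 elements each, loaded into word-sized lanes, then narrowed and merged
// into one 64-byte vector. Illustrative only.
class CaseThreeMergeSketch {
    static byte[] merge(int[] g0, int[] g1, int[] g2, int[] g3) {
        byte[] out = new byte[64];
        for (int i = 0; i < 16; i++) {
            out[i]      = (byte) g0[i];   // step-1 narrows/merges g0 and g1 ...
            out[i + 16] = (byte) g1[i];   // ... into the low half (v1)
            out[i + 32] = (byte) g2[i];   // step-2 does the same for g2 and g3 (v2)
            out[i + 48] = (byte) g3[i];   // step-3 slices v2 and ORs it with v1
        }
        return out;
    }
}
```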
In summary, `LoadVectorGatherNode` would be more complex than before, but the good thing is that giving it one more `index` input is OK. I'm not sure whether this is applicable to other architectures such as RVV, but I can try with this change. Do you have a better idea? Thanks!
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3077111123
More information about the hotspot-compiler-dev
mailing list