RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation

Fei Gao fgao at openjdk.org
Wed Jul 16 14:53:50 UTC 2025


On Wed, 16 Jul 2025 06:44:13 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

> * case-2: 2 times of gather and merge
>   
>   * Can be refined. But the `LoadVectorGatherNode` should be changed to accept 2 `index` vectors.
> * case-3: 4 times of gather and merge (only for byte)
>   
>   * Can be refined. We can implement it just like:
>     step-1:  `v1 = gather1 + gather2 + 2 * uzp1`                        // merging the first and second gather-loads
>     step-2:  `v2 = gather3 + gather4 + 2 * uzp1`                        // merging the third and fourth gather-loads
>     step-3:  `v3 = slice (v2, v2)`,   `v = or(v1, v3)`                     // do the final merging
>     We have to change `LoadVectorGatherNode` as well. At least making it accept 2 `index` vectors.
> 
> In summary, `LoadVectorGatherNode` will be more complex than before. But the good thing is, giving it one more `index` input is ok. I'm not sure whether this is applicable to other architectures such as RVV. But I can try with this change. Do you have a better idea? Thanks!

@XiaohongGong thanks for your reply.

This idea generally looks good to me.

For case-2, we have

gather1 + gather2 + uzp1:
[0a 0a 0a 0a ... 0a 0a 0a 0a]
[0b 0b 0b 0b ... 0b 0b 0b 0b]
uzp1.H => [bb bb bb bb ... aa aa aa aa]
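To make the lane movement concrete, here is a small Python model of the case-2 merge. It is only a sketch: `uzp1_h` and `gather_shorts` are hypothetical helper names, a 128-bit vector is assumed, and index 0 is the lowest lane (the diagram above draws the highest lane on the left). `uzp1_h` follows the usual `uzp1` meaning, that is, the even-numbered elements of the first source followed by the even-numbered elements of the second.

```python
def uzp1_h(zn, zm):
    """Model of uzp1 on halfword elements: even elements of zn, then even elements of zm."""
    assert len(zn) == len(zm)
    return zn[0::2] + zm[0::2]


def gather_shorts(values):
    """Model one gather result viewed as halfwords.

    Each loaded short sits in the low half of a 32-bit lane, so every
    lane contributes the pair [value, 0].
    """
    out = []
    for v in values:
        out += [v, 0]
    return out


# Two gathers, each filling four 32-bit lanes with one short per lane.
a = [0xA0, 0xA1, 0xA2, 0xA3]
b = [0xB0, 0xB1, 0xB2, 0xB3]

gather1 = gather_shorts(a)  # [a0, 0, a1, 0, a2, 0, a3, 0]
gather2 = gather_shorts(b)  # [b0, 0, b1, 0, b2, 0, b3, 0]

merged = uzp1_h(gather1, gather2)  # [a0..a3, b0..b3]
```

The zero halves are dropped and the two gathers end up packed back to back, which matches the `[bb ... aa]` picture.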


Can we improve `case-3` by following the pattern of `case-2`?

step-1:  v1 = gather1 + gather2 + uzp1 
[000a 000a 000a 000a … 000a 000a 000a 000a]
[000b 000b 000b 000b … 000b 000b 000b 000b]
uzp1.H => [0b0b 0b0b 0b0b 0b0b … 0a0a 0a0a 0a0a 0a0a]

step-2:  v2 = gather3 + gather4 + uzp1 
[000c 000c 000c 000c … 000c 000c 000c 000c]
[000d 000d 000d 000d … 000d 000d 000d 000d]
uzp1.H => [0d0d 0d0d 0d0d 0d0d … 0c0c 0c0c 0c0c 0c0c]

step-3:  v3 = uzp1 (v1, v2)
[0b0b 0b0b 0b0b 0b0b … 0a0a 0a0a 0a0a 0a0a]
[0d0d 0d0d 0d0d 0d0d … 0c0c 0c0c 0c0c 0c0c]
uzp1.B => [dddd dddd cccc cccc … bbbb bbbb aaaa aaaa]
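The three steps above can be checked with a small Python model. This is a sketch under assumptions: vectors are modeled as byte lists for a 128-bit register, index 0 is the lowest byte (the diagrams draw the highest lane on the left), and `uzp1` and `gather_bytes` are hypothetical helper names, not HotSpot code.

```python
def uzp1(zn, zm, esize):
    """Model of uzp1 on byte lists.

    esize is the element size in bytes (2 for .H, 1 for .B). The result is
    the even-numbered elements of zn followed by those of zm.
    """
    assert len(zn) == len(zm)

    def even_elems(v):
        out = []
        for i in range(0, len(v), 2 * esize):
            out += v[i:i + esize]
        return out

    return even_elems(zn) + even_elems(zm)


def gather_bytes(values):
    """Model one gather result: each byte sits in the low byte of a 32-bit lane."""
    out = []
    for v in values:
        out += [v, 0, 0, 0]
    return out


a = [0xA0, 0xA1, 0xA2, 0xA3]
b = [0xB0, 0xB1, 0xB2, 0xB3]
c = [0xC0, 0xC1, 0xC2, 0xC3]
d = [0xD0, 0xD1, 0xD2, 0xD3]

# step-1 and step-2: the same "gather + gather + uzp1.H" shape as case-2
v1 = uzp1(gather_bytes(a), gather_bytes(b), 2)  # [a0,0,a1,0,...,b0,0,b1,0,...]
v2 = uzp1(gather_bytes(c), gather_bytes(d), 2)  # [c0,0,c1,0,...,d0,0,d1,0,...]

# step-3: one more uzp1 at byte granularity removes the remaining zero bytes
v3 = uzp1(v1, v2, 1)  # [a0..a3, b0..b3, c0..c3, d0..d3]
```

In this model the final vector holds the four gathers in order with no `slice` or `or` needed, and steps 1 and 2 reuse exactly the case-2 pattern.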


Then we can also consistently define the semantics of `LoadVectorGatherNode` as `gather1 + gather2 + uzp1.H`, which would make the backend much cleaner. WDYT?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3078968856


More information about the hotspot-compiler-dev mailing list