RFR: 8351623: VectorAPI: Add SVE implementation of subword gather load operation

Thu Jul 17 02:43:48 UTC 2025

On Thu, 17 Jul 2025 01:20:44 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

> > > * case-2: 2 times of gather and merge
> > >   
> > >   * Can be refined. But the `LoadVectorGatherNode` should be changed to accept 2 `index` vectors.
> > > * case-3: 4 times of gather and merge (only for byte)
> > >   
> > >   * Can be refined. We can implement it just like:
> > >     step-1:  `v1 = gather1 + gather2 + 2 * uzp1`                        // merging the first and second gather-loads
> > >     step-2:  `v2 = gather3 + gather4 + 2 * uzp1`                        // merging the third and fourth gather-loads
> > >     step-3:  `v3 = slice (v2, v2)`,   `v = or(v1, v3)`                     // do the final merging
> > >     We have to change `LoadVectorGatherNode` as well. At least making it accept 2 `index` vectors.
> > > 
> > > In summary, `LoadVectorGatherNode` will be more complex than before. But the good thing is that giving it one more `index` input is OK. I'm not sure whether this is applicable to other architectures such as RVV, but I can try this change. Do you have a better idea? Thanks!
> > 
> > 
> > @XiaohongGong thanks for your reply.
> > This idea generally looks good to me.
> > For case-2, we have
> > ```
> > gather1 + gather2 + uzp1:
> > [0a 0a 0a 0a 0a 0a 0a 0a]
> > [0b 0b 0b 0b 0b 0b 0b 0b]
> > uzp1.H  => 
> > [bb bb bb bb aa aa aa aa]
> > ```
> > 
> > Can we improve `case-3` by following the pattern of `case-2`?
> > ```
> > step-1:  v1 = gather1 + gather2 + uzp1 
> > [000a 000a 000a 000a 000a 000a 000a 000a]
> > [000b 000b 000b 000b 000b 000b 000b 000b]
> > uzp1.H => [0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a]
> > 
> > step-2:  v2 = gather3 + gather4 + uzp1 
> > [000c 000c 000c 000c 000c 000c 000c 000c]
> > [000d 000d 000d 000d 000d 000d 000d 000d]
> > uzp1.H => [0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c]
> > 
> > step-3:  v3 = uzp1 (v1, v2)
> > [0b0b 0b0b 0b0b 0b0b 0a0a 0a0a 0a0a 0a0a]
> > [0d0d 0d0d 0d0d 0d0d 0c0c 0c0c 0c0c 0c0c]
> > uzp1.B => [dddd dddd cccc cccc bbbb bbbb aaaa aaaa]
> > ```
> > 
> > Then we can also consistently define the semantics of `LoadVectorGatherNode` as `gather1 + gather2 + uzp1.H`, which would make the backend much cleaner. WDYT?
> 
> Thanks! Regarding the definition of `LoadVectorGatherNode`, we'd better keep the vector type as it is for byte and short vectors. The SVE gather-load instruction needs the type information. Additionally, the vector layout of the result should match the vector type, right? We can handle this easily with a pure backend implementation, but it does not seem easy at the mid-end IR level. BTW, `uzp1` is an SVE-specific instruction, so we'd better define a common IR node for it, which would also be useful for other platforms that want to support the subword gather API, right? I'm not sure whether this makes sense. I will consider this suggestion.

Maybe I can define the vector type of `LoadVectorGatherNode` as an int vector type for subword types. An additional flag is necessary to denote whether it is a byte or short load. The node then only performs the gather operation (without any truncation). We can then define an IR node like `VectorConcateNode` to merge all the gather results; it can merge either two gathers or four gathers. For cases where only one gather is needed, we can just return a type cast node like `VectorCastI2X`. It seems this will make the IR more generic and the code cleaner. WDYT?
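
As a rough illustration of the merge semantics discussed above, here is a scalar Java model of the four-gather byte case (case-3). It is only a sketch: the class and method names (`SubwordGatherSketch`, `gatherAsInts`, `uzp1`, `gatherBytes`) and the 256-bit vector length are assumptions made for the example, not code from the PR. Vectors are modelled as raw byte arrays in little-endian lane order.

```java
import java.util.Arrays;

public class SubwordGatherSketch {
    // Assumption for illustration: a 256-bit vector, i.e. 8 int lanes.
    static final int INT_LANES = 8;

    // Models one int-typed gather without truncation: each int lane holds one
    // byte loaded from src; the upper three bytes of the lane stay zero.
    static byte[] gatherAsInts(byte[] src, int[] index, int from) {
        byte[] vec = new byte[INT_LANES * 4];
        for (int i = 0; i < INT_LANES; i++) {
            vec[4 * i] = src[index[from + i]];
        }
        return vec;
    }

    // Models uzp1 with elements of elemBytes bytes: the even-numbered elements
    // of x fill the low half of the result, those of y fill the high half.
    static byte[] uzp1(byte[] x, byte[] y, int elemBytes) {
        int half = x.length / elemBytes / 2;
        byte[] out = new byte[x.length];
        for (int i = 0; i < half; i++) {
            System.arraycopy(x, 2 * i * elemBytes, out, i * elemBytes, elemBytes);
            System.arraycopy(y, 2 * i * elemBytes, out, (half + i) * elemBytes, elemBytes);
        }
        return out;
    }

    // case-3: four int gathers merged with two uzp1.H and one uzp1.B.
    static byte[] gatherBytes(byte[] src, int[] index) {
        byte[] v1 = uzp1(gatherAsInts(src, index, 0),
                         gatherAsInts(src, index, INT_LANES), 2);
        byte[] v2 = uzp1(gatherAsInts(src, index, 2 * INT_LANES),
                         gatherAsInts(src, index, 3 * INT_LANES), 2);
        return uzp1(v1, v2, 1);
    }

    public static void main(String[] args) {
        byte[] src = new byte[64];
        for (int i = 0; i < src.length; i++) src[i] = (byte) (i * 3 + 1);
        int[] index = new int[4 * INT_LANES];
        for (int i = 0; i < index.length; i++) index[i] = (i * 7) % src.length;

        // Reference result: a plain scalar gather of bytes.
        byte[] expected = new byte[index.length];
        for (int i = 0; i < index.length; i++) expected[i] = src[index[i]];

        System.out.println(Arrays.equals(gatherBytes(src, index), expected));
    }
}
```

In this model the merged result has the same lane order as a plain byte gather. That is the property a common merge node would need to guarantee, whatever instruction a given backend uses to implement it.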

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26236#issuecomment-3082248439