RFR: 8355563: VectorAPI: Refactor current implementation of subword gather load API

Xiaohong Gong xgong at openjdk.org
Tue May 20 02:24:51 UTC 2025


On Mon, 19 May 2025 03:10:46 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> JDK-8318650 introduced hotspot intrinsification of subword gather load APIs for X86 platforms [1]. However, the current implementation is not optimal for AArch64 SVE platform, which natively supports vector instructions for subword gather load operations using an int vector for indices (see [2][3]).
>> 
>> Two key areas require improvement:
>> 1. At the Java level, vector indices generated for range validation could be reused for the subsequent gather load operation on architectures with native vector instructions like AArch64 SVE. However, the current implementation prevents compiler reuse of these index vectors due to divergent control flow, potentially impacting performance.
>> 2. At the compiler IR level, the additional `offset` input for `LoadVectorGather`/`LoadVectorGatherMasked` with subword types increases IR complexity and complicates backend implementation. Furthermore, generating `add` instructions before each memory access negatively impacts performance.
>> 
>> This patch refactors the implementation at both the Java level and compiler mid-end to improve efficiency and maintainability across different architectures.
>> 
>> Main changes:
>> 1. Java-side API refactoring:
>>    - Explicitly passes generated index vectors to hotspot, eliminating duplicate index vectors for gather load instructions on
>>      architectures like AArch64.
>> 2. C2 compiler IR refactoring:
>>    - Refactors `LoadVectorGather`/`LoadVectorGatherMasked` IR for subword types by removing the memory offset input and incorporating it into the memory base `addr` at the IR level. This simplifies backend implementation, reduces add operations, and unifies the IR across all types.
>> 3. Backend changes:
>>    - Streamlines X86 implementation of subword gather operations following the removal of the offset input from the IR level.
>> 
>> Performance:
>> The relevant JMH benchmarks improve by up to 27% on an X86 AVX512 system. Please see the data below:
>> 
>> Benchmark                                                 Mode   Cnt Unit    SIZE    Before      After    Gain
>> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  64    53682.012   52650.325  0.98
>> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  256   14484.252   14255.156  0.98
>> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  1024   3664.900    3595.615  0.98
>> GatherOperationsBenchmark.microByteGather128              thrpt  30  ops/ms  4096    908.31...
>
> Ping again~ Could anyone please take a look at this PR? Thanks a lot!

> Hi @XiaohongGong, very nice work! Looks good to me; I will do some testing and get back.
> 
> Do you have any idea about following regression?
> 
> ```
> GatherOperationsBenchmark.microByteGather256              thrpt  30  ops/ms  64    55844.814   48311.847  0.86
> GatherOperationsBenchmark.microByteGather256              thrpt  30  ops/ms  256   15139.459   13009.848  0.85
> GatherOperationsBenchmark.microByteGather256              thrpt  30  ops/ms  1024   3861.834    3284.944  0.85
> GatherOperationsBenchmark.microByteGather256              thrpt  30  ops/ms  4096    938.665     817.673  0.87
> ```
> 
> Best Regards

Yes, I also observed this regression. After analyzing it, I found it was caused by the Java-side changes, which influence the range check elimination inside `IntVector.fromArray()` and `ByteVector.intoArray()` in the benchmark. The root cause is that the counted loop in the following benchmark case is not recognized by the compiler as expected:

```
public void microByteGather256() {
    for (int i = 0; i < SIZE; i += B256.length()) {
        ByteVector.fromArray(B256, barr, 0, index, i)
                  .intoArray(bres, i);
    }
}
```
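For context, the `ByteVector.fromArray(species, array, offset, indexMap, mapOffset)` overload used above is the subword gather API under discussion: each lane loads `array[offset + indexMap[mapOffset + lane]]`. A minimal self-contained sketch of its use (array sizes and contents are illustrative, not taken from the benchmark; run with `--add-modules jdk.incubator.vector`):

```
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

public class GatherSketch {
    static final VectorSpecies<Byte> B256 = ByteVector.SPECIES_256;

    public static void main(String[] args) {
        final int SIZE = 1024;
        byte[] barr = new byte[SIZE];   // gather source
        byte[] bres = new byte[SIZE];   // gather destination
        int[] index = new int[SIZE];    // per-lane index map
        for (int i = 0; i < SIZE; i++) {
            barr[i] = (byte) i;
            index[i] = SIZE - 1 - i;    // e.g. reverse the array via gather
        }
        for (int i = 0; i < SIZE; i += B256.length()) {
            // Each lane loads barr[0 + index[i + lane]]; the resulting
            // vector is then stored contiguously into bres at offset i.
            ByteVector.fromArray(B256, barr, 0, index, i)
                      .intoArray(bres, i);
        }
        // bres now holds the contents of barr in reversed order.
    }
}
```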
The loop iv phi node is not recognized when C2 matches the counted loop pattern, because in this case it is cast twice with `CastII`. The ideal graph looks like:

```
      Loop
        \
         \    /------------+
          \  /             |
           Phi             |
            |              |
          CastII           |
            |              |
          CastII           |
            |              |
             \    ConI     |
              \    |       |
              AddI --------+
```

The relevant code is https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopnode.cpp#L1667.

Before the change, the graph was:

```
      Loop
        \
         \    /------------+
          \  /             |
           Phi             |
            |              |
          CastII           |
            |              |
             \    ConI     |
              \    |       |
              AddI --------+
```
The difference comes from the index generation in `ByteVector.fromArray()` (i.e. the call to `IntVector.fromArray()` on the Java side). Before, the `i` in the loop above was not used directly by `IntVector.fromArray()`; instead, it was consumed by another loop phi node, so no additional `CastII` was generated. After my change, the loop in the Java implementation of `ByteVector.fromArray()` is removed, and `i` is used directly by `IntVector.fromArray()`, where the bounds check performed before loading the indexes generates another `CastII`.
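To make the shape difference concrete, here is a simplified, hypothetical sketch of the two index-loading patterns (the helper names and bodies are illustrative, not the actual `ByteVector`/`IntVector` internals):

```
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

class IndexLoadShape {
    static final VectorSpecies<Integer> ISPECIES = IntVector.SPECIES_256;

    // Before: indexes were assembled by a separate helper loop, so the
    // caller's induction variable fed that inner loop's phi rather than a
    // range-checked vector load. Only one CastII appears on the outer iv.
    static IntVector loadIndexesBefore(int[] indexMap, int mapOffset) {
        int[] tmp = new int[ISPECIES.length()];
        for (int j = 0; j < tmp.length; j++) {    // inner loop phi consumes mapOffset
            tmp[j] = indexMap[mapOffset + j];
        }
        return IntVector.fromArray(ISPECIES, tmp, 0); // constant offset: no extra CastII
    }

    // After: the caller's induction variable flows straight into the
    // range-checked IntVector.fromArray(), whose bounds check wraps the
    // loop phi in a second CastII.
    static IntVector loadIndexesAfter(int[] indexMap, int mapOffset) {
        return IntVector.fromArray(ISPECIES, indexMap, mapOffset);
    }
}
```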

When recognizing the loop iv phi node, C2 first checks whether the candidate node is a `CastII` and, if so, takes the `CastII`'s input. But this unwrapping happens only once; see https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopnode.cpp#L1659:

```
  if (xphi->Opcode() == Op_Cast(iv_bt)) {
    xphi = xphi->in(1);
  }
```


Once the `PhiNode` is cast more than once, the pattern is not recognized. I think we should refine the counted loop recognition logic by changing the `if` above to a `while`, so that the iv phi node is still recognized. A potential change would be:

```
  while (xphi->Opcode() == Op_Cast(iv_bt)) {
    xphi = xphi->in(1);
  }
```


I've tested this change and found that the regressed benchmarks recover their previous performance. Since I'm not very familiar with C2's loop transformation code, I'd prefer to investigate this issue further and possibly fix it in a follow-up patch. Any suggestions?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25138#issuecomment-2892720654

