RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v5]

Mon Jun 6 10:44:50 UTC 2022

On Thu, 2 Jun 2022 03:27:59 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:

>> Currently the vector load with mask when the given index happens out of the array boundary is implemented with pure java scalar code to avoid the IOOBE (IndexOutOfBoundaryException). This is necessary for architectures that do not support the predicate feature. Because the masked load is implemented with a full vector load and a vector blend applied on it. And a full vector load will definitely cause the IOOBE which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise exception.
>> 
>> This patch adds the vectorization support for the masked load with IOOBE part. Please see the original java implementation (FIXME: optimize):
>> 
>> 
>>   @ForceInline
>>   public static
>>   ByteVector fromArray(VectorSpecies<Byte> species,
>>                        byte[] a, int offset,
>>                        VectorMask<Byte> m) {
>>   ByteSpecies vsp = (ByteSpecies) species;
>>       if (offset >= 0 && offset <= (a.length - species.length())) {
>>           return vsp.dummyVector().fromArray0(a, offset, m);
>>       }
>> 
>>       // FIXME: optimize
>>       checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>       return vsp.vOp(m, i -> a[offset + i]);
>>   }
>> 
>> Since it can only be vectorized with the predicate load, the hotspot must check whether the current backend supports it and falls back to the java scalar version if not. This is different from the normal masked vector load that the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler make the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part while "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that normal java path will be executed.
>> 
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>> 
>> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` on the x86 AVX-512 system:
>> 
>> Benchmark                                          before   After  Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE   737.542 1387.069 ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366  330.776 ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE  233.832 6125.026 ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE    233.816 7075.923 ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE   119.771  330.587 ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE  431.961  939.301 ops/ms
>> 
>> Similar performance gain can also be observed on 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits:
> 
>  - Merge branch 'jdk:master' into JDK-8283667
>  - Use integer constant for offsetInRange all the way through
>  - Rename "use_predicate" to "needs_predicate"
>  - Rename the "usePred" to "offsetInRange"
>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature

test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java line 97:

> 95:     public void byteLoadArrayMaskIOOBE() {
> 96:         for (int i = 0; i < inSize; i += bspecies.length()) {
> 97:             VectorMask<Byte> mask = VectorMask.fromArray(bspecies, m, i);

For other case "if (offset >= 0 && offset <= (a.length - species.length())) )" we are anyways intrinsifying, should we limit this micro to work only for newly optimized case.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8035