RFR: 8293198: [vectorapi] Improve the implementation of VectorMask.indexInRange() [v2]

Tue Feb 7 02:29:45 UTC 2023

On Mon, 6 Feb 2023 17:39:42 GMT, Paul Sandoz <psandoz at openjdk.org> wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Add smaller array size for benchmark tests
>
> I think it would be useful to adjust the naming and comments of some methods to make it clearer the method parameter constraints.
> 
> `indexInRange0Helper` is now called if the index is partially or totally out of range at the lower or upper ends and `indexInRange0` is called if partially or totally out of range at the upper end.
> e.g. a more literal naming could be:
> `AbstractMask::indexInRange0Helper` -> `AbstractMask::indexPartiallyInRangeHelper`
> `VectorSupport::indexInRange` -> VectorSupport::indexPartiallyInUpperRange`
> ?
> 
> IIUC the performance numbers show that when the array is not a multiple of the vector size there is still quite an impact overall to calling `VectorSupport::indexInRange` for the last loop iteration. I am guessing the overall loop shape is different which impacts other optimizations?
> 
> To do this more optimally likely requires a loop transformation where the last loop iteration is peeled off, but that's a harder transformation in one of the more complicated areas of C2 (although it already supports pre/post loop, so maybe its possible to leverage that?).

Thanks for looking at this PR @PaulSandoz !

> I think it would be useful to adjust the naming and comments of some methods to make it clearer the method parameter constraints.
> 
> `indexInRange0Helper` is now called if the index is partially or totally out of range at the lower or upper ends and `indexInRange0` is called if partially or totally out of range at the upper end. e.g. a more literal naming could be: `AbstractMask::indexInRange0Helper` -> `AbstractMask::indexPartiallyInRangeHelper` `VectorSupport::indexInRange` -> VectorSupport::indexPartiallyInUpperRange` ?

The renaming looks good to me. Thanks!

> IIUC the performance numbers show that when the array is not a multiple of the vector size there is still quite an impact overall to calling `VectorSupport::indexInRange` for the last loop iteration. I am guessing the overall loop shape is different which impacts other optimizations?

I think the main influence of the benchmark result comes from the masked ` fromArray()/intoArray()` APIs, especially the masked intoArray() API. For the tail loop part, there is the vector boxing needed on all architectures, with the following reasons:

- If the architecture doesn't support predicate feature, it cannot be intrinsified.
- The `checkMaskFromIndexSize` called inside the `else->if` branch may not be inlined, and the `indexInRange()` generated mask `m` needs the boxing before it.

public final
    void intoArray(double[] a, int offset,
                   VectorMask<Double> m) {
        if (m.allTrue()) {
            intoArray(a, offset);
        } else {
            DoubleSpecies vsp = vspecies();
            if (!VectorIntrinsics.indexInRange(offset, vsp.length(), a.length)) {
                checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
            }
            intoArray0(a, offset, m);
        }
    }

If the array size is aligned with the vector size, the generated `m` is all true. Hence, the non-masked `intoArray()` is called instead, which improves the performance a lot.

Regarding to the `indexInRange()` API implementation, if the array size is the multiple num of vector size, the branch for the tail loop part is optimized out to an uncommon-trap by C2 compiler, which may improves the performance as well. 

Regarding to the added benchmark, since it is a testing for `indexInRange`, maybe we can remove the calling to the masked `fromArray()/intoArray()` APIs and directly save the mask into a boolean array instead. I guess this may reduce the overall performance gap. 

> 
> To do this more optimally likely requires a loop transformation where the last loop iteration is peeled off, but that's a harder transformation in one of the more complicated areas of C2 (although it already supports pre/post loop, so maybe its possible to leverage that?).

Yes, it is!

-------------

PR: https://git.openjdk.org/jdk/pull/12064