RFR: 8293198: [vectorapi] Improve the implementation of VectorMask.indexInRange() [v2]
Xiaohong Gong
xgong at openjdk.org
Tue Feb 7 02:29:45 UTC 2023
On Mon, 6 Feb 2023 17:39:42 GMT, Paul Sandoz <psandoz at openjdk.org> wrote:
>> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Add smaller array size for benchmark tests
>
> I think it would be useful to adjust the naming and comments of some methods to make it clearer the method parameter constraints.
>
> `indexInRange0Helper` is now called if the index is partially or totally out of range at the lower or upper ends and `indexInRange0` is called if partially or totally out of range at the upper end.
> e.g. a more literal naming could be:
> `AbstractMask::indexInRange0Helper` -> `AbstractMask::indexPartiallyInRangeHelper`
> `VectorSupport::indexInRange` -> `VectorSupport::indexPartiallyInUpperRange`
> ?
>
> IIUC the performance numbers show that when the array is not a multiple of the vector size there is still quite an impact overall to calling `VectorSupport::indexInRange` for the last loop iteration. I am guessing the overall loop shape is different which impacts other optimizations?
>
> To do this more optimally likely requires a loop transformation where the last loop iteration is peeled off, but that's a harder transformation in one of the more complicated areas of C2 (although it already supports pre/post loop, so maybe it's possible to leverage that?).
Thanks for looking at this PR @PaulSandoz !
> I think it would be useful to adjust the naming and comments of some methods to make it clearer the method parameter constraints.
>
> `indexInRange0Helper` is now called if the index is partially or totally out of range at the lower or upper ends and `indexInRange0` is called if partially or totally out of range at the upper end. e.g. a more literal naming could be: `AbstractMask::indexInRange0Helper` -> `AbstractMask::indexPartiallyInRangeHelper`, `VectorSupport::indexInRange` -> `VectorSupport::indexPartiallyInUpperRange` ?
The renaming looks good to me. Thanks!
> IIUC the performance numbers show that when the array is not a multiple of the vector size there is still quite an impact overall to calling `VectorSupport::indexInRange` for the last loop iteration. I am guessing the overall loop shape is different which impacts other optimizations?
I think the benchmark result is influenced mainly by the masked `fromArray()/intoArray()` APIs, especially the masked `intoArray()` API. For the tail loop iteration, vector boxing is needed on all architectures, for the following reasons:
- If the architecture does not support the predicate feature, the masked API cannot be intrinsified.
- The `checkMaskFromIndexSize` call inside the `else->if` branch may not be inlined, so the mask `m` generated by `indexInRange()` needs to be boxed before that call.
    public final
    void intoArray(double[] a, int offset,
                   VectorMask<Double> m) {
        if (m.allTrue()) {
            intoArray(a, offset);
        } else {
            DoubleSpecies vsp = vspecies();
            if (!VectorIntrinsics.indexInRange(offset, vsp.length(), a.length)) {
                checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
            }
            intoArray0(a, offset, m);
        }
    }
If the array size is aligned with the vector size, the generated `m` is all true. Hence, the non-masked `intoArray()` is called instead, which improves the performance a lot.
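To make the aligned versus unaligned difference concrete, here is a minimal scalar sketch. It is not the Vector API and not the JDK implementation; it only models the documented rule that lane N of the mask is set when `offset + N` lies in `[0, limit)`. The class and method names (`IndexInRangeModel`, `maskedIterations`) and the lane count of 8 are assumptions made for illustration.

```java
import java.util.Arrays;

// Scalar model only: every name in this class is hypothetical.
final class IndexInRangeModel {

    // Lane i is set when offset + i lies inside [0, limit).
    static boolean[] indexInRange(int offset, int limit, int vlen) {
        boolean[] mask = new boolean[vlen];
        for (int i = 0; i < vlen; i++) {
            long idx = (long) offset + i; // long avoids int overflow
            mask[i] = idx >= 0 && idx < limit;
        }
        return mask;
    }

    // Models the m.allTrue() check at the top of the masked intoArray().
    static boolean allTrue(boolean[] mask) {
        for (boolean b : mask) {
            if (!b) {
                return false;
            }
        }
        return true;
    }

    // Counts the loop iterations that would fall through to the masked path.
    static int maskedIterations(int length, int vlen) {
        int count = 0;
        for (int i = 0; i < length; i += vlen) {
            if (!allTrue(indexInRange(i, length, vlen))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(maskedIterations(1024, 8)); // aligned length
        System.out.println(maskedIterations(1027, 8)); // unaligned length
        System.out.println(Arrays.toString(indexInRange(1024, 1027, 8)));
    }
}
```

In this model an aligned length never reaches the masked path, while an unaligned length reaches it exactly once, on the final iteration.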
Regarding the `indexInRange()` API implementation, if the array size is a multiple of the vector size, the branch for the tail loop part is optimized out to an uncommon trap by the C2 compiler, which may improve the performance as well.
Regarding the added benchmark, since it is a test for `indexInRange`, maybe we can remove the calls to the masked `fromArray()/intoArray()` APIs and directly save the mask into a boolean array instead. I guess this may reduce the overall performance gap.
>
> To do this more optimally likely requires a loop transformation where the last loop iteration is peeled off, but that's a harder transformation in one of the more complicated areas of C2 (although it already supports pre/post loop, so maybe it's possible to leverage that?).
Yes, it is!
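
For illustration only, the shape of such a peeled loop can be written by hand at the source level. The sketch below is a scalar model, not C2's transformation and not Vector API code: the main loop uses an unmasked block store, and the single remaining partial block goes through a per-lane check. The names `PeeledTailModel`, `storeFull`, `storeMasked` and `copyPeeled`, and the lane count `VLEN = 8`, are assumptions.

```java
// Scalar model of a loop whose last (partial) iteration is peeled off.
// Every name in this class is hypothetical.
final class PeeledTailModel {
    static final int VLEN = 8; // assumed lane count

    // Stands in for the unmasked store: copies a full block, no lane checks.
    static void storeFull(double[] src, double[] dst, int offset) {
        System.arraycopy(src, offset, dst, offset, VLEN);
    }

    // Stands in for the masked store: only lanes below dst.length are written.
    static void storeMasked(double[] src, double[] dst, int offset) {
        for (int lane = 0; lane < VLEN; lane++) {
            int idx = offset + lane;
            if (idx < dst.length) {
                dst[idx] = src[idx];
            }
        }
    }

    // Copies src into dst and returns how many masked stores were executed.
    static int copyPeeled(double[] src, double[] dst) {
        if (src.length != dst.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        int bound = dst.length - dst.length % VLEN; // largest multiple of VLEN
        int i = 0;
        for (; i < bound; i += VLEN) {
            storeFull(src, dst, i); // main loop: no mask needed
        }
        int masked = 0;
        if (i < dst.length) {
            storeMasked(src, dst, i); // peeled tail: at most one iteration
            masked++;
        }
        return masked;
    }
}
```

The point of the split is that the main loop body never sees a mask at all, so the masked path and its boxing cost are confined to at most one call outside the hot loop.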
-------------
PR: https://git.openjdk.org/jdk/pull/12064