RFR: 8293198: [vectorapi] Improve the implementation of VectorMask.indexInRange()
Xiaohong Gong
xgong at openjdk.org
Fri Feb 3 01:47:52 UTC 2023
On Thu, 2 Feb 2023 18:07:59 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Hi, I tried a modified version that uses the old implementation for the tail loop instead of adding the new intrinsics. The code looks like:
>>
>>     public VectorMask<E> indexInRange(int offset, int limit) {
>>         int vlength = length();
>>         if (offset >= 0 && limit - offset >= length()) {
>>             return this;
>>         } else if (offset >= limit) {
>>             return vectorSpecies().maskAll(false);
>>         }
>>
>>         Vector<E> iota = vectorSpecies().zero().addIndex(1);
>>         VectorMask<E> badMask = checkIndex0(offset, limit, iota, vlength);
>>         return this.andNot(badMask);
>>     }
>>
>> I tested the performance of the newly added benchmarks with different vector sizes on NEON/SVE and on x86 AVX2/AVX-512. The results show that the modified version is no better than the current version when the array size is not aligned with the vector size, and the gap is largest for the double/long types at larger array sizes.
>>
>> Here is some raw data on NEON:
>>
>> Benchmark                                 (size)  Mode   Cnt     current    modified  Units
>> IndexInRangeBenchmark.byteIndexInRange       131  thrpt    5    2654.919    2584.423  ops/ms
>> IndexInRangeBenchmark.byteIndexInRange       259  thrpt    5    1830.364    1802.876  ops/ms
>> IndexInRangeBenchmark.byteIndexInRange       515  thrpt    5    1058.548    1073.742  ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     131  thrpt    5     594.920     593.832  ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     259  thrpt    5     308.678     149.279  ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     515  thrpt    5     160.639      74.579  ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      131  thrpt    5    1097.567    1104.008  ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      259  thrpt    5     617.845     606.886  ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      515  thrpt    5     315.978     152.046  ops/ms
>> IndexInRangeBenchmark.intIndexInRange        131  thrpt    5    1165.279    1205.486  ops/ms
>> IndexInRangeBenchmark.intIndexInRange        259  thrpt    5     633.648     631.672  ops/ms
>> IndexInRangeBenchmark.intIndexInRange        515  thrpt    5     315.370     154.144  ops/ms
>> IndexInRangeBenchmark.longIndexInRange       131  thrpt    5     639.840     633.623  ops/ms
>> IndexInRangeBenchmark.longIndexInRange       259  thrpt    5     312.267     152.788  ops/ms
>> IndexInRangeBenchmark.longIndexInRange       515  thrpt    5     163.028      78.150  ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      131  thrpt    5    1834.318    1800.318  ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      259  thrpt    5    1105.695    1094.347  ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      515  thrpt    5     602.442     599.827  ops/ms
>>
>>
>> And the data on SVE with a 256-bit vector size:
>>
>> Benchmark                                 (size)  Mode   Cnt     current    modified  Units
>> IndexInRangeBenchmark.byteIndexInRange       131  thrpt    5   23772.370   22921.113  ops/ms
>> IndexInRangeBenchmark.byteIndexInRange       259  thrpt    5   18930.388   17920.910  ops/ms
>> IndexInRangeBenchmark.byteIndexInRange       515  thrpt    5   13528.610   13282.504  ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     131  thrpt    5    7850.522    7975.720  ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     259  thrpt    5    4281.749    4373.926  ops/ms
>> IndexInRangeBenchmark.doubleIndexInRange     515  thrpt    5    2160.001     604.458  ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      131  thrpt    5   13594.943   13306.904  ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      259  thrpt    5    8163.134    7912.343  ops/ms
>> IndexInRangeBenchmark.floatIndexInRange      515  thrpt    5    4335.529    4198.555  ops/ms
>> IndexInRangeBenchmark.intIndexInRange        131  thrpt    5   22106.880   20348.266  ops/ms
>> IndexInRangeBenchmark.intIndexInRange        259  thrpt    5   11711.588   10958.299  ops/ms
>> IndexInRangeBenchmark.intIndexInRange        515  thrpt    5    5501.034    5358.441  ops/ms
>> IndexInRangeBenchmark.longIndexInRange       131  thrpt    5    9832.578    9829.398  ops/ms
>> IndexInRangeBenchmark.longIndexInRange       259  thrpt    5    4979.326    4947.166  ops/ms
>> IndexInRangeBenchmark.longIndexInRange       515  thrpt    5    2269.131     614.204  ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      131  thrpt    5   19865.866   19297.628  ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      259  thrpt    5   14005.214   13592.407  ops/ms
>> IndexInRangeBenchmark.shortIndexInRange      515  thrpt    5    8766.450    8531.675  ops/ms
>>
>> In conclusion, I would prefer to keep the current version. WDYT?
>
> Yes, I agree. I collected performance data with your benchmarks at various AVX levels and see significant gains. The performance variation between the integer and floating-point cases is due to the limitation described below.
> https://github.com/openjdk/jdk/pull/12064#discussion_r1094101761
> Which can be addressed in a separate PR.
>
> FTR, here are the performance numbers at UseAVX=3:
>
>
> Benchmark                                 (size)  Mode   Cnt       Score  Error  Units
> IndexInRangeBenchmark.byteIndexInRange      1024  thrpt    2   74983.406         ops/ms
> IndexInRangeBenchmark.byteIndexInRange      1027  thrpt    2   19156.962         ops/ms
> IndexInRangeBenchmark.doubleIndexInRange    1024  thrpt    2   11368.179         ops/ms
> IndexInRangeBenchmark.doubleIndexInRange    1027  thrpt    2    2165.207         ops/ms
> IndexInRangeBenchmark.floatIndexInRange     1024  thrpt    2   18736.787         ops/ms
> IndexInRangeBenchmark.floatIndexInRange     1027  thrpt    2    3798.996         ops/ms
> IndexInRangeBenchmark.intIndexInRange       1024  thrpt    2   18797.863         ops/ms
> IndexInRangeBenchmark.intIndexInRange       1027  thrpt    2    5455.317         ops/ms
> IndexInRangeBenchmark.longIndexInRange      1024  thrpt    2   11866.493         ops/ms
> IndexInRangeBenchmark.longIndexInRange      1027  thrpt    2    2227.896         ops/ms
> IndexInRangeBenchmark.shortIndexInRange     1024  thrpt    2   46921.520         ops/ms
> IndexInRangeBenchmark.shortIndexInRange     1027  thrpt    2    8532.394         ops/ms
Thanks for the performance testing on x86 systems! I agree that a separate PR is fine, and I will address it once this PR is merged.
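
For anyone following the thread, here is a rough scalar model of what indexInRange() computes, which may help when reading the unaligned-size numbers above. It is only a sketch and not the JDK implementation: the class IndexInRangeModel, the boolean[] mask representation, and the trueCount helper are hypothetical names made up for illustration, and the 16-lane width assumes a 128-bit byte vector. The two early returns mirror the fast paths in the quoted snippet, except that the subtraction is done in long arithmetic here.

```java
import java.util.Arrays;

// Hypothetical scalar model for illustration only; not JDK code.
public class IndexInRangeModel {

    /**
     * Lane i of the result is set only if it was set in the input mask
     * and the adjusted index offset + i lies in [0, limit).
     */
    public static boolean[] indexInRange(boolean[] mask, int offset, int limit) {
        int vlength = mask.length;
        // Fast path: every adjusted index is in range, so the mask is unchanged.
        // The subtraction is widened to long so it cannot overflow.
        if (offset >= 0 && (long) limit - offset >= vlength) {
            return mask.clone();
        }
        // Fast path: no adjusted index can be below the limit.
        if (offset >= limit) {
            return new boolean[vlength];
        }
        // Slow path (the tail): test each lane's adjusted index individually.
        boolean[] result = new boolean[vlength];
        for (int i = 0; i < vlength; i++) {
            long index = (long) offset + i;
            result[i] = mask[i] && index >= 0 && index < limit;
        }
        return result;
    }

    /** Counts the set lanes in a mask. */
    public static int trueCount(boolean[] mask) {
        int count = 0;
        for (boolean lane : mask) {
            if (lane) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int vlength = 16;  // assumption: 128-bit vector of bytes
        int size = 131;    // one of the unaligned sizes used in the benchmark
        boolean[] all = new boolean[vlength];
        Arrays.fill(all, true);
        // Walk the array one vector at a time; only the last step hits the tail.
        for (int offset = 0; offset < size; offset += vlength) {
            boolean[] m = indexInRange(all, offset, size);
            System.out.println("offset " + offset + ": " + trueCount(m) + " active lanes");
        }
    }
}
```

With these assumed values, only the final iteration (offset 128) takes the per-lane path, which is the case the benchmarks with sizes such as 131, 259 and 515 are meant to exercise.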
-------------
PR: https://git.openjdk.org/jdk/pull/12064