RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2]
Xiaohong Gong
xgong at openjdk.java.net
Thu May 5 03:29:16 UTC 2022
On Thu, 5 May 2022 02:09:39 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> Currently, masked vector stores are vectorized with the masked store instruction only on architectures that support the predicate feature. On architectures without predicate support, such as ARM NEON, the compiler falls back to the Java scalar code. However, on these systems the masked store can be vectorized with the non-masked vector sequence `load + blend + store`. For example, storing a vector `v` controlled by a mask `m` into memory at address `addr` (i.e. `store(addr, v, m)`) can be implemented with:
>>
>>
>> 1) mem_v = load(addr) ; non-masked load from the same memory
>> 2) v = blend(mem_v, v, m) ; blend the loaded vector with the src vector under the mask
>> 3) store(addr, v) ; non-masked store into the memory
>>
>>
>> Since the initial full load requires the whole vector to lie inside the valid array bounds, the compiler vectorizes the store only when the offset is within the array bounds, and still falls back to the Java scalar code when not all offsets are valid. In addition, the original offset check for masked lanes is now applied only when the offset is not always inside the array range, which also improves masked store performance when the offset is always valid. The whole process is similar to the masked load API.
>>
>> Here is the performance data for the masked vector store benchmarks on an x86 system without AVX-512, which shows an improvement of about `20x ~ 50x`:
>>
>> Benchmark                                   before     after       Units
>> StoreMaskedBenchmark.byteStoreArrayMask     221.733    11094.126   ops/ms
>> StoreMaskedBenchmark.doubleStoreArrayMask   41.086     1034.408    ops/ms
>> StoreMaskedBenchmark.floatStoreArrayMask    73.820     1985.015    ops/ms
>> StoreMaskedBenchmark.intStoreArrayMask      75.028     2027.557    ops/ms
>> StoreMaskedBenchmark.longStoreArrayMask     40.929     1032.928    ops/ms
>> StoreMaskedBenchmark.shortStoreArrayMask    135.794    5307.567    ops/ms
>>
>> A similar performance gain can also be observed on ARM NEON systems.
>>
>> And here is the performance data on an x86 AVX-512 system, which shows an improvement of about `1.88x - 2.81x`:
>>
>> Benchmark                                   before      after       Units
>> StoreMaskedBenchmark.byteStoreArrayMask     11185.956   21012.824   ops/ms
>> StoreMaskedBenchmark.doubleStoreArrayMask   1480.644    3911.720    ops/ms
>> StoreMaskedBenchmark.floatStoreArrayMask    2738.352    7708.365    ops/ms
>> StoreMaskedBenchmark.intStoreArrayMask      4191.904    9300.428    ops/ms
>> StoreMaskedBenchmark.longStoreArrayMask     2025.031    4604.504    ops/ms
>> StoreMaskedBenchmark.shortStoreArrayMask    8339.389    17817.128   ops/ms
>>
>> A similar performance gain can also be observed on ARM SVE systems.
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains one commit:
>
> 8284050: [vectorapi] Optimize masked store for non-predicated architectures
> _Mailing list message from [Hans Boehm](mailto:hboehm at google.com) on [hotspot-dev](mailto:hotspot-dev at mail.openjdk.java.net):_
>
> Naive question: What happens if one of the vector elements that should not have been updated is concurrently being written by another thread? Aren't you generating writes to vector elements that should not have been written?
>
> Hans
>
> On Wed, May 4, 2022 at 7:08 PM Xiaohong Gong <xgong at openjdk.java.net> wrote:
Yeah, this is similar to the concern that @rose00 raised above. The current solution does not work correctly for multi-threaded programs. I will look into better solutions. Thanks for the comments!
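
To make the concern concrete, here is a minimal scalar sketch of the `load + blend + store` sequence with a deterministic, single-threaded simulation of the problematic interleaving. It is not the HotSpot implementation; the class name, the helper names and the "thread A" / "thread B" labels are hypothetical and only illustrate why writing back the unmasked lanes can lose a concurrent update.

```java
import java.util.Arrays;

// Hypothetical illustration only; this is not code from the PR.
final class BlendStoreHazard {
    // Step 1: non-masked load of `lanes` elements starting at `offset`.
    static int[] load(int[] mem, int offset, int lanes) {
        return Arrays.copyOfRange(mem, offset, offset + lanes);
    }

    // Step 2: take v[i] where the mask is set, otherwise keep the loaded value.
    static int[] blend(int[] memV, int[] v, boolean[] m) {
        int[] out = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = m[i] ? v[i] : memV[i];
        }
        return out;
    }

    // Step 3: non-masked store of every lane, including the unmasked ones.
    static void store(int[] mem, int offset, int[] v) {
        System.arraycopy(v, 0, mem, offset, v.length);
    }

    // Simulates one bad interleaving and returns the final memory contents.
    static int[] simulateLostUpdate() {
        int[] mem = {0, 0, 0, 0};
        int[] v = {1, 2, 3, 4};
        boolean[] m = {true, true, false, false};

        int[] memV = load(mem, 0, 4);      // "thread A" loads all four lanes
        mem[3] = 99;                       // "thread B" writes lane 3, which A's mask excludes
        store(mem, 0, blend(memV, v, m));  // A stores all lanes and writes the stale 0 over 99
        return mem;
    }

    public static void main(String[] args) {
        // Lane 3 ends up as 0, so the write of 99 is lost.
        System.out.println(Arrays.toString(simulateLostUpdate()));
    }
}
```

With a true masked store, lane 3 would never be written by "thread A", so the value 99 would survive.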
src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 3483:
> 3481: ByteSpecies vsp = vspecies();
> 3482: if (offset >= 0 && offset <= (a.length - vsp.length())) {
> 3483: intoBooleanArray0(a, offset, m, /* offsetInRange */ true);
Checking the offset up front avoids the `checkMaskFromIndexSize` call when the offset is within the valid array bounds, which also improves performance. @rose00, do you think this part of the change is OK at least? Thanks!
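
The idea can be sketched in plain Java as follows. This is an illustration under stated assumptions rather than the JDK code: `offsetInRange` mirrors the condition quoted from `ByteVector.java` above, while `checkMaskedLanes` is a hypothetical stand-in for the per-lane check (the real `checkMaskFromIndexSize` is not reproduced here), and the scalar loop stands in for the store itself.

```java
// Hypothetical illustration only; names other than the quoted condition are assumptions.
final class OffsetRangeCheck {
    // True when all `lanes` elements starting at `offset` fit inside the array.
    static boolean offsetInRange(int offset, int arrayLength, int lanes) {
        return offset >= 0 && offset <= arrayLength - lanes;
    }

    // Slow path: only lanes whose mask bit is set must be inside the array.
    static void checkMaskedLanes(int offset, int arrayLength, boolean[] m) {
        for (int i = 0; i < m.length; i++) {
            int idx = offset + i;
            if (m[i] && (idx < 0 || idx >= arrayLength)) {
                throw new IndexOutOfBoundsException("masked lane " + i + " maps to index " + idx);
            }
        }
    }

    // Masked store: the per-lane check runs only when the cheap range test fails.
    static void maskedStore(int[] a, int offset, int[] v, boolean[] m) {
        if (!offsetInRange(offset, a.length, v.length)) {
            checkMaskedLanes(offset, a.length, m);
        }
        for (int i = 0; i < v.length; i++) {
            if (m[i]) {
                a[offset + i] = v[i];
            }
        }
    }
}
```

For a six-element array and four lanes, offset 2 takes the fast path, while offset 3 goes through the per-lane check and succeeds only if the last lane is masked off.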
-------------
PR: https://git.openjdk.java.net/jdk/pull/8544