RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures

Thu May 5 03:15:10 UTC 2022

Naive question: What happens if one of the vector elements that should not
have been updated is concurrently being written by another thread? Aren't
you generating writes to vector elements that should not have been written?
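
To make the concern concrete, here is a minimal scalar sketch in plain Java. It is not the code HotSpot generates and it does not use the Vector API. The class name, the method names and the `otherThreadWrite` hook are all hypothetical. The hook replays, deterministically, a write from another thread that lands between the full load and the full store.

```java
import java.util.Arrays;

final class BlendStoreSketch {
    // Scalar emulation of "load + blend + store" over v.length lanes.
    // otherThreadWrite (hypothetical hook) runs between the load and the
    // store to model a concurrent write landing in that window.
    static void blendStore(int[] mem, int addr, int[] v, boolean[] m,
                           Runnable otherThreadWrite) {
        int[] memV = Arrays.copyOfRange(mem, addr, addr + v.length); // 1) non-masked load
        int[] blended = new int[v.length];
        for (int i = 0; i < v.length; i++) {
            blended[i] = m[i] ? v[i] : memV[i];                      // 2) blend
        }
        if (otherThreadWrite != null) {
            otherThreadWrite.run();
        }
        System.arraycopy(blended, 0, mem, addr, v.length);           // 3) non-masked store
    }

    // Per-lane masked store: lanes with a false mask bit are never written.
    static void scalarMaskedStore(int[] mem, int addr, int[] v, boolean[] m,
                                  Runnable otherThreadWrite) {
        if (otherThreadWrite != null) {
            otherThreadWrite.run();
        }
        for (int i = 0; i < v.length; i++) {
            if (m[i]) {
                mem[addr + i] = v[i];
            }
        }
    }

    public static void main(String[] args) {
        int[] v = {1, 2, 3, 4};
        boolean[] m = {true, false, true, false};

        int[] a = {0, 0, 0, 0};
        blendStore(a, 0, v, m, () -> a[1] = 99);
        System.out.println(Arrays.toString(a)); // [1, 0, 3, 0]: the 99 is overwritten

        int[] b = {0, 0, 0, 0};
        scalarMaskedStore(b, 0, v, m, () -> b[1] = 99);
        System.out.println(Arrays.toString(b)); // [1, 99, 3, 0]: the 99 survives
    }
}
```

In this replay the blended store writes the stale value back into lane 1, which the mask said should be left alone, so the other writer's update is lost.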

Hans

On Wed, May 4, 2022 at 7:08 PM Xiaohong Gong <xgong at openjdk.java.net> wrote:

> Currently the vectorization of masked vector store is implemented by the
> masked store instruction only on architectures that support the predicate
> feature. The compiler falls back to the Java scalar code on architectures
> without predicate support, such as ARM NEON. However, on these systems
> the masked store can be vectorized with the non-masked vector
> "load + blend + store". For example, storing a vector "v" controlled by
> a mask "m" into memory at address "addr" (i.e. "store(addr, v, m)")
> can be implemented with:
>
>
>  1) mem_v = load(addr)     ; non-masked load from the same memory
>  2) v = blend(mem_v, v, m) ; blend with the src vector with the mask
>  3) store(addr, v)         ; non-masked store into the memory
>
>
> Since the initial full load requires the array offset to be inside the
> valid array bounds, the compiler does the vectorization only when the
> offset is within the array bounds. The compiler still falls back to the
> Java scalar code if not all offsets are valid. Besides, the original
> offset check for masked lanes is applied only when the offset is not
> always inside the array range. This also improves the performance of
> masked store when the offset is always valid. The whole process is
> similar to the masked load API.
>
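
The bounds rule described above can be sketched in plain Java as follows. This is only an illustration, under the assumption that "offset in range" means the whole vector `[offset, offset + VLEN)` fits in the array. `VLEN`, the class name and the returned path labels are hypothetical, and the real decision is made inside the JIT compiler, not in user code.

```java
final class MaskedStoreBounds {
    static final int VLEN = 4; // hypothetical lane count

    // True when all VLEN lanes starting at offset lie inside an array of the
    // given length, so a full (non-masked) load and store is safe.
    static boolean fullVectorInBounds(int length, int offset) {
        return offset >= 0 && offset <= length - VLEN;
    }

    // Returns which path was taken, for illustration only.
    static String store(int[] a, int offset, int[] v, boolean[] m) {
        if (fullVectorInBounds(a.length, offset)) {
            // Fast path: emulate load + blend + store over all lanes.
            for (int i = 0; i < VLEN; i++) {
                a[offset + i] = m[i] ? v[i] : a[offset + i];
            }
            return "blend";
        }
        // Fallback: check only the lanes whose mask bit is set, then store them.
        for (int i = 0; i < VLEN; i++) {
            int idx = offset + i;
            if (m[i] && (idx < 0 || idx >= a.length)) {
                throw new IndexOutOfBoundsException("lane " + i + " at index " + idx);
            }
        }
        for (int i = 0; i < VLEN; i++) {
            if (m[i]) {
                a[offset + i] = v[i];
            }
        }
        return "scalar";
    }
}
```

With a six-element array, `store(a, 2, v, m)` takes the blend path, while `store(a, 4, v, m)` takes the scalar path and throws only if a lane with a set mask bit falls outside the array.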
> Here is the performance data for the masked vector store benchmarks on an
> x86 system without AVX-512, showing an improvement of about 20x to 50x:
>
> Benchmark                                  before    after   Units
> StoreMaskedBenchmark.byteStoreArrayMask   221.733  11094.126 ops/ms
> StoreMaskedBenchmark.doubleStoreArrayMask  41.086   1034.408 ops/ms
> StoreMaskedBenchmark.floatStoreArrayMask   73.820   1985.015 ops/ms
> StoreMaskedBenchmark.intStoreArrayMask     75.028   2027.557 ops/ms
> StoreMaskedBenchmark.longStoreArrayMask    40.929   1032.928 ops/ms
> StoreMaskedBenchmark.shortStoreArrayMask  135.794   5307.567 ops/ms
>
> A similar performance gain can also be observed on ARM NEON systems.
>
> And here is the performance data on an x86 AVX-512 system, showing an
> improvement of about 1.88x to 2.81x:
>
> Benchmark                                  before     after   Units
> StoreMaskedBenchmark.byteStoreArrayMask   11185.956 21012.824 ops/ms
> StoreMaskedBenchmark.doubleStoreArrayMask  1480.644  3911.720 ops/ms
> StoreMaskedBenchmark.floatStoreArrayMask   2738.352  7708.365 ops/ms
> StoreMaskedBenchmark.intStoreArrayMask     4191.904  9300.428 ops/ms
> StoreMaskedBenchmark.longStoreArrayMask    2025.031  4604.504 ops/ms
> StoreMaskedBenchmark.shortStoreArrayMask   8339.389 17817.128 ops/ms
>
> A similar performance gain can also be observed on ARM SVE systems.
>
> -------------
>
> Depends on: https://git.openjdk.java.net/jdk/pull/8035
>
> Commit messages:
>  - 8284050: [vectorapi] Optimize masked store for non-predicated
> architectures
>  - 8283667: [vectorapi] Vectorization for masked load with IOOBE with
> predicate feature
>
> Changes: https://git.openjdk.java.net/jdk/pull/8544/files
>  Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8544&range=00
>   Issue: https://bugs.openjdk.java.net/browse/JDK-8284050
>   Stats: 1708 lines in 44 files changed: 710 ins; 188 del; 810 mod
>   Patch: https://git.openjdk.java.net/jdk/pull/8544.diff
>   Fetch: git fetch https://git.openjdk.java.net/jdk
> pull/8544/head:pull/8544
>
> PR: https://git.openjdk.java.net/jdk/pull/8544
>