[vectorIntrinsics+mask] RFR: 8273057: [vector] New VectorAPI "SelectiveStore"

Tue Sep 7 09:41:54 UTC 2021

On Fri, 27 Aug 2021 09:47:10 GMT, Joshua Zhu <jzhu at openjdk.org> wrote:

> Hi,
> 
> I want to propose a new VectorAPI "Selective Store/Load" and share my
> implementation. Currently Alibaba's internal databases are in the
> process of applying VectorAPI and they have requirements on "Selective
> Store" for acceleration.
> 
> My proposed VectorAPI is declared as below [1]:
> 
>     int selectiveIntoArray($type$[] a, int offset, VectorMask<$Boxtype$> m);
> 
> The active elements (those whose bit is set in the mask) are
> stored contiguously into the array "a". If N is the true count of the
> mask, the elements from a[offset+N] up to a[offset+laneCount]
> are left unchanged. The return value is the number of elements
> stored into the array, and "offset + return value" is the offset for
> the next iteration.
> ![image](https://user-images.githubusercontent.com/70769035/131108509-3dcb61f3-e8d0-4b4e-9b49-a72c077aaba6.png)
> This API will be used in the following manner [2]:
> 
>     tld.conflict_cnt = 0;
>     for (int i = 0; i < ARRAY_LENGTH; i += INT_PREFERRED_SPECIES.length()) {
>       IntVector av = IntVector.fromArray(INT_PREFERRED_SPECIES, tld.int_input1, i);
>       IntVector bv = IntVector.fromArray(INT_PREFERRED_SPECIES, tld.int_input2, i);
>       IntVector cv = IntVector.fromArray(INT_PREFERRED_SPECIES, tld.int_index, i);
>       VectorMask<Integer> mask = av.compare(VectorOperators.NE, bv);
>       tld.conflict_cnt += cv.selectiveIntoArray(tld.conflict_array, tld.conflict_cnt, mask);
>     }
> 
> My patch includes the following changes:
>   * Selective Store VectorAPI for Long & Int
>   * Assembler: add x86 instruction "VPCOMPRESSD" and "VPCOMPRESSQ"
>   * Instruction selection: vselective_store; kmask_truecount (true count of kregister)
>   * Add node "StoreVectorSelective"
>   * Add a new parameter "is_selective" in inline_vector_mem_masked_operation()
>     in order to distinguish Masked version or Selective version
>   * jtreg cases
>   * JMH benchmark
>       
> TODO parts I will implement:
>   * Selective Store for other types
>   * Selective Load
>   * Some potential optimizations, such as: when the mask is allTrue, selectiveIntoArray() -> intoArray()
> 
> Test:
>   * Passed VectorAPI jtreg cases.
>   * Result of JMH benchmark to evaluate API's performance in Alibaba's real scenario.
>       UseAVX=3; thread number = 8; conflict data percentage: 20% (that means 20% of mask bits are true)
>       http://cr.openjdk.java.net/~jzhu/8273057/jmh_benchmark_result.pdf
> 
> [1] https://github.com/JoshuaZhuwj/panama-vector/commit/69623f7d6a1eae532576359328b96162d8e16837#diff-13cc2d6ec18e487ddae05cda671bdb6bb7ffd42ff7bc51a2e00c8c5e622bd55dR4667
> [2] https://github.com/JoshuaZhuwj/panama-vector/commit/69623f7d6a1eae532576359328b96162d8e16837#diff-951d02bd72a931ac34bc85d1d4e656a14f8943e143fc9282b36b9c76c1893c0cR144
> [3] failed to inline (intrinsic) by https://github.com/openjdk/panama-vector/blob/60aa8ca6dc0b3f1a3ee517db167f9660012858cd/src/hotspot/cpu/x86/x86.ad#L1769
> 
> Best Regards,
> Joshua

> While adding new macro-level APIs is appealing, we can also extend the following existing Vector APIs to accept another boolean flag "is_selective", under which compression/expansion is triggered.
> 
> ```
> public static IntVector fromArray(VectorSpecies<Integer> species,
> int[] a,
> int offset,
> VectorMask<Integer> m) 
> 
> 
> public final void intoArray(int[] a,
> int offset,
> VectorMask<Integer> m)
> ```

Per the design discussion in this thread, a vector-to-vector compress/expand operation is a friendlier primitive than a vector-to-memory operation.
It can also be used to "bridge to and from permutation simply by working with index vectors like iota, and perhaps (as sugar) lifting selected vector operations to shuffles."
On other architectures, such as SVE, the memory-destination version is also not supported natively.
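
To make the relationship concrete, here is a minimal scalar sketch of how a selective store could be layered on a vector-to-vector compress. It is only a reference model on plain arrays, not the Vector API or the patch itself: the class and method names are hypothetical, and zeroing the lanes past the true count is an assumption of this sketch.

```java
import java.util.Arrays;

// Scalar reference model only. All names here are hypothetical and
// chosen for illustration; none of them are Vector API methods.
final class CompressSketch {

    // Vector-to-vector compress: active lanes are packed into the low
    // lanes of the result. Assumption: the remaining lanes are zero.
    static int[] compress(int[] lanes, boolean[] mask) {
        int[] result = new int[lanes.length]; // tail lanes stay 0
        int n = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                result[n++] = lanes[i];
            }
        }
        return result;
    }

    // Number of set lanes in the mask.
    static int trueCount(boolean[] mask) {
        int n = 0;
        for (boolean b : mask) {
            if (b) {
                n++;
            }
        }
        return n;
    }

    // Selective store expressed as compress + store of the first N lanes.
    // Elements of "a" past offset+N are left unchanged.
    static int selectiveIntoArray(int[] lanes, int[] a, int offset, boolean[] mask) {
        int[] packed = compress(lanes, mask);
        int n = trueCount(mask);
        System.arraycopy(packed, 0, a, offset, n);
        return n;
    }

    public static void main(String[] args) {
        int[] lanes = {10, 20, 30, 40};
        boolean[] mask = {false, true, false, true};
        int[] dst = {-1, -1, -1, -1, -1};
        int n = selectiveIntoArray(lanes, dst, 1, mask);
        System.out.println(n + " " + Arrays.toString(dst)); // prints "2 [-1, 20, 40, -1, -1]"
    }
}
```

With compress as the primitive, the memory form needs only a true count and a partial store of the low lanes, which is why the vector-to-vector version composes more easily.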

> In this use case it's difficult to infer COMPRESSION through the auto-vectorizer, though we made attempts in the past to infer complex loop patterns for VNNI instructions.

Could you elaborate on it please? I do not follow this.

> This way we can also share common optimizations, as you suggested earlier, to convert masked COMPRESS to an unmasked vector move for an ALLTRUE mask; some work [1][2] is already in place on this front.
> 
> [1] https://github.com/openjdk/panama-vector/blob/master/src/hotspot/share/opto/vectornode.cpp#L752
> [2] https://github.com/openjdk/panama-vector/blob/master/src/hotspot/share/opto/vectornode.cpp#L771

Yes. Since the compress/expand op is also mask-based, this optimization is common. Maybe we can find a way to share it across the different kinds of masked operations?
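
As a rough illustration of what sharing could look like, the scalar sketch below routes both a masked store and a compressing store through one all-true check that falls back to a plain copy. This is only a model on plain arrays with hypothetical names; it is not how the C2 transformation in vectornode.cpp is written.

```java
// Scalar illustration only. Names are hypothetical; the real
// optimization would live in the compiler, not in library code.
final class AllTrueSketch {

    // Shared predicate: true when every lane of the mask is set.
    static boolean allTrue(boolean[] mask) {
        for (boolean b : mask) {
            if (!b) {
                return false;
            }
        }
        return true;
    }

    // Shared fast path: an unmasked store of all lanes.
    static void plainStore(int[] lanes, int[] a, int offset) {
        System.arraycopy(lanes, 0, a, offset, lanes.length);
    }

    // Masked store: active lanes are written at their own positions.
    static void maskedStore(int[] lanes, int[] a, int offset, boolean[] mask) {
        if (allTrue(mask)) {
            plainStore(lanes, a, offset);
            return;
        }
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                a[offset + i] = lanes[i];
            }
        }
    }

    // Compressing store: active lanes are written contiguously.
    // Returns the number of elements stored.
    static int compressStore(int[] lanes, int[] a, int offset, boolean[] mask) {
        if (allTrue(mask)) {
            plainStore(lanes, a, offset);
            return lanes.length;
        }
        int n = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                a[offset + n++] = lanes[i];
            }
        }
        return n;
    }
}
```

Both operations give the same result with or without the fast path, so the check can be factored out once and reused by any mask-based operation.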

-------------

PR: https://git.openjdk.java.net/panama-vector/pull/115