[vectorIntrinsics+mask] RFR: 8273057: [vector] New VectorAPI "SelectiveStore"

Mon Aug 30 09:04:00 UTC 2021

On Fri, 27 Aug 2021 09:47:10 GMT, Joshua Zhu <jzhu at openjdk.org> wrote:

> Hi,
> 
> I want to propose a new VectorAPI "Selective Store/Load" and share my
> implementation. Currently Alibaba's internal databases are in the
> process of applying VectorAPI and they have requirements on "Selective
> Store" for acceleration.
> 
> My proposed VectorAPI is declared as below [1]:
> 
>     int selectiveIntoArray($type$[] a, int offset, VectorMask<$Boxtype$> m);
> 
> The active elements (with their respective bit set in mask) are
> contiguously stored into the array "a". Assume N is the true count of
> mask, the elements starting from a[offset+N] till a[offset+laneCount]
> are left unchanged. The return value represents the number of elements
> store into the array and "offset + return value" is the new offset of
> the next iteration.
> ![image](https://user-images.githubusercontent.com/70769035/131108509-3dcb61f3-e8d0-4b4e-9b49-a72c077aaba6.png)
> This API will be used like the following manner [2]:
> 
>     tld.conflict_cnt = 0;
>     for (int i = 0; i < ARRAY_LENGTH; i += INT_PREFERRED_SPECIES.length()) {
>       IntVector av = IntVector.fromArray(INT_PREFERRED_SPECIES, tld.int_input1, i);
>       IntVector bv = IntVector.fromArray(INT_PREFERRED_SPECIES, tld.int_input2, i);
>       IntVector cv = IntVector.fromArray(INT_PREFERRED_SPECIES, tld.int_index, i);
>       VectorMask<Integer> mask = av.compare(VectorOperators.NE, bv);
>       tld.conflict_cnt += cv.selectiveIntoArray(tld.conflict_array, tld.conflict_cnt, mask);
>     }
> 
> My patch includes the following changes:
>   * Selective Store VectorAPI for Long & Int
>   * Assembler: add x86 instruction "VPCOMPRESSD" and "VPCOMPRESSQ"
>   * Instruction selection: vselective_store; kmask_truecount (true count of kregister)
>   * Add node "StoreVectorSelective"
>   * Add a new parameter "is_selective" in inline_vector_mem_masked_operation()
>     in order to distinguish Masked version or Selective version
>   * jtreg cases
>   * JMH benchmark
>       
> TODO parts I will implement:
>   * Selective Store for other types
>   * Selective Load
>   * Some potential optimization. Such as: when mask is allTrue, SelectiveIntoArray() -> IntoArray()
> 
> Test:
>   * Passed VectorAPI jtreg cases.
>   * Result of JMH benchmark to evaluate API's performance in Alibaba's real scenario.
>       UseAVX=3; thread number = 8; conflict data percentage: 20% (that means 20% of mask bits are true)
>       http://cr.openjdk.java.net/~jzhu/8273057/jmh_benchmark_result.pdf
> 
> [1] https://github.com/JoshuaZhuwj/panama-vector/commit/69623f7d6a1eae532576359328b96162d8e16837#diff-13cc2d6ec18e487ddae05cda671bdb6bb7ffd42ff7bc51a2e00c8c5e622bd55dR4667
> [2] https://github.com/JoshuaZhuwj/panama-vector/commit/69623f7d6a1eae532576359328b96162d8e16837#diff-951d02bd72a931ac34bc85d1d4e656a14f8943e143fc9282b36b9c76c1893c0cR144
> [3] failed to inline (intrinsic) by https://github.com/openjdk/panama-vector/blob/60aa8ca6dc0b3f1a3ee517db167f9660012858cd/src/hotspot/cpu/x86/x86.ad#L1769
> 
> Best Regards,
> Joshua

Hi Paul,

Thanks a lot for your quick reply and share of design thinking.

> We already have general mask accepting scatter/gather store/load. (I have always been a bit uncertain whether we have the right signatures for these methods, and whether they are necessary if we can use shuffles.)
> 
> To use the scatter store method today for your use-case we would have to:
> 
> - compute an int[] array from the set lanes of the mask, M say
> - computing the ?prefix" mask from the set lanes of M, PM
> - store the vector using the int[] array and PM.
> 
> Another alternative is to:
> 
> - compute a ?compression? shuffle, S, from the set lanes of the mask, M
> - apply S to the vector, produce a compressed vector CV
> - computing the ?prefix" mask from the set lanes of M, PM
> - store CV using M

In my opinion, no matter the first scatter store way or the second shuffle way,
the key to resolve this use case is similar: how to generate "compression" shuffle or "compression" index array according to the set lanes of the mask.
(The "prefix" mask could be computed by true count of mask.)
But when compose existing primitives, how much performance is affected is also important compared to native instruction support or scalar version.

When I help database developers to apply VectorAPI, they usually care:
  *how much performance they could gain;
  *VectorAPI's ease of use to reduce code complexity and ensure portability;
  *whether intrinsics are inlined in their scenarios (fail cases bring performance degradation).

Four vectorized data movement primitives: selective load, selective store, gather, and scatter
are critical SIMD operations for in-memory databases.
Currently gather and scatter are supported in VectorAPI.
That's why I help them implement selective ops and try to propose "Selective Store/Load" in VectorAPI.

> The primitive I am searching for might be a way to create a shuffle from a mask.

What I know for creating shuffle from mask is to leverage a precomputed, cache resident table.
Take four bits mask as example,

mask_index | mask |  index_array
    0      | 0000 |  
    1      | 0001 |  3
    2      | 0010 |  2
    3      | 0011 |  2, 3
    4      | 0100 |  1
    5      | 0101 |  1, 3
   ...
    14     | 1110 |  0, 1, 2
    15     | 1111 |  0, 1, 2, 3

The index_array could be treated as index map or converted into shuffle.

> Let?s say we could write:
> 
> Int[] a = ...
> IntVector v = ...
> VectorMask m = ?
> 
> // The new primitive, create a shuffle from the mask that partitions vector elements
> // according to the set and unset mask lane elements.
> VectorShuffle s = m.toPartitioningShuffle();
> // Partition the elements
> IntVector cv = v.rearrange(s);
> 
> // This method is likely not optimal, yet!
> // Another method could be added that is prefix(int length)
> VectorMask pm = m.species().indexInRange(0, m.trueCount());
> 
> // Use existing masked store
> cv.intoArray(a, offset, pm);
> // Increase offset by number of stores
> offset += m.trueCount();
> 
> Is it possible for C2 to detect the kind of shuffle pattern and masking to sufficiently optimize? Experts please chime in!

I think the pattern will be too complicated and the optimization may be not reliable.

Thanks,
Joshua

-------------

PR: https://git.openjdk.java.net/panama-vector/pull/115