[vectorIntrinsics+mask] RFR: 8273057: [vector] New VectorAPI "SelectiveStore"

Paul Sandoz paul.sandoz at oracle.com
Thu Sep 2 16:50:10 UTC 2021


Hi Joshua,

I think we still have some exploring to do on the design, and for others to comment, especially with regard to C2 capabilities.


Here is another design alternative, on the spectrum between a partitioning/compressing shuffle created from a mask [*] and a compress method:

  VectorMask<Integer> mask = ...;
  IntVector bv = av.rearrange(VectorOperators.COMPRESS, mask);
  VectorMask<Integer> prefixMask = prefix(mask.trueCount());
  bv.intoArray(array, offset, prefixMask);
  offset += mask.trueCount();
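
Here prefix(n) is shorthand for a mask whose first n lanes are set; as a sketch with the existing API it could be obtained as:

  static VectorMask<Integer> prefix(VectorSpecies<Integer> species, int n) {
    // Lanes 0 .. n-1 set, the rest unset; assumes 0 <= n <= species.length().
    return species.indexInRange(0, n);
  }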

We introduce a new kind of operator, Rearrange, and constants, such as VectorOperators.COMPRESS, that specify the behavior of the non-mask and mask-accepting rearrange methods. COMPRESS specifies that:

1) the non-mask rearrange is an identity operation; and
2) the mask-accepting rearrange describes mask-based cross-lane movement.

It should be possible to create a shuffle from a Rearrange operator, with and without a mask, so the equivalent functionality can be applied to a shuffle-accepting rearrange, e.g. for COMPRESS:

  rearrange(Shuffle.fromRearrangeOp(COMPRESS, mask), mask.prefix())
  Or
  rearrange(Shuffle.fromRearrangeOp(COMPRESS, mask), zero())
  // Postfix of exceptional lanes in the shuffle, representing unset lanes 
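
In plain Java terms, the COMPRESS shuffle such a factory would return could be built as follows (a sketch; fromRearrangeOp is the hypothetical part, VectorShuffle.fromValues is existing API):

  static VectorShuffle<Integer> compressShuffle(VectorSpecies<Integer> species,
                                                VectorMask<Integer> mask) {
    int[] indexes = new int[species.length()];
    int j = 0;
    // Source lanes whose mask bit is set map to the front, in order.
    for (int lane = 0; lane < species.length(); lane++) {
      if (mask.laneIsSet(lane)) {
        indexes[j++] = lane;
      }
    }
    // The remaining slots point at lane 0 here; the prefix mask (or exceptional
    // indexes, as above) keeps them from contributing to the stored result.
    return VectorShuffle.fromValues(species, indexes);
  }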

For this to be of benefit we would need to come up with other realistic Rearrange operators, even if we do not add them right now, e.g. IOTA, REVERSE, PARTITION_TRUE, PARTITION_FALSE, INTERLEAVE.

However, the design is a little awkward since the mask may or may not contribute to cross-lane movement, and so the operator needs to state the equivalence.

In effect the Rearrange operator is a mechanism to refer to certain kinds of shuffle as a constant. Ideally I would still prefer it if we could implicitly identify what would otherwise be rearrange operators based on the creation of shuffles with known content, e.g. could C2 somehow tag a shuffle instance with an ID of COMPRESS and a dependency on the mask used for its creation?

—

FWIW another way to think about a partitioning/compression shuffle:

  SPECIES.iota().compress(m);

Which is just a specific way of shuffling a shuffle. We could actually track the kinds of shuffle as final fields of the VectorShuffle implementation.
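
Spelled out (conceptually, since a vector compress is itself part of what is being proposed):

  // iota holds the lane indexes 0, 1, ..., VLENGTH-1; compressing it packs the
  // indexes of the set lanes towards lane 0, and toShuffle() reinterprets them.
  VectorShuffle<Integer> sh = SPECIES.iota().compress(m).toShuffle();
  // av.rearrange(sh) then matches av.compress(m) on the selected lanes.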

Paul. 

[*] We could consider this independently


> On Sep 1, 2021, at 4:55 AM, Joshua Zhu <jzhu at openjdk.java.net> wrote:
> 
> On Fri, 27 Aug 2021 09:47:10 GMT, Joshua Zhu <jzhu at openjdk.org> wrote:
> 
>> Hi,
>> 
>> I want to propose a new VectorAPI "Selective Store/Load" and share my
>> implementation. Currently Alibaba's internal databases are in the
>> process of adopting the VectorAPI, and they require "Selective
>> Store" for acceleration.
>> 
>> My proposed VectorAPI is declared as below [1]:
>> 
>>    int selectiveIntoArray($type$[] a, int offset, VectorMask<$Boxtype$> m);
>> 
>> The active elements (those with their respective bit set in the mask) are
>> stored contiguously into the array "a". Assuming N is the true count of the
>> mask, the elements from a[offset+N] up to a[offset+laneCount] are left
>> unchanged. The return value is the number of elements stored into the
>> array, and "offset + return value" is the new offset for the next iteration.
>> ![image](https://user-images.githubusercontent.com/70769035/131108509-3dcb61f3-e8d0-4b4e-9b49-a72c077aaba6.png)
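>> 
>> In scalar terms the intended behavior is roughly as follows (a sketch of the semantics only, not the proposed implementation):
>> 
>>    int selectiveIntoArray(int[] a, int offset, VectorMask<Integer> m) {
>>      int j = offset;
>>      for (int lane = 0; lane < length(); lane++) {
>>        if (m.laneIsSet(lane)) {
>>          a[j++] = lane(lane);   // lane(i): read lane i of this vector
>>        }
>>      }
>>      return j - offset;         // the number of elements stored
>>    }
>> 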
>> This API will be used in the following manner [2]:
>> 
>>    tld.conflict_cnt = 0;
>>    for (int i = 0; i < ARRAY_LENGTH; i += INT_PREFERRED_SPECIES.length()) {
>>      IntVector av = IntVector.fromArray(INT_PREFERRED_SPECIES, tld.int_input1, i);
>>      IntVector bv = IntVector.fromArray(INT_PREFERRED_SPECIES, tld.int_input2, i);
>>      IntVector cv = IntVector.fromArray(INT_PREFERRED_SPECIES, tld.int_index, i);
>>      VectorMask<Integer> mask = av.compare(VectorOperators.NE, bv);
>>      tld.conflict_cnt += cv.selectiveIntoArray(tld.conflict_array, tld.conflict_cnt, mask);
>>    }
>> 
>> My patch includes the following changes:
>>   * Selective Store VectorAPI for Long & Int
>>   * Assembler: add x86 instruction "VPCOMPRESSD" and "VPCOMPRESSQ"
>>   * Instruction selection: vselective_store; kmask_truecount (true count of kregister)
>>   * Add node "StoreVectorSelective"
>>   * Add a new parameter "is_selective" in inline_vector_mem_masked_operation()
>>     in order to distinguish the masked version from the selective version
>>   * jtreg cases
>>   * JMH benchmark
>> 
>> TODO parts I will implement:
>>   * Selective Store for other types
>>   * Selective Load
>>   * Some potential optimizations, such as: when the mask is allTrue, selectiveIntoArray() -> intoArray()
>> 
>> Test:
>>   * Passed VectorAPI jtreg cases.
>>   * Result of JMH benchmark to evaluate API's performance in Alibaba's real scenario.
>>       UseAVX=3; thread number = 8; conflict data percentage: 20% (that means 20% of mask bits are true)
>>       http://cr.openjdk.java.net/~jzhu/8273057/jmh_benchmark_result.pdf
>> 
>> [1] https://github.com/JoshuaZhuwj/panama-vector/commit/69623f7d6a1eae532576359328b96162d8e16837#diff-13cc2d6ec18e487ddae05cda671bdb6bb7ffd42ff7bc51a2e00c8c5e622bd55dR4667
>> [2] https://github.com/JoshuaZhuwj/panama-vector/commit/69623f7d6a1eae532576359328b96162d8e16837#diff-951d02bd72a931ac34bc85d1d4e656a14f8943e143fc9282b36b9c76c1893c0cR144
>> [3] failed to inline (intrinsic) by https://github.com/openjdk/panama-vector/blob/60aa8ca6dc0b3f1a3ee517db167f9660012858cd/src/hotspot/cpu/x86/x86.ad#L1769
>> 
>> Best Regards,
>> Joshua
> 
> Thanks to Paul, John Rose and Ningsheng for your comments.
> 
> A vector-to-vector compress operation is the friendlier primitive.
> Both the Intel AVX3 instruction "COMPRESS" and the Arm SVE instruction "COMPACT" provide this capability.
> Hence a selective store could be implemented as:
> 
>    VectorMask<Integer> mask = ...;
>    IntVector bv = av.compress(mask);
>    VectorMask<Integer> prefixMask = prefix(mask.trueCount());
>    bv.intoArray(array, offset, prefixMask);
>    offset += mask.trueCount();
> 
> The vector-to-vector compress primitive, together with a store under a prefix mask, could be further optimized into the memory-destination version of compress on architectures that support it.
> 
> On architectures that do not support vector-to-vector compress natively,
> the intrinsic bails out and the Java path is taken.
> The Java path can be composed as "mask -> shuffle -> rearrange", and C2 will then try to inline intrinsics again for these API calls.
> In this way, we are also able to implement scatter/gather store/load in the same compositional manner, keeping the design consistent.
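> 
> As a sketch of that Java path, the compress shuffle could be built from the mask in plain Java while the rearrange and masked store remain candidates for intrinsification (names as in the snippet above; SPECIES stands for the species of av):
> 
>    int[] idx = new int[SPECIES.length()];
>    int n = 0;
>    for (int lane = 0; lane < SPECIES.length(); lane++) {
>      if (mask.laneIsSet(lane)) idx[n++] = lane;     // mask -> shuffle indexes
>    }
>    IntVector bv = av.rearrange(VectorShuffle.fromValues(SPECIES, idx));
>    bv.intoArray(array, offset, SPECIES.indexInRange(0, n));
>    offset += n;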
> 
> Best Regards,
> Joshua
> 
> -------------
> 
> PR: https://git.openjdk.java.net/panama-vector/pull/115


