[vectorIntrinsics+mask] RFR: 8273057: [vector] New VectorAPI "SelectiveStore"

John Rose john.r.rose at oracle.com
Sat Sep 4 18:48:19 UTC 2021


On Sep 3, 2021, at 1:11 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:

I can see how a prefix-sum might behave with a mask describing the selection of lanes *and* compression of the result (no intermediate values, either from the input or zero).

Prefix sums (aka scans) are surprisingly powerful.

 - scans can use any associative operator (not just sum:  min, max, first-non-zero, etc.)
 - scans can be implemented in log time (reduction “up-sweep” and computation “down-sweep” on lane spanning tree)
 - scans come in “inclusive” and “exclusive” forms (aka Hillis/Steele and Blelloch; Blelloch calls his a “prescan”)
 - scale naturally to arbitrary lengths
 - partition naturally into vectorizable sub-lengths (N.B. requires carry-in and carry-out feature!)
 - provide a good primitive for “nested SIMD” [1]
 - often provide useful “steering” information for subsequent loads/stores or shuffles/anti-shuffles [2]
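
Here is a scalar sketch of the first few points (the real thing would be vectorized; the names here are illustrative, not the proposed API).  An inclusive scan keeps each running value, an exclusive scan shifts it by one lane, and the carry-in/carry-out convention is what lets an arbitrary-length scan partition into fixed-size blocks:

```java
import java.util.Arrays;
import java.util.function.IntBinaryOperator;

public class Scans {

    // Inclusive scan: out[i] = carryIn op a[0] op ... op a[i].
    // carryIn should be the identity of `op` (0 for sum, Integer.MIN_VALUE for max).
    static int[] inclusiveScan(int[] a, IntBinaryOperator op, int carryIn) {
        int[] out = new int[a.length];
        int acc = carryIn;
        for (int i = 0; i < a.length; i++) {
            acc = op.applyAsInt(acc, a[i]);
            out[i] = acc;
        }
        return out;
    }

    // Exclusive scan (Blelloch's "prescan"): out[i] = carryIn op a[0] op ... op a[i-1].
    static int[] exclusiveScan(int[] a, IntBinaryOperator op, int carryIn) {
        int[] out = new int[a.length];
        int acc = carryIn;
        for (int i = 0; i < a.length; i++) {
            out[i] = acc;
            acc = op.applyAsInt(acc, a[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] a = {3, 1, 4, 1, 5};
        System.out.println(Arrays.toString(inclusiveScan(a, Integer::sum, 0))); // [3, 4, 8, 9, 14]
        System.out.println(Arrays.toString(exclusiveScan(a, Integer::sum, 0))); // [0, 3, 4, 8, 9]

        // Partitioning into sub-lengths: the carry-out of one chunk (its last
        // inclusive value) feeds the next chunk's carry-in, so an arbitrary-length
        // scan decomposes into fixed-size, vectorizable blocks.
        int[] lo = {3, 1}, hi = {4, 1, 5};
        int carryOut = inclusiveScan(lo, Integer::sum, 0)[lo.length - 1];          // 4
        System.out.println(Arrays.toString(inclusiveScan(hi, Integer::sum, carryOut))); // [8, 9, 14]
    }
}
```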

[1] To perform a data-parallel/SIMD operation simultaneously on N SIMD data sets {S[i]}, first concatenate them all into a single SIMD data set G (unravel an array into a stream of values, for example), keeping elements in the same original data set contiguous in the new data set, and add a new boolean vector, a “boundary mask” which marks boundaries of the original data sets within the new data set.  (This does not need a group field in [1..N], just a boolean, because groups are contiguous in the big set.)  Do SIMD on the whole thing.  When performing scans or reduces, use the segmented variation of the scan or reduce, which consults the boundary mask and prevents carry-ins and carry-outs across boundaries.  Reductions and carry-outs from up-sweeps go into sparse SIMD data sets which can be quickly compressed to N-vectors, carrying the per-group results.  Collecting per-group results is where compress shines.  The procedure outlined here is very robust across group sizes:  It basically works the same, and with the same efficiency, whether N=1 or N=|G| and regardless of the statistics of the group sizes |S[i]|.
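A minimal scalar sketch of the segmented scan described above, using a sum as the operator (the vector form would apply the same carry-blocking rule lane-wise; names are illustrative):

```java
import java.util.Arrays;

// Segmented inclusive sum scan over the concatenated set G.
// boundary[i] == true marks the first element of an original data set S[i];
// the accumulator resets there, so no carry crosses a group boundary.
public class SegmentedScan {

    static int[] segmentedInclusiveSum(int[] g, boolean[] boundary) {
        int[] out = new int[g.length];
        int acc = 0;
        for (int i = 0; i < g.length; i++) {
            if (boundary[i]) acc = 0;  // boundary mask blocks the carry-in
            acc += g[i];
            out[i] = acc;
        }
        return out;
    }

    public static void main(String[] args) {
        // G = S1 ++ S2 with S1 = {1, 2, 3} and S2 = {10, 20}.
        int[] g = {1, 2, 3, 10, 20};
        boolean[] b = {true, false, false, true, false};
        System.out.println(Arrays.toString(segmentedInclusiveSum(g, b))); // [1, 3, 6, 10, 30]
        // The per-group reductions sit in the last lane of each group ({6, 30}
        // here): a sparse result that a compress step would pack into an N-vector.
    }
}
```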

[2] When you have an ordering problem, like vectorized sort or compress, look first for a scan pre-pass that you could use to steer a data-movement pass.  I find this often clarifies the problem and suggests new vectorization opportunities.
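For compress specifically, the steering scan is an exclusive prefix sum over the selection mask (treated as 0/1): it assigns each selected lane its destination index in the packed output.  A scalar sketch, with illustrative names:

```java
import java.util.Arrays;

// Compress via scan steering: pass 1 computes destination indices with an
// exclusive prefix sum of the mask; pass 2 moves data to those indices.
public class ScanSteeredCompress {

    static int[] compress(int[] a, boolean[] mask) {
        int[] dest = new int[a.length];   // exclusive scan of the mask as 0/1
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            dest[i] = count;
            if (mask[i]) count++;
        }
        int[] out = new int[count];       // data-movement pass steered by the scan
        for (int i = 0; i < a.length; i++) {
            if (mask[i]) out[dest[i]] = a[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] a = {7, 8, 9, 10};
        boolean[] m = {true, false, true, true};
        System.out.println(Arrays.toString(compress(a, m))); // [7, 9, 10]
    }
}
```

The same pre-pass also yields the store length (the scan's carry-out), which is what a "SelectiveStore" needs to advance its output pointer.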


