[vectorIntrinsics+mask] RFR: 8273057: [vector] New VectorAPI "SelectiveStore"

Sat Sep 4 18:56:11 UTC 2021

P.S. Some googly references that seem useful to me:

https://en.wikipedia.org/wiki/Prefix_sum
https://www.cs.princeton.edu/courses/archive/fall21/cos326/lec/21-02-parallel-prefix-scan.pdf
https://www.cs.cmu.edu/~guyb/papers/Ble93.pdf (from Connection Machine days, still relevant if you translate terms)

You can find your own easily, of course.  I suppose there are plenty of GPU people who have rediscovered this stuff recently.  Appel traces the basics back to 1977.

So here’s a basic tool for our toolkit: Keep an eye out for segmented scans and reductions, even in disguise (say, as nested or grouped parallelism).  Use them to turn brute-force iteration into log-N data-parallel operations.  (Will the hardware reward your rewrite of the algorithm?  One may hope…  Sometimes it does.)
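
To make that concrete, here is a minimal sketch of a segmented inclusive sum done in log-N data-parallel steps (Hillis-Steele style, as described in the prefix-sum references above). It is plain scalar Java, not Vector API code; the class name, method name, and the head-flag encoding (true marks the first element of a segment) are assumptions made for illustration. Each pass of the inner loop is independent across i, which is the part a vector unit could in principle execute lane-wise.

```java
import java.util.Arrays;

// Illustrative sketch only: names and the head-flag convention are hypothetical.
final class SegmentedScan {

    // Inclusive sum within each segment. heads[i] == true marks the start of a
    // segment; index 0 is treated as a segment start whether or not it is flagged.
    static int[] segmentedInclusiveSum(int[] values, boolean[] heads) {
        int n = values.length;
        int[] v = values.clone();
        boolean[] f = heads.clone();
        // About log2(n) passes; the stride d doubles each time.
        for (int d = 1; d < n; d <<= 1) {
            int[] nv = v.clone();       // double-buffer so every i reads the old state
            boolean[] nf = f.clone();
            for (int i = d; i < n; i++) {   // iterations are independent of each other
                if (!f[i]) {
                    // No segment start seen in (i-d, i] yet: pull in the partial sum.
                    nv[i] = v[i - d] + v[i];
                }
                // Remember whether a segment start lies anywhere in the covered range.
                nf[i] = f[i] | f[i - d];
            }
            v = nv;
            f = nf;
        }
        return v;
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4, 5, 6};
        boolean[] heads = {true, false, false, true, false, false};
        // Two segments: {1,2,3} and {4,5,6}.
        System.out.println(Arrays.toString(segmentedInclusiveSum(values, heads)));
        // prints [1, 3, 6, 4, 9, 15]
    }
}
```

The sequential loop does N additions; this version does roughly N log N, so it only pays off when each pass really runs in parallel. That is the "will the hardware reward you" question again.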