VectorShuffle performance & usability feedback

Thu Jan 26 20:50:21 UTC 2023

Hi Dirk,

Thanks for your feedback on an interesting use-case.

- I think you can avoid reflection by doing this:

  VectorSpecies<Float> floatSpecies = FloatVector.SPECIES_PREFERRED;
  VectorShape byteShape = VectorShape.forBitSize(floatSpecies.length() * 8); // <- throws if shape is not available
  VectorSpecies<Byte> byteSpecies = VectorSpecies.of(byte.class, byteShape);
  assert byteSpecies.length() == floatSpecies.length();

- The integral lane element values of a VectorShuffle are in effect "clamped" and those of an IntVector are not, and further it's not possible to represent an IntVector holding the required number of lanes for indexing lane elements for all shapes of ByteVector.
(We also have some trouble with the internal representation of VectorShuffle for large vector sizes, such as may be applicable in ARM SVE, although in practice those are likely very rare.)
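
To make the lane-count point concrete, here is a rough sketch in plain Java arithmetic (no Vector API calls). The class and method names are made up for illustration, and it only considers the four fixed shape sizes of 64, 128, 256 and 512 bits, leaving out the platform-dependent maximum shape:

```java
public class IndexVectorWidth {
    // Fixed shape sizes in bits; the platform-dependent maximum shape is deliberately left out.
    static final int[] FIXED_SHAPE_BITS = {64, 128, 256, 512};

    // Bits an int-lane vector would need to have as many lanes as a byte-lane vector of the given size.
    static int intVectorBitsFor(int byteVectorBits) {
        int lanes = byteVectorBits / Byte.SIZE;
        return lanes * Integer.SIZE;
    }

    // True if the given size is one of the fixed shape sizes listed above.
    static boolean isFixedShape(int bits) {
        for (int shapeBits : FIXED_SHAPE_BITS) {
            if (shapeBits == bits) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (int bits : FIXED_SHAPE_BITS) {
            int needed = intVectorBitsFor(bits);
            System.out.printf("byte vector of %d bits: %d lanes, int index vector needs %d bits (fixed shape: %b)%n",
                    bits, bits / Byte.SIZE, needed, isFixedShape(needed));
        }
    }
}
```

A byte vector always has four times as many lanes as an int vector of the same bit size, so only the smaller byte shapes have an int counterpart with a matching lane count.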

- So far we have avoided adding the set of integral operations on VectorShuffle (and dealing with the interval constraints) since it's possible to lean into conversions, but unfortunately you are running into some performance potholes. The VectorShuffle::cast method is not yet optimized, sorry. VectorShuffle::toVector is, but is alas of no use to you for this use-case.
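
For reference, the conversion round trip being discussed can be sketched roughly as below. This is only an illustration, not the code from the benchmark: the class name, the helper name and the XOR-by-one transformation are invented for the example, and it pins the float species to 256 bits so that an 8-lane byte species exists. It needs --add-modules jdk.incubator.vector to compile and run.

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorShape;
import jdk.incubator.vector.VectorShuffle;
import jdk.incubator.vector.VectorSpecies;

import java.util.Arrays;

public class ShuffleRoundTrip {
    // Assumption for the sketch: a fixed 256-bit float species (8 lanes) instead of SPECIES_PREFERRED.
    static final VectorSpecies<Float> FLOAT_SPECIES = FloatVector.SPECIES_256;

    // Byte species with the same lane count; forBitSize throws if no such shape exists.
    static final VectorSpecies<Byte> BYTE_SPECIES =
            VectorSpecies.of(byte.class, VectorShape.forBitSize(FLOAT_SPECIES.length() * Byte.SIZE));

    // Hypothetical transformation: flip the lowest bit of every source index.
    static VectorShuffle<Float> swapNeighbours(VectorShuffle<Float> shuffle) {
        // cast to the byte species, then view the indices as a vector
        ByteVector indices = (ByteVector) shuffle.cast(BYTE_SPECIES).toVector();
        // apply an integral lanewise operation to the indices
        ByteVector swapped = indices.lanewise(VectorOperators.XOR, (byte) 1);
        // convert back to a shuffle and cast to the float species
        return swapped.toShuffle().cast(FLOAT_SPECIES);
    }

    public static void main(String[] args) {
        VectorShuffle<Float> identity = VectorShuffle.iota(FLOAT_SPECIES, 0, 1, true);
        System.out.println(Arrays.toString(swapNeighbours(identity).toArray()));
    }
}
```

The two cast calls in this chain are the part that is not yet optimized.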

- Your point about "If a JRE does not support vectorization" is well taken. Note that querying the preferred species will give a species whose vector size is supported on the platform, but there is no way to query if a particular species is optimally supported and if so what operations are also supported. Ideally we would prefer developers not have to reason about this and for the implementation to fall back efficiently, but that is not always practical, especially for a low-level API, so some sort of querying may be necessary. IMO I don't think loop bound queries are the right place to surface this feature.
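
As a reminder of what the existing loop bound computes, here is a small plain-Java sketch that mirrors the arithmetic documented for VectorSpecies::loopBound, namely length - floorMod(length, VLENGTH). The class and the two-argument helper are stand-ins written for this example, not part of the API:

```java
public class LoopBoundSketch {
    // Largest multiple of laneCount that is less than or equal to length,
    // following the formula documented for VectorSpecies::loopBound.
    static int loopBound(int length, int laneCount) {
        return length - Math.floorMod(length, laneCount);
    }

    public static void main(String[] args) {
        int laneCount = 8; // e.g. an 8-lane float species
        for (int length : new int[] {7, 8, 9, 16, 17}) {
            int bound = loopBound(length, laneCount);
            System.out.printf("length=%d: vector loop covers [0, %d), scalar tail covers [%d, %d)%n",
                    length, bound, bound, length);
        }
    }
}
```

For 9 elements and 8 lanes the bound is 8, so the vector loop runs once and a single element is left for the scalar tail. The result depends only on the length and the lane count, not on how well the platform supports the species.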

- Species is the factory for the VM to define both the vector size and the lane element type, from that the number of lanes is known and the VM knows what instructions to generate. I think what you are suggesting is that a VectorShuffle be just lane specific with a species lane type (I suppose the same could apply to masks). Any binding to a species with the required number of lanes would be an implementation detail. Any lane-based VectorShuffle could be used with vectors whose species have the same number of lanes. I would need to discuss the possibilities with other team members, but it's potentially a non-trivial change.
(If we can fix the VectorShuffle cast performance pothole then this really comes down to an API design question, assuming it's possible for the VM to easily optimize.)

Overall VectorShuffle needs a little more TLC.

Sandhya, I think it is worth looking at the code generated from the benchmark for fftV2 and checking whether there are any dangling optimization issues (e.g. constant loads from the VectorShuffle arrays).

Paul.  

> On Jan 12, 2023, at 1:53 PM, Dirk Toewe <dirktoewe at gmail.com> wrote:
> 
> Dear JDK/Panama Devs,
> 
> In order to get to know the jdk.incubator.vector API, I have been tinkering with the vectorization of the complex fast-Fourier transform (FFT) of tiny float arrays of length FloatVector.SPECIES_PREFERRED.length(). In the process, I have encountered some performance and usability issues which I want to share, in the hopes these might be addressed in a future iteration of the Vector API. My perspective is purely one of an API user, since I have no knowledge of the underlying AVX/SSE instruction sets. Actual knowledge of the FFT should not be necessary for this discussion. The performance and usability issues come down to a single line of code which I will get to in a second.
> 
> As an experiment, I have taken a standard FFT implementation (without bit-reversal permutation) fftV1 and converted it into a vectorized version fftV2 which performs consecutive rearrangements and fma operations using precomputed VectorShuffles and FloatVectors. And fair enough, according to my benchmarks, the vectorized version takes roughly 40% less time using AVX2 on an AMD R9 3900X.
> 
> Next, I have tried to be "clever". In a second vectorized variant fftV3, every second VectorShuffle is computed from every first of two VectorShuffles. In (my) theory, this should reduce the amount of cache required. Transforming a VectorShuffle, however, turned out much more cumbersome than I had hoped. The cleanest way I could come up with was the following:
> 	• Find a VectorSpecies<Byte> with the same lane count as FloatVector.SPECIES_PREFERRED
> 	• Cast the VectorShuffle to said species
> 	• Convert using VectorShuffle::toVector
> 	• Apply the transformation
> 	• Convert back using ByteVector.toShuffle()
> 	• Cast VectorShuffle<Byte> back to VectorShuffle<Double>
> The line of code in question can be found here. I have tried many other implementations of said line of code, all of which have been even more spaghetti-ish. All variants had one thing in common: terrible performance. fftV3 takes twice as long to compute as the non-vectorized fftV1. My experience with these usability and performance issues led me to the following suggestions/questions:
> 
> 1) VectorShuffle should behave like or even be an IntVector
> In my mind, a VectorShuffle is nothing but a vector of indices. It would greatly simplify the API if it just were an IntVector or a ShortVector. If that is not possible, it would be great if it would at least support lanewise integral operations like ADD, SUB, OR, XOR,...
> 
> 2) Improve species selection
> The aforementioned spaghetti code required me to find a VectorSpecies<Byte> with the same lane count as FloatVector.SPECIES_PREFERRED. To achieve that, I had to resort to reflection, which is never a good sign. Additional methods for species selection would be really helpful; the following come to mind:
> 	• Add static methods along the lines of FloatVector.speciesWithLaneCount(int)
> 	• Add conversion methods to each species, something like VectorSpecies<Float>::toIntSpecies()
> 	• Add static methods, e.g. FloatVector.availableSpecies(), that list all available species
> 
> 3) Why is VectorSpecies type-dependent?
> As my example above demonstrates, it is sometimes necessary to work with different vector types of the same lane count. If VectorSpecies was not type-dependent, different types could simply be handled by a single species. This would also make VectorShuffle type-independent. VectorShuffle::toVector could always return an IntVector instead of e.g. VectorShuffle<Double>::toVector returning a DoubleVector which is, in my opinion, unintuitive.
> 
> 4) Add VectorSpecies::loopBoundOptimized(int)
> On an unrelated note: VectorSpecies::loopBound(int) always returns the largest possible bound. While processing 9 elements with an 8-lane species, zero might still be the best loop bound. If a JRE does not support vectorization, zero might in fact always be the best loop bound. A method along the lines of VectorSpecies::loopBoundOptimized(int) could allow the runtime more flexibility to determine an optimal loop bound, maybe even based on runtime benchmarks.
> 
> Hopefully, these suggestions and questions are constructive.
> 
> Yours sincerely
> Dirk