VectorShuffle performance & usability feedback

Thu Jan 12 21:53:09 UTC 2023

Dear JDK/Panama Devs,

In order to get to know the *jdk.incubator.vector* API, I have been
tinkering with the vectorization of the complex fast-Fourier transform
(FFT) of tiny float arrays of length
*FloatVector.SPECIES_PREFERRED.length()*. In the process, I have
encountered some performance and usability issues which I want to share, in
the hopes these might be addressed in a future iteration of the Vector API.
My perspective is purely one of an API user, since I have no knowledge of
the underlying AVX/SSE instruction sets. Actual knowledge of the FFT should
not be necessary for this discussion. The performance and usability issues
come down to a single line of code which I will get to in a second.

As an experiment, I have taken a standard FFT implementation (without
bit-reversal permutation) fftV1
<https://github.com/DirkToewe/tiny_vectorized_fft/blob/master/src/main/java/tiny_vectorized_fft/TinyFFT.java#L81-L112>
and
converted it into a vectorized version fftV2
<https://github.com/DirkToewe/tiny_vectorized_fft/blob/master/src/main/java/tiny_vectorized_fft/TinyFFT.java#L114-L136>
which
performs consecutive rearrangements and fma operatons using precomputed
*VectorShuffles* and *FloatVectors*. And fair enough, according to my
benchmarks
<https://github.com/DirkToewe/tiny_vectorized_fft/blob/master/src/main/java/tiny_vectorized_fft/TinyFFT_Benchmark.java>,
the vectorized version takes roughly 40% less time using AVX2 on an AMD R9
3900X.

Next, I have tried to be "clever". In a second vectorized variant fftV3
<https://github.com/DirkToewe/tiny_vectorized_fft/blob/master/src/main/java/tiny_vectorized_fft/TinyFFT.java#L138-L160>,
every second *VectorShuffle *is computed from every first of two
*VectorShuffles*. In (my) theory, this should reduce the amount of cache
required. Transforming a *VectorShuffle*, however, turned out much more
cumbersome than I had hoped. The cleanest way I could come up with was the
following:

   - Find a *VectorSpecies<Byte>* with the same lane count as
   *FloatVector.SPECIES_PREFERRED*
   - Cast the *VectorShuffle *to said species
   - Convert using *VectorShuffle::toVector*
   - Apply the transformation
   - Convert back using ByteVector*.toShuffle()*
   - Cast *VectorShuffle<Byte> *back to *VectorShuffle<Double>*

The line of code in question can be found here
<https://github.com/DirkToewe/tiny_vectorized_fft/blob/master/src/main/java/tiny_vectorized_fft/TinyFFT.java#L147>.
I have tried many other implementations of said line of code, all of which
have been even more spaghetti-ish. All variants had one thing in common:
Terrible performance. fftV3 takes twice as long to compute as the
non-vectorized fftV1. My experience with these usability and performance
issues lead me to the following suggestions/questions:

*1) VectorShuffle should behave like or even be an IntVector*
In my mind, a *VectorShuffle* is nothing but a vector of indices. It would
greatly simplify the API if it just were an *IntVector* or a *ShortVector*.
If that is not possible, it would be great if it would at least support
lanewise integral operations like ADD, SUB, OR, XOR,...

*2) Improve species selection*
The aforementioned spaghetti code required me to find a a
*VectorSpecies<Byte>* with the same lane count as
*FloatVector.SPECIES_PREFERRED*. To achieve that, I had to resort to
reflection
<https://github.com/DirkToewe/tiny_vectorized_fft/blob/master/src/main/java/tiny_vectorized_fft/TinyFFT.java#L17-L31>,
which is never a good sign. Additional methods for species selection would
be really helpful, the following come to mind:

   - Add a static methods along the lines of
   *FloatVector.speciesWithLaneCount(int)*
   - Add a conversion methods to each species, something like
   *VectorSpecies<Float>::toIntSpecies()*
   - Add static methods, e.g. *FloatVector.availableSpecies()*, that list
   all available species

*3) Why is VectorSpecies type-dependent?*
As my example above demonstrates, it is sometimes necessary to work with
different vector types of the same lane count. If *VectorSpecies *was not
type-dependent, different types could simply be handled by a single
species. This would also make *VectorShuffle *type-independent.
*VectorShuffle::toVector *could always return an *IntVector* instead of
e.g. *VectorShuffle<Double>::toVector* returning a *DoubleVector* which is,
in my opinion, unintuitive.

*4) Add VectorSpecies::loopBoundOptimized(int)*
On an unrelated note: *VectorSpecies::loopBound(int) *always returns the
largest possible bound. While processing 9 elements with an 8-lane species,
zero might still be the best loop bound. If a JRE does not support
vectorization, zero might in fact always be the best loop bound. A method
along the lines of *VectorSpecies::loopBoundOptimized(int)* could allow the
runtime more flexibility to determine an optimal loop bound, maybe even
based on runtime benchmarks.

Hopefully, these suggestions and questions are constructive

Yours sincerely
Dirk
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20230112/6c72ecd8/attachment-0001.htm>