VectorShuffle performance & usability feedback

Xiaohong Gong Xiaohong.Gong at arm.com
Wed Feb 1 03:37:14 UTC 2023


Hi,

Yes, *VectorShuffle* still has room for improvement, for example *VectorShuffle.fromArray()*. It can be optimized by adding vector intrinsic support so that it ultimately generates SIMD instructions. This is already in our optimization plan.
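
For reference, here is a minimal sketch of the kind of *VectorShuffle.fromArray()* use I have in mind. It is not taken from fftv2; the class name, the method name and the lane-reversal pattern are only assumptions for illustration.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorShuffle;
import jdk.incubator.vector.VectorSpecies;

public class ShuffleFromArraySketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Reverses the elements inside each full vector-sized block of src.
    // Elements in the tail (fewer than one vector) are copied unchanged.
    static float[] reverseBlocks(float[] src) {
        int lanes = SPECIES.length();

        // Source lane indexes for the shuffle: lane i reads from lane (lanes - 1 - i).
        int[] indexes = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            indexes[i] = lanes - 1 - i;
        }
        // This is the call that is not yet backed by a vector intrinsic.
        VectorShuffle<Float> reverse = VectorShuffle.fromArray(SPECIES, indexes, 0);

        float[] dst = new float[src.length];
        int i = 0;
        int bound = SPECIES.loopBound(src.length);
        for (; i < bound; i += lanes) {
            FloatVector.fromArray(SPECIES, src, i)
                       .rearrange(reverse)
                       .intoArray(dst, i);
        }
        for (; i < src.length; i++) {
            dst[i] = src[i];
        }
        return dst;
    }
}
```

In this sketch the shuffle is created once outside the loop, so the cost of *fromArray()* matters mostly when shuffles have to be created frequently.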

Regarding the example code in fftv2, *FloatVector* and *VectorShuffle* instances are stored in static final field arrays. Each use of such an instance in an API call requires vector unboxing in every loop iteration, which could be eliminated by making the instances local variables (e.g. by calling FloatVector.fromArray() inside the loop). I think this can save several memory load operations.
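
A minimal sketch of that suggestion, assuming the constants can be kept as plain float[] data rather than as a FloatVector[] (the class, method and parameter names here are hypothetical, not the ones used in fftv2):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class LocalVectorLoadSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Multiplies data[i] by factors[i] in place. The factors stay in a
    // primitive array and are loaded into a local FloatVector inside the
    // loop, instead of being read from a static final FloatVector[] field.
    static void scaleInPlace(float[] data, float[] factors) {
        int i = 0;
        int bound = SPECIES.loopBound(data.length);
        for (; i < bound; i += SPECIES.length()) {
            FloatVector v = FloatVector.fromArray(SPECIES, data, i);
            FloatVector f = FloatVector.fromArray(SPECIES, factors, i);
            v.mul(f).intoArray(data, i);
        }
        // Scalar tail for the remaining elements.
        for (; i < data.length; i++) {
            data[i] *= factors[i];
        }
    }
}
```

Whether this is actually faster for fftv2 would need to be measured; the sketch only shows the shape of the change.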

Besides, two *FloatVector* instances (im and re) are defined before a loop, updated inside the loop, and used after the loop. This may require vector boxing inside the loop if the loop is compiled by OSR. Note that vector boxing is the main cause of poor Vector API performance. So maybe you could try again with "-XX:-UseOnStackReplacement", which disables OSR compilation. This is not an issue for me, and it may be optimized by Valhalla value types in the future.
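
The pattern I am referring to looks roughly like the sketch below. It is only an illustration of loop-carried vector values; the complex dot product and all names in it are my assumptions and not the fftv2 code.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class LoopCarriedVectorSketch {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Sums the complex products (ar + i*ai) * (br + i*bi) over all elements
    // and returns {real part, imaginary part}.
    static float[] complexDot(float[] ar, float[] ai, float[] br, float[] bi) {
        // re and im are defined before the loop ...
        FloatVector re = FloatVector.zero(SPECIES);
        FloatVector im = FloatVector.zero(SPECIES);
        int i = 0;
        int bound = SPECIES.loopBound(ar.length);
        for (; i < bound; i += SPECIES.length()) {
            FloatVector xr = FloatVector.fromArray(SPECIES, ar, i);
            FloatVector xi = FloatVector.fromArray(SPECIES, ai, i);
            FloatVector yr = FloatVector.fromArray(SPECIES, br, i);
            FloatVector yi = FloatVector.fromArray(SPECIES, bi, i);
            // ... updated inside the loop ...
            re = re.add(xr.mul(yr)).sub(xi.mul(yi));
            im = im.add(xr.mul(yi)).add(xi.mul(yr));
        }
        // ... and used after the loop.
        float reSum = re.reduceLanes(VectorOperators.ADD);
        float imSum = im.reduceLanes(VectorOperators.ADD);
        // Scalar tail for the remaining elements.
        for (; i < ar.length; i++) {
            reSum += ar[i] * br[i] - ai[i] * bi[i];
            imSum += ar[i] * bi[i] + ai[i] * br[i];
        }
        return new float[] { reSum, imSum };
    }
}
```

To compare, a benchmark using such a loop could be run once normally and once with "-XX:-UseOnStackReplacement" added to the java command line (together with "--add-modules jdk.incubator.vector").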

Thanks,
Xiaohong
 
> 
> Thanks for your feedback on an interesting use-case.
> 
> 
> - I think you can avoid reflection by doing this
> 
>   VectorSpecies<Float> floatSpecies = FloatVector.SPECIES_PREFERRED;
>   VectorShape byteShape = VectorShape.forBitSize(floatSpecies.length() * 8); // <- throws if shape is not available
>   VectorSpecies<Byte> byteSpecies = VectorSpecies.of(byte.class, byteShape);
>   assert byteSpecies.length() == floatSpecies.length();
> 
> 
> - The integral lane element values of a VectorShuffle are in effect "clamped" and those of an IntVector are not, and further it's not possible to represent an IntVector holding the required number of lanes for indexing lane elements for all shapes of ByteVector.
> (We also have some trouble with the internal representation of VectorShuffle for large vector sizes, such as may be applicable in ARM SVE, although in practice such sizes are likely very rare.)
> 
> 
> - So far we have avoided adding the set of integral operations on VectorShuffle (and dealing with the interval constraints) since it's possible to lean into conversions, but unfortunately you are running into some performance potholes. The VectorShuffle::cast method is not yet optimized, sorry. The VectorShuffle::toVector method is, but is alas of no use to you for this use-case.
> 
> 
> - Your point about "If a JRE does not support vectorization" is well taken. Note that querying the preferred species will give a species whose vector size is supported on the platform, but there is no way to query if a particular species is optimally supported and if so what operations are also supported. Ideally we would prefer developers not have to reason about this and for the implementation to fall back efficiently, but that is not always practical, especially for a low-level API, so some sort of querying may be necessary. IMO I don't think loop bound queries are the right place to surface this feature.
> 
> 
> - Species is the factory for the VM to define both the vector size and the lane element type, from that the number of lanes is known and the VM knows what instructions to generate. I think what you are suggesting is that a VectorShuffle be just lane specific with a species lane type (I suppose the same could apply to masks). Any binding to a species with the required number of lanes would be an implementation detail. Any lane-based VectorShuffle could be used with vectors whose species have the same number of lanes. I would need to discuss the possibilities with other team members, but it's potentially a non-trivial change.
> (If we can fix the VectorShuffle cast performance pothole then this really comes down to an API design question, assuming it's possible for the VM to easily optimize.)
> 
> 
> Overall VectorShuffle needs a little more TLC.
> 
> Sandhya, I think it is worth looking at the code generated from the benchmark for fftV2 and checking whether there are any dangling optimization issues (e.g. constant loads from the VectorShuffle arrays).
> 
> Paul.


More information about the panama-dev mailing list