Improve the efficiency of VectorShuffle usage

Wed Dec 18 20:45:29 UTC 2024

Hi Quan,

I believe at the moment the selectFrom operations are currently more optimal. Even though we specify behavior in terms of rearrange we currently optimize differently. Ideally selectFrom would be implemented using the rearrange expression and that would be optimal for such use.

In our prior iteration we focused on changing the behavior of the selectFrom and rearrange operations, and left out the shuffle conversions and factories knowing that we would need to address them later. It’s good you are bringing this topic up. John, I, and others had some prior discussions on what to do regarding identifying wrapped and partially wrapped state of a shuffle, but IIRC nothing definitive was concluded as to how to do this, e.g., something internal, such as an internal type or some state on shuffle.

Regardless of the underlying representation and optimizing approach I think we need to have factories that by default wrap and/or consistently accept a wrapping argument.

IIRC we discussed whether it was possible to implicitly detect in C2 if a shuffle is (fully) wrapped on construction and therefore we don’t need to rewrap on use. That might work for the loop-variant case you mention? but I imagine is harder if we use constant shuffles where it would be beneficial to clearly identify the representtion in compiled code (possibly to hoist out of loops?).

We did not discuss a separation of public types, nor a distinction of use of a single type between the two cases. I admit to not being fond of either and would prefer a solution where shuffles could be used in either case regardless of their wrapping status in line to what is currently specified.

I wonder if we will ever encounter 2048-bit size vectors on ARM cores in the industry, it’s still early days but convergence around 128-bit and maybe 256-bit seems to be the norm as vendors balance core size and power consumption. Perhaps it may be so for more specialized hardware, although it’s more likely in that case one would encounter a GPU or a tensor/matrix core instead. We shall see! However, that’s not to say we should ignore it, but IMO we should for now avoid giving it undue prominence.  

Separately, your proposal on the shuffle conversions and factories addresses the fact that Float/DoubleVector sits awkwardly with shuffles, such use likely indicates incorrect use. Java's type system is unfortunately not expressive enough to express the constraints we want so limiting conversion/construction to integral species seems like a good idea, with casting for use with float-point vectors (as in anyway required in some cases for masks).

Paul.

> On Dec 17, 2024, at 11:27 PM, Quân Anh Mai <anhmdq at gmail.com> wrote:
> 
> Hi,
> 
> I want to discuss how to improve the efficiency of creating and using a VectorShuffle.
> 
> Currently, when a VectorShuffle is created, it will try to wrap all out-of-bound indices into the interval [-VLENGTH, -1]. However, this is useless most of the time, as the most frequent operation with a VectorShuffle, rearrange on a single vector, will wrap all indices to the interval [0, VLENGTH - 1] regardless. This may be noticeable in look up algorithms such as UTF-8 validation, as in those algorithms, the VectorShuffle is a loop-variant and will be computed in each iteration. This is a consequence of the fact that VectorShuffle is used for 2 operations: 1-operand rearrange and 2-operand rearrange.
> 
> As a result, I propose we add a field to VectorShuffle to discriminate instances which are used for 1-operand rearrange and instances which are used for 2-operand rearrange. For instances which are created for 1-operand rearrange, all indices are wrapped to [0, VLENGTH - 1] while for ones which are for 2-operand rearrange, the interval for indices to wrap to is [0, 2 * VLENGTH - 1]. Instances which are created for 1 operation must not be used for the other.
> 
> This distinction is more preferable than just changing the VectorShuffle creation semantics so that elements are wrapped to the interval [0, 2 * VLENGTH - 1] because of 3 reasons:
> 
> - It aligns the wrapping in VectorShuffle creation with the wrapping in VectorShuffle usage, which helps reduce 1 unnecessary wrapping.
> - It is necessary to support 2048-bit SVE rearrange. As for those, a 1-operand byte rearrange is sensible while a 2-operand byte rearrange is not (there are 512 elements in a table from 2 vectors, which is larger than the index values themselves). While we can catch the usage in 2-operand rearrange, the semantics are muddy for other operations such as toVector or toString. This is because we inevitably lose information when casting the elements to the implementation-detail type. It would be confusing if, for all species, elements are wrapped to the interval [0, 2 * VLENGTH - 1] while suddenly for 2048-bit byte shuffles, the elements are wrapped to the interval [0, VLENGTH - 1]. As a result, forbidding creating such a VectorShuffle in the first place is a more sensible choice.
> - Using a VectorShuffle for both 1-operand rearranges and 2-operand rearranges is questionable, as they have different semantics. If the users want to use 1 index vector that is sensible to be converted to both a 1-operand shuffle and a 2-operand shuffle. Then 2 conversion seems to be a more reasonable thing to do.
> 
> API-wise, I propose removing Vector::toShuffle and adding 8 methods:
> 
> <T> VectorShuffle<T> ByteVector::toShuffle(VectorSpecies<T> species)
> <T> VectorShuffle<T> ByteVector::toLookUpIndices(VectorSpecies<T> species, int numTable)
> <T> VectorShuffle<T> ShortVector::toShuffle(VectorSpecies<T> species)
> <T> VectorShuffle<T> ShortVector::toLookUpIndices(VectorSpecies<T> species, int numTable)
> <T> VectorShuffle<T> IntVector::toShuffle(VectorSpecies<T> species)
> <T> VectorShuffle<T> IntVector::toLookUpIndices(VectorSpecies<T> species, int numTable)
> <T> VectorShuffle<T> LongVector::toShuffle(VectorSpecies<T> species)
> <T> VectorShuffle<T> LongVector::toLookUpIndices(VectorSpecies<T> species, int numTable)
> 
> where species must have the same length as the receiver and numTable must be 1 or 2, XXXVector::toShuffle(species) is equivalent to XXXVector::toLookUpIndices(species, 1)
> 
> On the opposite direction, I also propose removing VectorShuffle::toVector and adding VectorShuffle::toVector(VectorSpecies species) where species must be an integral vector with the same length as the receiver.
> 
> Alternatively, we can split VectorShuffle into VectorShuffle and VectorLookUpIndices. The former can only be used for 1-operand rearrange while the latter can only be used for 2-operand rearrange. Personally, I prefer this approach as it gives us a stronger type safety compared to discriminating VectorShuffle based on a field.
> 
> Please share your thoughts, thanks a lot,
> Quan Anh