Improve the efficiency of VectorShuffle usage

Thu Dec 19 17:02:47 UTC 2024

Thanks a lot for your response,

Actually, we do not need to wrap at all on construction, a VectorShuffle is
a black box, from the creation of a VectorShuffle to its usage there is
somewhere in between when we do the wrap, and that is totally enough.

>From your response, I think a more reasonable proposal would be: On
VectorShuffle construction, we wrap all indices to [0, 2 * VLENGTH - 1]
(instead of the current model of wrapping oob indices to [-VLENGTH, -1])
and when using that VectorShuffle for a 1-operand rearrange, we wrap the
indices to [0, VLENGTH - 1].

Implementation-wise, we do not do any wrapping on construction, and for
operations that observe the VectorShuffle instance, we wrap at those
places. This allows us to reduce the number of wrapping to the minimum for
the most frequent operations. For other operations, the wrapping operation
is itself cheap and should be GVN-ed.

An important question is that what we should do with the hypothetical
2048-bit byte VectorShuffles. We cannot wrap those to [0, 2 * VLENGTH - 1]
because of implementation limitations. Should then we disallow all
operations that would observe that (toVector, laneSource, 2-operand
rearrange)?

Cheers,
Quan Anh

On Thu, 19 Dec 2024 at 03:45, Paul Sandoz <paul.sandoz at oracle.com> wrote:

> Hi Quan,
>
> I believe at the moment the selectFrom operations are currently more
> optimal. Even though we specify behavior in terms of rearrange we currently
> optimize differently. Ideally selectFrom would be implemented using the
> rearrange expression and that would be optimal for such use.
>
> In our prior iteration we focused on changing the behavior of the
> selectFrom and rearrange operations, and left out the shuffle conversions
> and factories knowing that we would need to address them later. It’s good
> you are bringing this topic up. John, I, and others had some prior
> discussions on what to do regarding identifying wrapped and partially
> wrapped state of a shuffle, but IIRC nothing definitive was concluded as to
> how to do this, e.g., something internal, such as an internal type or some
> state on shuffle.
>
> Regardless of the underlying representation and optimizing approach I
> think we need to have factories that by default wrap and/or consistently
> accept a wrapping argument.
>
> IIRC we discussed whether it was possible to implicitly detect in C2 if a
> shuffle is (fully) wrapped on construction and therefore we don’t need to
> rewrap on use. That might work for the loop-variant case you mention? but I
> imagine is harder if we use constant shuffles where it would be beneficial
> to clearly identify the representtion in compiled code (possibly to hoist
> out of loops?).
>
> We did not discuss a separation of public types, nor a distinction of use
> of a single type between the two cases. I admit to not being fond of either
> and would prefer a solution where shuffles could be used in either case
> regardless of their wrapping status in line to what is currently specified.
>
>
> I wonder if we will ever encounter 2048-bit size vectors on ARM cores in
> the industry, it’s still early days but convergence around 128-bit and
> maybe 256-bit seems to be the norm as vendors balance core size and power
> consumption. Perhaps it may be so for more specialized hardware, although
> it’s more likely in that case one would encounter a GPU or a tensor/matrix
> core instead. We shall see! However, that’s not to say we should ignore it,
> but IMO we should for now avoid giving it undue prominence.
>
>
> Separately, your proposal on the shuffle conversions and factories
> addresses the fact that Float/DoubleVector sits awkwardly with shuffles,
> such use likely indicates incorrect use. Java's type system is
> unfortunately not expressive enough to express the constraints we want so
> limiting conversion/construction to integral species seems like a good
> idea, with casting for use with float-point vectors (as in anyway required
> in some cases for masks).
>
> Paul.
>
>
> > On Dec 17, 2024, at 11:27 PM, Quân Anh Mai <anhmdq at gmail.com> wrote:
> >
> > Hi,
> >
> > I want to discuss how to improve the efficiency of creating and using a
> VectorShuffle.
> >
> > Currently, when a VectorShuffle is created, it will try to wrap all
> out-of-bound indices into the interval [-VLENGTH, -1]. However, this is
> useless most of the time, as the most frequent operation with a
> VectorShuffle, rearrange on a single vector, will wrap all indices to the
> interval [0, VLENGTH - 1] regardless. This may be noticeable in look up
> algorithms such as UTF-8 validation, as in those algorithms, the
> VectorShuffle is a loop-variant and will be computed in each iteration.
> This is a consequence of the fact that VectorShuffle is used for 2
> operations: 1-operand rearrange and 2-operand rearrange.
> >
> > As a result, I propose we add a field to VectorShuffle to discriminate
> instances which are used for 1-operand rearrange and instances which are
> used for 2-operand rearrange. For instances which are created for 1-operand
> rearrange, all indices are wrapped to [0, VLENGTH - 1] while for ones which
> are for 2-operand rearrange, the interval for indices to wrap to is [0, 2 *
> VLENGTH - 1]. Instances which are created for 1 operation must not be used
> for the other.
> >
> > This distinction is more preferable than just changing the VectorShuffle
> creation semantics so that elements are wrapped to the interval [0, 2 *
> VLENGTH - 1] because of 3 reasons:
> >
> > - It aligns the wrapping in VectorShuffle creation with the wrapping in
> VectorShuffle usage, which helps reduce 1 unnecessary wrapping.
> > - It is necessary to support 2048-bit SVE rearrange. As for those, a
> 1-operand byte rearrange is sensible while a 2-operand byte rearrange is
> not (there are 512 elements in a table from 2 vectors, which is larger than
> the index values themselves). While we can catch the usage in 2-operand
> rearrange, the semantics are muddy for other operations such as toVector or
> toString. This is because we inevitably lose information when casting the
> elements to the implementation-detail type. It would be confusing if, for
> all species, elements are wrapped to the interval [0, 2 * VLENGTH - 1]
> while suddenly for 2048-bit byte shuffles, the elements are wrapped to the
> interval [0, VLENGTH - 1]. As a result, forbidding creating such a
> VectorShuffle in the first place is a more sensible choice.
> > - Using a VectorShuffle for both 1-operand rearranges and 2-operand
> rearranges is questionable, as they have different semantics. If the users
> want to use 1 index vector that is sensible to be converted to both a
> 1-operand shuffle and a 2-operand shuffle. Then 2 conversion seems to be a
> more reasonable thing to do.
> >
> > API-wise, I propose removing Vector::toShuffle and adding 8 methods:
> >
> > <T> VectorShuffle<T> ByteVector::toShuffle(VectorSpecies<T> species)
> > <T> VectorShuffle<T> ByteVector::toLookUpIndices(VectorSpecies<T>
> species, int numTable)
> > <T> VectorShuffle<T> ShortVector::toShuffle(VectorSpecies<T> species)
> > <T> VectorShuffle<T> ShortVector::toLookUpIndices(VectorSpecies<T>
> species, int numTable)
> > <T> VectorShuffle<T> IntVector::toShuffle(VectorSpecies<T> species)
> > <T> VectorShuffle<T> IntVector::toLookUpIndices(VectorSpecies<T>
> species, int numTable)
> > <T> VectorShuffle<T> LongVector::toShuffle(VectorSpecies<T> species)
> > <T> VectorShuffle<T> LongVector::toLookUpIndices(VectorSpecies<T>
> species, int numTable)
> >
> > where species must have the same length as the receiver and numTable
> must be 1 or 2, XXXVector::toShuffle(species) is equivalent to
> XXXVector::toLookUpIndices(species, 1)
> >
> > On the opposite direction, I also propose removing
> VectorShuffle::toVector and adding VectorShuffle::toVector(VectorSpecies
> species) where species must be an integral vector with the same length as
> the receiver.
> >
> > Alternatively, we can split VectorShuffle into VectorShuffle and
> VectorLookUpIndices. The former can only be used for 1-operand rearrange
> while the latter can only be used for 2-operand rearrange. Personally, I
> prefer this approach as it gives us a stronger type safety compared to
> discriminating VectorShuffle based on a field.
> >
> > Please share your thoughts, thanks a lot,
> > Quan Anh
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/panama-dev/attachments/20241220/34e62317/attachment.htm>