Register blocking using Vector API.

Tue May 14 06:44:10 UTC 2024

Hi.
Thank you for your answers and help, guys.

>Synthesizing larger logical
>vector registers from multiple physical vector registers by
>treating several physical registers as a single logical
>register.  Logically, they could be regarded as being
>concatenated (extra-long logical register) or stacked
>in a new direction (2D tile register).

That is exactly what I meant. I hope we will see this API eventually. Thank
you for the updates.

On Sat, May 11, 2024 at 1:23 AM John Rose <john.r.rose at oracle.com> wrote:

> On 10 May 2024, at 13:14, Paul Sandoz wrote:
>
> > There are no plans to implement such explicit register blocking.
>
> There are no plans, but there have been some discussions,
> assuming I know what Andrii means by the term “register
> blocking”.  It might mean this: Synthesizing larger logical
> vector registers from multiple physical vector registers by
> treating several physical registers as a single logical
> register.  Logically, they could be regarded as being
> concatenated (extra-long logical register) or stacked
> in a new direction (2D tile register).
>
> (Note:  Their memory layouts would be contiguous.
> But their registers would be independently allocated
> by the register allocator.  In some cases, one
> physical segment of a long logical vector might be
> live while another is dead.  We might see this in
> carefully tuned hand-written assembler.)
>
> You can see hints of logical vectors in C APIs for vectors
> that provide 2x/3x/4x/5x register types.  They have many
> use cases, including linear algebra.  ARM has int16x8x2_t.
>
> A hardware vector with VLEN lanes lets you manually
> unroll a loop VLEN ways.  A logical vector with 2VLEN
> lanes lets you unroll the loop twice as much.  If you
> have enough physical registers to unroll by 2VLEN without
> spilling, it might be a win to shape the loop that way.
>
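To make the unrolling point concrete, here is a minimal sketch in plain Java (not the Vector API itself). It emulates a logical vector as an array of accumulators, so the lane count can be "plugged in" as a parameter. The class name, the `dot` method, and the assumed hardware lane count of 8 are all hypothetical, chosen only for illustration.

```java
final class LogicalLaneSketch {
    // Dot product whose unroll factor is the "logical lane count".
    // lanes == VLEN models one hardware vector; lanes == 2 * VLEN models
    // a logical vector synthesized from two physical registers.
    static long dot(int[] a, int[] b, int lanes) {
        long[] acc = new long[lanes];               // one accumulator per lane
        int bound = a.length - (a.length % lanes);  // largest multiple of lanes
        int i = 0;
        for (; i < bound; i += lanes) {
            for (int l = 0; l < lanes; l++) {       // lanewise multiply-add
                acc[l] += (long) a[i + l] * b[i + l];
            }
        }
        long sum = 0;
        for (long v : acc) {                        // full reduce of the lanes
            sum += v;
        }
        for (; i < a.length; i++) {                 // scalar tail
            sum += (long) a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] a = new int[37];
        int[] b = new int[37];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
            b[i] = 2;
        }
        int vlen = 8; // assumed hardware lane count, for illustration only
        System.out.println(dot(a, b, vlen));
        System.out.println(dot(a, b, 2 * vlen));
    }
}
```

The loop body is written once; only the `lanes` argument changes between the VLEN and 2VLEN variants, which is the kind of experiment described above.
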
> If we had an assortment of logical vector shapes in
> the Vector API, then we could experiment with our loops
> “plugging in” various lane counts, without being limited
> by the physics of the hardware (except bumping into
> register file limits of course).  It would be very good
> to be able to write a vectorized loop once, and then
> later plug in varying shapes to see which ones perform
> best.  There is often a medium unroll count that is best:
> It gets the most parallelism of the VPU without blowing
> out the register file resources (and spilling).  Being
> able to alter a single source loop freely by plugging
> in different vector shapes would make it easier for
> the programmer to find the sweet spot.  (And then
> there are dynamic programming tactics after that…)
>
> Another use for logical vector shapes might be mapping
> to tile hardware.  A tile could be viewed as a longer
> vector, containing the ravel of its points.  The vector
> API would express tiles in this way.  The characteristic
> operation of reducing (by add, usually) along one of
> two dimensions could be couched, in terms of the Vector
> API, as a “partial reduce”.  Instead of reducing VLEN
> lanes to a single scalar, the new methods would reduce
> VLEN=M*N lanes to a smaller vector of M (or N) values.
>
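A small sketch of what such a "partial reduce" could compute, again in plain Java rather than the Vector API. It assumes the tile is stored as its row-major ravel of M*N lanes; the class and method names are hypothetical and do not correspond to any existing API.

```java
import java.util.Arrays;

final class PartialReduceSketch {
    // The tile has m rows and n columns, stored as a row-major ravel
    // of m * n lanes (an assumption of this sketch).

    // Reduce by add along each row: m * n lanes -> m values.
    static int[] reduceToRows(int[] tile, int m, int n) {
        int[] out = new int[m];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                out[i] += tile[i * n + j];
            }
        }
        return out;
    }

    // Reduce by add along each column: m * n lanes -> n values.
    static int[] reduceToCols(int[] tile, int m, int n) {
        int[] out = new int[n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                out[j] += tile[i * n + j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] tile = {1, 2, 3, 4, 5, 6}; // 2 x 3 tile
        System.out.println(Arrays.toString(reduceToRows(tile, 2, 3)));
        System.out.println(Arrays.toString(reduceToCols(tile, 2, 3)));
    }
}
```
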
> That is, I think the Vector API could be a home for
> hardware tiles as well as hardware vectors.  I could
> be wrong, but it seems to me that lanewise operations
> don’t care about 1D vs. 2D, and special ops like
> transpose are just (standard) shuffles.  The main
> missing bit is the one I pointed out, partial reduce,
> and that is already present in a limited form.
>
> A similar point goes for the reverse of reduce, which
> is broadcast.  An M-lane partial broadcast could yield
> an MxN-lane (or NxM-lane) logical vector.
>
> An outer product (if you need one) can be composed from
> partial broadcasts and a lanewise product.
>
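As an illustration of the partial broadcast and outer product described above, here is a plain-Java sketch. It assumes the MxN logical vector is a row-major ravel; all names are hypothetical and only emulate what an enhanced API might offer.

```java
import java.util.Arrays;

final class PartialBroadcastSketch {
    // u has m lanes; lane i is repeated across row i of the m x n result.
    static int[] broadcastRows(int[] u, int n) {
        int m = u.length;
        int[] out = new int[m * n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                out[i * n + j] = u[i];
            }
        }
        return out;
    }

    // w has n lanes; lane j is repeated down column j of the m x n result.
    static int[] broadcastCols(int[] w, int m) {
        int n = w.length;
        int[] out = new int[m * n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                out[i * n + j] = w[j];
            }
        }
        return out;
    }

    // Lanewise product of two equally sized logical vectors.
    static int[] mul(int[] x, int[] y) {
        int[] out = new int[x.length];
        for (int k = 0; k < x.length; k++) {
            out[k] = x[k] * y[k];
        }
        return out;
    }

    // Outer product composed from two partial broadcasts and a lanewise product.
    static int[] outer(int[] u, int[] w) {
        return mul(broadcastRows(u, w.length), broadcastCols(w, u.length));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(outer(new int[] {1, 2}, new int[] {3, 4, 5})));
    }
}
```
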
> A full tile-by-tile multiply, if supported by hardware,
> cannot be readily expressed in such terms, since it
> has a 3D temporary value, and there seems to be not
> much value to making that explicit.  But it could be
> an intrinsified Java method.
>
> A vector-by-tile multiply would be easy to express
> in the enhanced Vector API:  Do a partial broadcast
> of the vector, then a lanewise product, then a partial
> reduce.  If hardware supports it all in one command,
> the pattern is simple enough to be recognized in the IR
> optimizer and replaced by the special command.
>
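The three-step pattern for a vector-by-tile multiply can be sketched as follows, in plain Java with the tile stored as a row-major ravel of M*N lanes. This only emulates the data flow; the names are hypothetical, and no claim is made about how a real implementation or intrinsic would look.

```java
import java.util.Arrays;

final class VectorByTileSketch {
    // Computes r[j] = sum over i of v[i] * tile[i][j], for an m x n tile.
    static int[] multiply(int[] v, int[] tile, int m, int n) {
        // Step 1: partial broadcast of v (m lanes) to m * n lanes,
        // repeating v[i] across row i.
        int[] b = new int[m * n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                b[i * n + j] = v[i];
            }
        }
        // Step 2: lanewise product with the tile.
        int[] p = new int[m * n];
        for (int k = 0; k < m * n; k++) {
            p[k] = b[k] * tile[k];
        }
        // Step 3: partial reduce by add over the m rows, leaving n lanes.
        int[] r = new int[n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                r[j] += p[i * n + j];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int[] v = {1, 2};
        int[] tile = {1, 2, 3, 4, 5, 6}; // 2 x 3 tile
        System.out.println(Arrays.toString(multiply(v, tile, 2, 3)));
    }
}
```
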
> Logical vectors composed of multiple physical vectors
> might also be useful as tables, or as enhanced windows
> over columnar data, as Andrii seems to suggest.  There
> are hardware instructions which take pairs of registers
> and treat them as lookup tables, and clearly they
> could be understood in terms of logical lookups on
> single 2VLEN registers.
>
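A rough sketch of the lookup-table reading, in plain Java: two VLEN-entry tables (`lo` and `hi`) behave like one logical table of 2VLEN entries. The names are hypothetical, and returning 0 for an out-of-range index is an assumption of this sketch, not a statement about any particular instruction set.

```java
import java.util.Arrays;

final class PairedTableLookupSketch {
    // Looks up each index in the logical table formed by lo followed by hi.
    static byte[] lookup(byte[] lo, byte[] hi, int[] idx) {
        int vlen = lo.length;
        byte[] out = new byte[idx.length];
        for (int k = 0; k < idx.length; k++) {
            int i = idx[k];
            if (i >= 0 && i < vlen) {
                out[k] = lo[i];            // first physical register
            } else if (i >= vlen && i < vlen + hi.length) {
                out[k] = hi[i - vlen];     // second physical register
            }
            // Otherwise out[k] stays 0 (assumed out-of-range behavior).
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] lo = {10, 11, 12, 13};
        byte[] hi = {20, 21, 22, 23};
        int[] idx = {0, 5, 3, 7, 9};
        System.out.println(Arrays.toString(lookup(lo, hi, idx)));
    }
}
```
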
> — John
>