<div dir="ltr">Hi.<div>Thank you for your answers and help, guys.<br><div><br><div>>Synthesizing larger logical</div>>vector registers from multiple physical vector registers by<br>>treating several physical registers as a single logical<br>>register.  Logically, they could be regarded as being<br>>concatenated (extra-long logical register) or stacked<br>>in a new direction (2D tile register).</div></div><div><br></div><div>That is exactly what I meant. I hope we will see this API eventually. Thank you for the updates.<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, May 11, 2024 at 1:23 AM John Rose <<a href="mailto:john.r.rose@oracle.com">john.r.rose@oracle.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 10 May 2024, at 13:14, Paul Sandoz wrote:<br>

<br>

> There are no plans to implement such explicit register blocking.<br>

<br>

There are no plans, but there have been some discussions,<br>

assuming I know what Andrii means by the term “register<br>

blocking”.  It might mean this: Synthesizing larger logical<br>

vector registers from multiple physical vector registers by<br>

treating several physical registers as a single logical<br>

register.  Logically, they could be regarded as being<br>

concatenated (extra-long logical register) or stacked<br>

in a new direction (2D tile register).<br>

<br>

(Note:  Their memory layouts would be contiguous.<br>

But their registers would be independently allocated<br>

by the register allocator.  In some cases, one<br>

physical segment of a long logical vector might be<br>

live while another is dead.  We might see this in<br>

carefully tuned hand-written assembler.)<br>

<br>

You can see hints of logical vectors in C APIs for vectors<br>

that provide 2x/3x/4x/5x register types.  They have many<br>

use cases, including linear algebra.  ARM has int16x8x2_t.<br>

<br>

A hardware vector with VLEN lanes lets you manually<br>

unroll a loop VLEN ways.  A logical vector with 2VLEN<br>

lanes lets you unroll the loop twice as much.  If you<br>

have enough physical registers to unroll by 2VLEN without<br>

spilling, it might be a win to shape the loop that way.<br>

<br>

If we had an assortment of logical vector shapes in<br>

the Vector API, then we could experiment with our loops<br>

“plugging in” various lane counts, without being limited<br>

by the physics of the hardware (except bumping into<br>

register file limits of course).  It would be very good<br>

to be able to write a vectorized loop once, and then<br>

later plug in varying shapes to see which ones perform<br>

best.  There is often a medium unroll count that is best:<br>

It gets the most parallelism of the VPU without blowing<br>

out the register file resources (and spilling).  Being<br>

able to alter a single source loop freely by plugging<br>

in different vector shapes would make it easier for<br>

the programmer to find the sweet spot.  (And then<br>

there are dynamic programming tactics after that…)<br>
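<br>

As a rough illustration of "write the loop once, plug in the lane count", here is a scalar Java model.<br>
It does not use the Vector API: the class name, the method name, and the <code>lanes</code> parameter are hypothetical, and a plain array of accumulators stands in for a logical vector.<br>

<pre>
```java
// Scalar model only: `lanes` accumulators stand in for one logical vector.
final class PluggableLaneCountSketch {
    // Sum of a[i] * b[i], processing `lanes` elements per outer iteration.
    static long sumProducts(int[] a, int[] b, int lanes) {
        if (a.length != b.length || lanes <= 0) {
            throw new IllegalArgumentException("equal lengths and lanes >= 1 expected");
        }
        long[] acc = new long[lanes];                   // the "logical vector" accumulator
        int upper = a.length - (a.length % lanes);      // largest multiple of lanes
        int i = 0;
        for (; i < upper; i += lanes) {
            for (int l = 0; l < lanes; l++) {           // lanewise multiply-add
                acc[l] += (long) a[i + l] * b[i + l];
            }
        }
        long total = 0;
        for (long x : acc) {                            // final full reduction
            total += x;
        }
        for (; i < a.length; i++) {                     // scalar tail
            total += (long) a[i] * b[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        int[] b = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
        // Same loop, different "shapes"; only the lane count changes.
        for (int lanes : new int[] {4, 8, 16}) {
            System.out.println(lanes + " lanes: " + sumProducts(a, b, lanes));
        }
    }
}
```
</pre>

The result is the same for every lane count; only the performance would differ on real hardware, and that is what such an experiment would measure.<br>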

<br>

Another use for logical vector shapes might be mapping<br>

to tile hardware.  A tile could be viewed as a longer<br>

vector, containing the ravel of its points.  The vector<br>

API would express tiles in this way.  The characteristic<br>

operation of reducing (by add, usually) along one of<br>

two dimensions could be couched, in terms of the Vector<br>

API, as a “partial reduce”.  Instead of reducing VLEN<br>

lanes to a single scalar, the new methods would reduce<br>

VLEN=M*N lanes to a smaller vector of M (or N) values.<br>
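<br>

A minimal scalar sketch of what such a "partial reduce" would compute, assuming the tile is stored as the row-major ravel of its M*N lanes.<br>
The class and method names are hypothetical and are not part of the Vector API; plain <code>int[]</code> arrays stand in for logical vectors.<br>

<pre>
```java
import java.util.Arrays;

// Scalar model only: an int[] of m*n lanes stands in for a tile-shaped logical vector.
final class PartialReduceSketch {
    // Add along each row of the m x n tile: m*n lanes in, m values out.
    static int[] reduceAlongRows(int[] lanes, int m, int n) {
        checkShape(lanes, m, n);
        int[] out = new int[m];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                out[i] += lanes[i * n + j];
            }
        }
        return out;
    }

    // Add along each column of the m x n tile: m*n lanes in, n values out.
    static int[] reduceAlongColumns(int[] lanes, int m, int n) {
        checkShape(lanes, m, n);
        int[] out = new int[n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                out[j] += lanes[i * n + j];
            }
        }
        return out;
    }

    private static void checkShape(int[] lanes, int m, int n) {
        if (lanes.length != m * n) {
            throw new IllegalArgumentException("expected m*n lanes");
        }
    }

    public static void main(String[] args) {
        int[] tile = {1, 2, 3,
                      4, 5, 6};                                             // 2 x 3 tile
        System.out.println(Arrays.toString(reduceAlongRows(tile, 2, 3)));    // [6, 15]
        System.out.println(Arrays.toString(reduceAlongColumns(tile, 2, 3))); // [5, 7, 9]
    }
}
```
</pre>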

<br>

That is, I think the Vector API could be a home for<br>

hardware tiles as well as hardware vectors.  I could<br>

be wrong, but it seems to me that lanewise operations<br>

don’t care about 1D vs. 2D, and special ops like<br>

transpose are just (standard) shuffles.  The main<br>

missing bit is the one I pointed out, partial reduce,<br>

and that is already present in a limited form.<br>

<br>

A similar point goes for the reverse of reduce, which<br>

is broadcast.  An M-lane partial broadcast could yield<br>

an MxN-lane (or NxM-lane) logical vector.<br>

<br>

An outer product (if you need one) can be composed from<br>

partial broadcasts and a lanewise product.<br>
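<br>

The partial broadcast and the outer product built from it can be modelled in scalar Java as follows.<br>
This is only a sketch under the same row-major tile assumption; all names are hypothetical and none of these methods exist in the Vector API.<br>

<pre>
```java
import java.util.Arrays;

// Scalar model only: int[] arrays stand in for logical vectors and tiles.
final class PartialBroadcastSketch {
    // m values in, m*n lanes out: value v[i] fills row i of the tile.
    static int[] broadcastAlongRows(int[] v, int n) {
        int m = v.length;
        int[] out = new int[m * n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                out[i * n + j] = v[i];
            }
        }
        return out;
    }

    // n values in, m*n lanes out: value v[j] fills column j of the tile.
    static int[] broadcastAlongColumns(int[] v, int m) {
        int n = v.length;
        int[] out = new int[m * n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                out[i * n + j] = v[j];
            }
        }
        return out;
    }

    // Outer product of a (m values) and b (n values) as two partial
    // broadcasts followed by a lanewise multiply.
    static int[] outerProduct(int[] a, int[] b) {
        int[] left = broadcastAlongRows(a, b.length);
        int[] right = broadcastAlongColumns(b, a.length);
        int[] out = new int[left.length];
        for (int k = 0; k < out.length; k++) {
            out[k] = left[k] * right[k];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] a = {1, 2};
        int[] b = {3, 4, 5};
        System.out.println(Arrays.toString(outerProduct(a, b))); // [3, 4, 5, 6, 8, 10]
    }
}
```
</pre>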

<br>

A full tile-by-tile multiply, if supported by hardware,<br>

cannot be readily expressed in such terms, since it<br>

has a 3D temporary value, and there seems to be little<br>

value in making that explicit.  But it could be<br>

an intrinsified Java method.<br>

<br>

A vector-by-tile multiply would be easy to express<br>

in the enhanced Vector API:  Do a partial broadcast<br>

of the vector, then a lanewise product, then a partial<br>

reduce.  If hardware supports it all in one command,<br>

the pattern is simple enough to be recognized in the IR<br>

optimizer and replaced by the special command.<br>
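<br>

The three-step pattern (partial broadcast, lanewise product, partial reduce) can be spelled out in scalar Java.<br>
This is a sketch with hypothetical names, assuming an M-lane vector and an M x N tile stored row-major; it models the data flow only and says nothing about how an IR optimizer would match it.<br>

<pre>
```java
import java.util.Arrays;

// Scalar model only: int[] arrays stand in for logical vectors and tiles.
final class VectorByTileSketch {
    // v has m lanes, tile has m*n lanes (row-major); the result has n lanes.
    static int[] multiply(int[] v, int[] tile, int n) {
        int m = v.length;
        if (tile.length != m * n) {
            throw new IllegalArgumentException("expected m*n tile lanes");
        }
        // Step 1: partial broadcast, v[i] fills row i of an m x n logical vector.
        int[] spread = new int[m * n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                spread[i * n + j] = v[i];
            }
        }
        // Step 2: lanewise product with the tile.
        int[] product = new int[m * n];
        for (int k = 0; k < product.length; k++) {
            product[k] = spread[k] * tile[k];
        }
        // Step 3: partial reduce (add) down each column, leaving n values.
        int[] out = new int[n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                out[j] += product[i * n + j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] v = {1, 2};
        int[] tile = {1, 2, 3,
                      4, 5, 6};                                    // 2 x 3 tile
        System.out.println(Arrays.toString(multiply(v, tile, 3))); // [9, 12, 15]
    }
}
```
</pre>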

<br>

Logical vectors composed of multiple physical vectors<br>

might also be useful as tables, or as enhanced windows<br>

over columnar data, as Andrii seems to suggest.  There<br>

are hardware instructions which take pairs of registers<br>

and treat them as lookup tables, and clearly they<br>

could be understood in terms of logical lookups on<br>

single 2VLEN registers.<br>
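<br>

A scalar sketch of that equivalence: a lookup that selects from a pair of VLEN-lane tables gives the same answer as a lookup into one concatenated 2VLEN-lane table.<br>
The names are hypothetical, and the sketch assumes every index lies in the range 0 to 2*VLEN-1; real instructions differ in how they treat out-of-range indexes.<br>

<pre>
```java
import java.util.Arrays;

// Scalar model only: int[] arrays stand in for physical and logical registers.
final class PairedTableLookupSketch {
    // Two physical tables of vlen lanes each, selected by index range.
    static int[] lookupPair(int[] lo, int[] hi, int[] indexes) {
        int vlen = lo.length;
        int[] out = new int[indexes.length];
        for (int i = 0; i < indexes.length; i++) {
            int idx = indexes[i];
            out[i] = idx < vlen ? lo[idx] : hi[idx - vlen];
        }
        return out;
    }

    // The same lookup against a single logical table of 2*vlen lanes.
    static int[] lookupLogical(int[] lo, int[] hi, int[] indexes) {
        int[] table = new int[lo.length + hi.length];
        System.arraycopy(lo, 0, table, 0, lo.length);
        System.arraycopy(hi, 0, table, lo.length, hi.length);
        int[] out = new int[indexes.length];
        for (int i = 0; i < indexes.length; i++) {
            out[i] = table[indexes[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] lo = {10, 11, 12, 13};
        int[] hi = {20, 21, 22, 23};
        int[] indexes = {0, 5, 3, 7};
        System.out.println(Arrays.toString(lookupPair(lo, hi, indexes)));    // [10, 21, 13, 23]
        System.out.println(Arrays.toString(lookupLogical(lo, hi, indexes))); // [10, 21, 13, 23]
    }
}
```
</pre>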

<br>

— John<br>

</blockquote></div>