Register blocking using Vector API.

John Rose john.r.rose at oracle.com
Fri May 10 23:15:08 UTC 2024


On 10 May 2024, at 13:14, Paul Sandoz wrote:

> There are no plans to implement such explicit register blocking.

There are no plans, but there have been some discussions,
assuming I know what Andrii means by the term “register
blocking”.  It might mean this: Synthesizing larger logical
vector registers from multiple physical vector registers by
treating several physical registers as a single logical
register.  Logically, they could be regarded as being
concatenated (extra-long logical register) or stacked
in a new direction (2D tile register).

(Note:  Their memory layouts would be contiguous.
But their registers would be independently allocated
by the register allocator.  In some cases, one
physical segment of a long logical vector might be
live while another is dead.  We might see this in
carefully tuned hand-written assembler.)

You can see hints of logical vectors in C APIs for vectors
that provide 2x/3x/4x/5x register types.  They have many
use cases, including linear algebra.  ARM has int16x8x2_t.

A hardware vector with VLEN lanes lets you manually
unroll a loop VLEN ways.  A logical vector with 2VLEN
lanes lets you unroll the loop twice as much.  If you
have enough physical registers to unroll by 2VLEN without
spilling, it might be a win to shape the loop that way.

If we had an assortment of logical vector shapes in
the Vector API, then we could experiment with our loops
“plugging in” various lane counts, without being limited
by the physics of the hardware (except bumping into
register file limits of course).  It would be very good
to be able to write a vectorized loop once, and then
later plug in varying shapes to see which ones perform
best.  There is often a medium unroll count that is best:
It gets the most parallelism of the VPU without blowing
out the register file resources (and spilling).  Being
able to alter a single source loop freely by plugging
in different vector shapes would make it easier for
the programmer to find the sweet spot.  (And then
there are dynamic programming tactics after that…)
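
As a rough illustration of writing the loop once and plugging in
the lane count, here is a scalar Java model of a sum reduction.
It is only a sketch and makes no Vector API calls: the class name
PluggableLaneSum and the "lanes" parameter are hypothetical, with
"lanes" standing in for a logical vector length such as VLEN,
2*VLEN, or 4*VLEN.

```java
// Scalar model only; names are hypothetical, not part of the Vector API.
final class PluggableLaneSum {
    // Sums the array using 'lanes' independent accumulators, the way a
    // loop unrolled over one logical vector of that length would.
    static float sum(float[] a, int lanes) {
        float[] acc = new float[lanes];              // one accumulator per logical lane
        int upper = a.length - (a.length % lanes);   // largest multiple of 'lanes'
        int i = 0;
        for (; i < upper; i += lanes) {
            for (int l = 0; l < lanes; l++) {        // lanewise add of one logical vector
                acc[l] += a[i + l];
            }
        }
        float total = 0f;
        for (int l = 0; l < lanes; l++) {            // full reduce of the accumulators
            total += acc[l];
        }
        for (; i < a.length; i++) {                  // scalar tail
            total += a[i];
        }
        return total;
    }
}
```

The loop body is the same for every lane count. Only the argument
changes, which is the kind of experiment described above.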

Another use for logical vector shapes might be mapping
to tile hardware.  A tile could be viewed as a longer
vector, containing the ravel of its points.  The Vector
API would express tiles in this way.  The characteristic
operation of reducing (by add, usually) along one of
two dimensions could be couched, in terms of the Vector
API, as a “partial reduce”.  Instead of reducing VLEN
lanes to a single scalar, the new methods would reduce
VLEN=M*N lanes to a smaller vector of M (or N) values.
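
The intended semantics of a partial reduce can be sketched with a
scalar Java model.  This is not an existing Vector API method: the
class and method names are hypothetical, and the row-major ravel
(row r, column c at index r*N + c) is an assumption made for the
example.

```java
// Scalar model only; names are hypothetical, not part of the Vector API.
// The tile is assumed to be raveled in row-major order.
final class PartialReduceModel {
    // Sum along each row: m*n lanes in, m lanes out.
    static float[] reduceRows(float[] tile, int m, int n) {
        float[] out = new float[m];
        for (int r = 0; r < m; r++) {
            for (int c = 0; c < n; c++) {
                out[r] += tile[r * n + c];
            }
        }
        return out;
    }

    // Sum down each column: m*n lanes in, n lanes out.
    static float[] reduceColumns(float[] tile, int m, int n) {
        float[] out = new float[n];
        for (int r = 0; r < m; r++) {
            for (int c = 0; c < n; c++) {
                out[c] += tile[r * n + c];
            }
        }
        return out;
    }
}
```

For the 2x3 tile {1, 2, 3, 4, 5, 6}, reduceRows gives {6, 15} and
reduceColumns gives {5, 7, 9}.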

That is, I think the Vector API could be a home for
hardware tiles as well as hardware vectors.  I could
be wrong, but it seems to me that lanewise operations
don’t care about 1D vs. 2D, and special ops like
transpose are just (standard) shuffles.  The main
missing bit is the one I pointed out, partial reduce,
and that is already present in a limited form.

A similar point goes for the reverse of reduce, which
is broadcast.  An M-lane partial broadcast could yield
an MxN-lane (or NxM-lane) logical vector.

An outer product (if you need one) can be composed from
partial broadcasts and a lanewise product.
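
A scalar Java model may make the composition concrete.  It is a
sketch, not Vector API code: all names are hypothetical, and the
result is assumed to be an MxN tile raveled in row-major order.

```java
// Scalar model only; names are hypothetical, not part of the Vector API.
final class OuterProductModel {
    // Partial broadcast of an m-lane vector across n columns:
    // result[r*n + c] = v[r].
    static float[] broadcastAcrossColumns(float[] v, int n) {
        int m = v.length;
        float[] out = new float[m * n];
        for (int r = 0; r < m; r++) {
            for (int c = 0; c < n; c++) {
                out[r * n + c] = v[r];
            }
        }
        return out;
    }

    // Partial broadcast of an n-lane vector down m rows:
    // result[r*n + c] = v[c].
    static float[] broadcastDownRows(float[] v, int m) {
        int n = v.length;
        float[] out = new float[m * n];
        for (int r = 0; r < m; r++) {
            for (int c = 0; c < n; c++) {
                out[r * n + c] = v[c];
            }
        }
        return out;
    }

    // Outer product u (m lanes) by v (n lanes) as two partial
    // broadcasts followed by a lanewise product.
    static float[] outer(float[] u, float[] v) {
        float[] a = broadcastAcrossColumns(u, v.length);
        float[] b = broadcastDownRows(v, u.length);
        float[] out = new float[a.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] * b[i];                 // lanewise product
        }
        return out;
    }
}
```

With u = {1, 2} and v = {3, 4, 5}, outer returns the 2x3 tile
{3, 4, 5, 6, 8, 10}.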

A full tile-by-tile multiply, if supported by hardware,
cannot be readily expressed in such terms, since it
has a 3D temporary value, and there seems to be little
value in making that explicit.  But it could be
an intrinsified Java method.

A vector-by-tile multiply would be easy to express
in the enhanced Vector API:  Do a partial broadcast
of the vector, then a lanewise product, then a partial
reduce.  If hardware supports it all in one command,
the pattern is simple enough to be recognized in the IR
optimizer and replaced by the special command.
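
The three steps can be modeled in scalar Java as follows.  This
is a sketch under stated assumptions, not the enhanced API itself:
the names are hypothetical, the vector has M lanes, and the tile
is MxN in row-major order, so the result has N lanes.

```java
// Scalar model only; names are hypothetical, not part of the Vector API.
final class VectorByTileModel {
    // Computes out[c] = sum over r of v[r] * tile[r*n + c].
    static float[] multiply(float[] v, float[] tile, int m, int n) {
        // Step 1: partial broadcast, m lanes -> m*n lanes (v[r] fills row r).
        float[] wide = new float[m * n];
        for (int r = 0; r < m; r++) {
            for (int c = 0; c < n; c++) {
                wide[r * n + c] = v[r];
            }
        }
        // Step 2: lanewise product with the tile.
        float[] prod = new float[m * n];
        for (int i = 0; i < prod.length; i++) {
            prod[i] = wide[i] * tile[i];
        }
        // Step 3: partial reduce down each column, m*n lanes -> n lanes.
        float[] out = new float[n];
        for (int r = 0; r < m; r++) {
            for (int c = 0; c < n; c++) {
                out[c] += prod[r * n + c];
            }
        }
        return out;
    }
}
```

For v = {1, 2} and the 2x3 tile {1, 2, 3, 4, 5, 6}, the result is
{9, 12, 15}.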

Logical vectors composed of multiple physical vectors
might also be useful as tables, or as enhanced windows
over columnar data, as Andrii seems to suggest.  There
are hardware instructions which take pairs of registers
and treat them as lookup tables, and clearly they
could be understood in terms of logical lookups on
single 2VLEN registers.
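
The equivalence can be sketched with a scalar Java model.  It does
not correspond to any particular hardware instruction or Vector
API method: the names are hypothetical, and indexes are assumed to
lie in the range 0 to 2*VLEN - 1.

```java
// Scalar model only; names are hypothetical, not part of the Vector API.
final class PairedTableLookupModel {
    // Lookup in a pair of VLEN-lane registers: indexes below VLEN
    // select from 'lo', the rest select from 'hi'.
    static int[] lookupPair(int[] lo, int[] hi, int[] idx) {
        int vlen = lo.length;
        int[] out = new int[idx.length];
        for (int i = 0; i < idx.length; i++) {
            int k = idx[i];
            out[i] = (k < vlen) ? lo[k] : hi[k - vlen];
        }
        return out;
    }

    // The same lookup viewed as indexing one logical 2*VLEN-lane table.
    static int[] lookupLogical(int[] table, int[] idx) {
        int[] out = new int[idx.length];
        for (int i = 0; i < idx.length; i++) {
            out[i] = table[idx[i]];
        }
        return out;
    }
}
```

With lo = {10, 11, 12, 13}, hi = {20, 21, 22, 23}, and indexes
{0, 5, 3, 7}, both methods return {10, 21, 13, 23} when the
logical table is the concatenation of lo and hi.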

— John


More information about the panama-dev mailing list