[vector] ARM SVE

John Rose john.r.rose at oracle.com
Thu Mar 1 20:43:52 UTC 2018


On Mar 1, 2018, at 12:59 AM, Andrew Haley <aph at redhat.com> wrote:
> 
> Thanks for looking at this.  It's very useful to have a look now at
> our core assumptions about what a vector might look like.

Yes.  The API is parameterized by shapes, but shapes do not
*necessarily* have fixed sizes.  Likewise, it has auxiliary
types for masks and shuffles, but those types do not (intentionally)
encode assumptions about sizes or formats.
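
To make the parameterization concrete, here is a tiny sketch using
names from the incubator Vector API prototype (the exact spellings of
the species, mask, and shuffle types are illustrative, not a commitment):

    import jdk.incubator.vector.*;

    class ShapesAndMasks {
        // the shape is carried by the species; masks and shuffles are
        // separate auxiliary types with no size assumptions of their own
        static final VectorSpecies<Float> PREFERRED = FloatVector.SPECIES_PREFERRED;
        static final VectorSpecies<Float> FIXED_256 = FloatVector.SPECIES_256;
        static final VectorMask<Float>    ALL_TRUE  = PREFERRED.maskAll(true);
        static final VectorShuffle<Float> IDENTITY  = VectorShuffle.iota(PREFERRED, 0, 1, false);
    }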

As an accident of history, our Intel friends have helped us to do
much of our portability work just on x64, by providing several
different vector architectures to port to, all in one convenient place.

So far all of our vectors have a small range of statically specified
sizes.  SVE stretches, but does not completely break that model.

I'm looking forward to doing a Vector implementation where the
size is truly a dynamic property, something like the VCODE
virtual architecture.  As a Thinking Machines alum, I'm trying to
bake that possibility into today's Vector API.

> It's unfortunate that SVE hardware is still a little way away, but if
> we can write shape-agnostic code we should.

This is a good time to talk about where the Vector API
might go in the future, and how we are trying to get there.

The short answer is, a long way in small steps.

Getting any kind of tight vector code from Java is a major
accomplishment all by itself, whether the code is shape
agnostic or not.  Removing shape dependencies has always
been on the roadmap, but we have to start somewhere
more concrete.  As you point out, shape-aware code will
probably always be a use case.  I don't think we will get
stuck there, and I look forward, after a few more rounds
with shape-aware code, to working on the problems of
shape-shifting (shape polymorphic) loops.

One vector shape type I very much want to see prototyped
soon is the "loose end" shape, which is derived from a system
preferred shape, but has an odd smaller size.  Basically,
it is a system-appropriate vector which is derived from
a standard vector, but with a suitable mask or count that
encodes the odd bit left over after all the full vectors have
been processed.  The vector might be either a full vector
plus count, or else a lgN sized collection of successively
half-sized sub-vectors, plus a final scalar.  That depends
on the platform, but the API is simple:  It finishes your
loops for you.  A similar type (or the same in some
cases) will handle the warm-up of loops where alignment
to a multi-lane block is desirable.  Clearly SVE has
its own take on how to do this.
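
To put some code behind the idea, here is a minimal sketch of the
mask-based flavor of the loose end; the method names (loopBound,
indexInRange, and so on) are illustrative sketches, not final API:

    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;

    class LooseEnd {
        static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

        static void scale(float[] a, float[] r, float s) {
            int i = 0;
            int upper = SPECIES.loopBound(a.length);  // largest multiple of the lane count
            for (; i < upper; i += SPECIES.length()) {
                FloatVector.fromArray(SPECIES, a, i).mul(s).intoArray(r, i);
            }
            // the "loose end": a full-width vector plus a mask that covers
            // only the lanes left over after the full steps
            VectorMask<Float> m = SPECIES.indexInRange(i, a.length);
            FloatVector.fromArray(SPECIES, a, i, m).mul(s).intoArray(r, i, m);
        }
    }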

The Vector API seems to be tolerant of multiple 
levels of abstraction, so we can play games like that.

It may even be possible to build mega-Vectors,
in the same API or a variant, which have the VCODE-like
property of large, data-dependent sizes (and
masks).  (And permutations.  At full-problem sizes,
a shuffle turns into a routing problem, with potential
reductions at collision points.  A very rich parallel
computing paradigm.)  Such a mega-vector has
a close correspondence to today's streams, and
I hope eventually to be able to work with programs
which look stream-like at the top level, and decompose
into efficient vector-wise loops at the low level.
With, of course, multi-CPU decomposition in the
middle.  That's the known sweet spot for maximizing
throughput for embarrassingly parallel problems,
and I think we'll get there.
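
Here is a rough sketch of that top-to-bottom split, purely to fix
ideas (the chunking and the vector method names are illustrative):

    import java.util.stream.IntStream;
    import jdk.incubator.vector.FloatVector;
    import jdk.incubator.vector.VectorSpecies;

    class StreamOverVectors {
        static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
        static final int CHUNK = 1 << 16;  // per-core work unit, size chosen arbitrarily

        static void scaleAll(float[] a, float[] r, float s) {
            int chunks = (a.length + CHUNK - 1) / CHUNK;
            // stream-like at the top, multi-CPU decomposition in the middle
            IntStream.range(0, chunks).parallel().forEach(c -> {
                int lo = c * CHUNK, hi = Math.min(lo + CHUNK, a.length);
                int i = lo;
                // tight vector-wise loop at the bottom
                for (; i <= hi - SPECIES.length(); i += SPECIES.length()) {
                    FloatVector.fromArray(SPECIES, a, i).mul(s).intoArray(r, i);
                }
                for (; i < hi; i++) r[i] = a[i] * s;  // scalar loose end
            });
        }
    }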

Maybe when we get there we will see a stream
of shape-shifting vectors.  Or maybe we'll figure
out how to shift the vectorization down under
the surface of the stream API.  Probably the
first step will come before the second.  Again,
the right move is to code concretely first but
plan to abstract further and further away from
the details, without losing the tight loops
at any step of the way.

A final thought about shape-shifting loops:
This is a specific instance of the Loop Customization
Problem, where a concrete instance of a generic
algorithm (with an associated loop, or it's not
interesting) has a problem with profile pollution.
The profile pollution prevents our current arsenal
of techniques from boiling the generic loop
down to tight code.  This is sometimes called
the Inlining Problem, on the assumption that
enough inlining will push concrete types into
the generic loop and allow it to boil down.
But inlining is not enough, since modern
loops are routinely dispatched across
threads, and the trip through the scheduler
and across threads is (probably) neither
profitable nor practical to inline fully.
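
A tiny illustration of the pollution, with names invented for the
example: one shared generic loop, several unrelated operators flowing
through it.

    import java.util.function.IntUnaryOperator;

    class Pollution {
        // one generic loop, shared by every caller
        static long sum(int[] a, IntUnaryOperator op) {
            long s = 0;
            for (int x : a) s += op.applyAsInt(x);  // this call site's profile mixes all ops
            return s;
        }

        static long useIt(int[] a) {
            long r = 0;
            r += sum(a, x -> x * 2);   // three distinct lambdas pollute the one profile,
            r += sum(a, x -> x * x);   // so the JIT cannot boil the loop down to tight
            r += sum(a, x -> x + 7);   // code unless it inlines sum() into each caller
            return r;
        }
    }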

What's needed is a way for the generic
part of the program to say, "this generic
loop, with all its pluggable generic parts,
is a template which depends on the
following externally supplied parameters",
and for the dispatching of such a loop
to involve specializing the template explicitly
under those parameters, without pulling
in the entire inlining context as extra weight.
Specializing the template is the "Customization"
part of the LCP, and the hard part is defining
the template in the first place:  Which parts
of a specific loop request formula (stream
structure, e.g.) are crucial to the loop code,
and which are "just input data".  Input data
is not part of the specialization; the same
specialization can work on different input
data later on, and the work of optimizing
a specialization can be amortized over
many problem requests (or sub-problems
executed on multiple cores).
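
In code, the split might look roughly like this (the names IntKernel
and specialize are made up for the sketch; today's Java gives no way
to force the customization, this only marks where it would apply):

    import java.util.function.IntUnaryOperator;

    interface IntKernel { long apply(int[] data); }

    class LoopTemplate {
        // 'op' is a specialization parameter: it is crucial to the loop code
        static IntKernel specialize(IntUnaryOperator op) {
            // 'data' is just input data: the same specialization can be
            // reused on many inputs, across many cores
            return data -> {
                long s = 0;
                for (int x : data) s += op.applyAsInt(x);
                return s;
            };
        }
    }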

This is a long-standing problem.  I think
Java's mix of static and dynamic computing
can contribute a uniquely balanced solution.
We do stuff like that under the covers already
with method handle specialization.  But I also
think that a full solution *might* require
language-level features, such as explicit
templates.

http://cr.openjdk.java.net/~jrose/values/template-classes.html
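
For a flavor of the method-handle version mentioned above, here is a
small, runnable illustration; the customization itself happens inside
the JVM when a handle like this gets hot, not in anything visible below:

    import java.lang.invoke.MethodHandle;
    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.MethodType;

    class MHSpecialize {
        static long sum(int[] a, int scale) {
            long s = 0;
            for (int x : a) s += (long) x * scale;
            return s;
        }

        public static void main(String[] args) throws Throwable {
            MethodHandle raw = MethodHandles.lookup().findStatic(
                MHSpecialize.class, "sum",
                MethodType.methodType(long.class, int[].class, int.class));
            // bind 'scale' early; the bound handle is the unit the JVM can
            // customize, independently of the arrays fed to it later
            MethodHandle byThree = MethodHandles.insertArguments(raw, 1, 3);
            long r = (long) byThree.invokeExact(new int[] {1, 2, 3});
            System.out.println(r);  // prints 18
        }
    }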

— John

P.S. I'm very glad we are doing this together,
with multiple companies, labs, universities.
What we have done even today would have been
impossible without lots of work and vision from
Intel's management and engineers.  Big problems
like this are, and must be, solved at the scale of
open source projects.


