Vector API target platforms

John Rose john.r.rose at oracle.com
Mon Nov 20 18:04:17 UTC 2017


The previous exchange below, on the target platforms of the Vector
API, is worth giving a fresh title and a fuller expression.

The work on the Vector API is a joint effort between Intel and Oracle
in the OpenJDK, aimed at direct coding of loops using AVX, AVX2,
and AVX512.  But, it is also a proper OpenJDK project, meaning the
design is intended to scale to other vector architectures.  I mentioned
SVE below because that's one that will occur to everyone who knows
ARM, but I could have also mentioned NEON.

In fact, there are many interesting integrated instruction sets like AVX,
SVE, VIS, and AltiVec.  The Java Vector API provides a way to abstract
over the details of these architectures, so that loops can be written in a
portable manner.  In some cases it should be possible to code a simple
vector loop once and run it efficiently on a range of architectures.
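As a rough sketch of that idea (using names from the jdk.incubator.vector module that the API later grew into, not necessarily this 2017 prototype), a portable elementwise loop might look like this; the "preferred" species picks the widest shape the hardware supports, so the same source adapts to AVX2, AVX-512, NEON, and so on:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class VectorAdd {
    // The preferred species selects the platform's widest supported shape,
    // so the loop below is written once and runs at full width everywhere.
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static void add(float[] a, float[] b, float[] c) {
        int i = 0;
        int upper = SPECIES.loopBound(a.length);  // largest multiple of the lane count
        for (; i < upper; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            va.add(vb).intoArray(c, i);
        }
        for (; i < a.length; i++) {               // scalar cleanup for the tail
            c[i] = a[i] + b[i];
        }
    }
}
```

The lane count never appears as a literal in the loop, which is what makes the same code efficient across vector widths.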

Still, the most effective loop shape depends intimately on details of
the target, most notably vector size and supported lane types, as well
as modes of memory access and partial operations.  So the design of
the Vector API is a running battle between the programmer's need
to express those key decisions, and the need to keep those decisions
from overwhelming the basic logical form of the loop.

With more metaprogramming power on the JVM, including condy
and (future) crackable lambdas, we will have more options for
designing a graceful factoring between the logical form of a loop
and its efficient embodiment on a particular platform.  Even today
it is possible to "virtualize" the size (lane count) of a vector, assuming
support for a particular lane type.  We can also virtualize the lane
type, to exotic types like half-float, although this is a bit more awkward
since Java demands a lane type that it knows about.  Still later,
bigger extensions like value types and template classes will
smooth out programming with generics over exotic types.
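That size virtualization is visible in the species abstraction of the incubator API that eventually shipped (again, a sketch using later jdk.incubator.vector names, not this prototype): the lane count is a run-time property of the species a program selects, not a constant baked into the source.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

public class SpeciesDemo {
    // The lane count is queried from the species at run time.
    static int preferredLanes() {
        VectorSpecies<Float> s = FloatVector.SPECIES_PREFERRED;
        return s.length();  // e.g. 8 on AVX2 (256 bits / 32-bit lanes), 16 on AVX-512
    }

    // A fixed shape can still be requested when an algorithm demands one.
    static int lanes256() {
        return FloatVector.SPECIES_256.length();  // always 8 float lanes
    }
}
```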

Meanwhile, there are more target platforms out there besides the
ones mentioned above.  There are the GPUs and other coprocessors,
which typically operate on large vectors with data-dependent sizes
and shapes.  (I alluded to these indirectly at the end of the enclosed
exchange with Andrew.)  Further down the road there are other
bulk processing mechanisms, such as FPGAs.  From the past,
there is the wonderful intermediate bulk-SIMD IR by Guy Blelloch
called VCODE, which was designed for the Connection Machine,
but can be deployed on a wide variety of hardware, and has
inspired researchers even in this century.  It is our ambition that
the Vector API, or some successor, can be adjusted and applied
to all of these backends.

— John

http://mail.openjdk.java.net/pipermail/panama-dev/2017-November/000781.html

On Nov 17, 2017, at 10:54 AM, John Rose <john.r.rose at oracle.com> wrote:

On Nov 16, 2017, at 1:17 PM, Vladimir Ivanov <vladimir.x.ivanov at oracle.com> wrote:
> 
> Andrew,
> 
>>> FYI I did a quick experiment with more generic vector intrinsics and
>>> wanted to share the first results. The motivation was to explore
>>> possible reduction in the number of intrinsics needed to support Vector
>>> API in the JVM.
>> Out of interest ... has anyone looked at the suitability of the
>> intrinsics for non-Intel architectures?  Obviously I'm concerned about
>> the possibility of ending up with a bunch of C2 patterns that don't map
>> onto, say, my favourite architecture.
> 
> I haven't heard about any experiments with Vector API on non-x86 architectures, but if C2 knows how to lower vector ideal nodes on your favorite architecture, then Vector API intrinsics "just work".

That's where we are aiming.  It might seem risky to base the design only
on x86 experiments, but… the x86 is 3 or 4 different vector architectures
combined.  So Java's WORA value proposition is valuable even just as a
tactic for tracking x86 generations.

Thus, the design as written should work for any fixed-size vector architecture.
That *probably* includes SVE, since the runtime sense of vector size can be
folded into the vector-selection factories.

The design is *intended* to work for variable-size vector architectures *also*,
but so far we have only been thinking about that, not experimenting.

An intermediate step will be a "partial vector with mask" abstraction which
will layer over the AVX (and SVE) vectors in order to support loop edges
(startup and cleanup, pre- and post-loop).  Making sure that works is a step
towards vectors with data-dependent sizes and shapes.
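In the incubator API as it later materialized, that partial-vector idea looks roughly like the following sketch (later jdk.incubator.vector names, not this prototype): a per-iteration mask disables the lanes that fall past the end of the array, so one loop body covers the full-width iterations and the ragged tail alike.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class MaskedAdd {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // One loop handles body and cleanup: the mask turns off lanes
    // whose index would land past a.length.
    static void add(float[] a, float[] b, float[] c) {
        for (int i = 0; i < a.length; i += SPECIES.length()) {
            VectorMask<Float> m = SPECIES.indexInRange(i, a.length);
            FloatVector va = FloatVector.fromArray(SPECIES, a, i, m);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i, m);
            va.add(vb).intoArray(c, i, m);
        }
    }
}
```

On hardware with native predication (AVX-512, SVE) the mask maps directly to predicate registers; elsewhere it lowers to blends or a scalar tail.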

— John


