implementation of Vector API and associated lambdas

Paul Sandoz paul.sandoz at oracle.com
Fri Jan 22 14:54:47 UTC 2016


Hi,

To get a feel for it at the API level, I took one of John’s straw-man APIs, tweaked it, and hacked up a simple implementation of some aspects [1]. I conveniently ignored optimisation; there is a lot of boxing going on.

The main idea is one asks for a vector species for a given unit type and width:
  Vector.Species<Byte, Shapes.S64Bit> s = Vector.byteSpecies(Shapes.SHAPE_64_BIT);
This serves as the factory to create instances:
  Vector<Byte, Shapes.S64Bit> zeros = s.zeroVector();
  Vector<Byte, Shapes.S64Bit> ones = s.fromElement((byte) 1);
  Vector<Byte, Shapes.S64Bit> twos = s.fromElement((byte) 2);
and then perform operations:
  Vector<Byte, Shapes.S64Bit> r = zeros.add(ones).add(twos);
Other examples:

{
    Vector.Species<Byte, Shapes.S512Bit> s = Vector.byteSpecies(Shapes.SHAPE_512_BIT);

    Vector<Byte, Shapes.S512Bit> a = s.zeroVector();
    // Alternating 1s and 0s across the lanes
    Vector<Byte, Shapes.S512Bit> b = s.generate(i -> (byte) (i % 2 == 0 ? 1 : 0));

    // Lane-wise equality comparison
    Vector<Byte, Shapes.S512Bit> r = a.compareEqual(b);
    System.out.println(r);

    // The comparison result viewed as a mask, one bit per lane...
    Vector.Mask<Shapes.S512Bit> m = r.toMask();

    System.out.println(Long.toBinaryString(m.toLong()));

    // ...and back again as a vector of the given element type
    r = m.toVector(Byte.class);
    System.out.println(r);
}

{
    Vector.Species<Byte, Shapes.S512Bit> s = Vector.byteSpecies(Shapes.SHAPE_512_BIT);

    Vector<Byte, Shapes.S512Bit> a = s.generate(i -> (byte) i);

    // A shuffle counting down from the last lane, i.e. a reversal of the lane order
    Vector.Shuffle<Shapes.S512Bit> iota = s.iota(a.length() - 1, -1, a.length());

    Vector<Byte, Shapes.S512Bit> b = a.shuffle(iota);
    System.out.println(a);
    System.out.println(b);
}
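
(Purely to illustrate the lambda-parameterized style of John’s straw man quoted below, usage against such a species might look like the following; map and reduce are assumptions taken from the straw man and are not implemented in this prototype.)

{
    Vector.Species<Byte, Shapes.S512Bit> s = Vector.byteSpecies(Shapes.SHAPE_512_BIT);

    Vector<Byte, Shapes.S512Bit> a = s.generate(i -> (byte) i);

    // Hypothetical lambda-parameterized operations from the straw man, not this prototype
    Vector<Byte, Shapes.S512Bit> doubled = a.map(e -> (byte) (e * 2));
    byte sum = a.reduce((x, y) -> (byte) (x + y));
}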
I only implemented a species for Byte, but it would be easy to extend to, say, Integer with another species-based factory method.
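
For instance, an Integer species might be obtained and used like this (intSpecies is hypothetical; only byteSpecies exists in this prototype):

  Vector.Species<Integer, Shapes.S512Bit> is = Vector.intSpecies(Shapes.SHAPE_512_BIT);
  // Same factory pattern as the Byte species: broadcast an element, then operate lane-wise
  Vector<Integer, Shapes.S512Bit> ones = is.fromElement(1);
  Vector<Integer, Shapes.S512Bit> r = ones.add(is.zeroVector());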

My experience so far is that I think such an API could work (with lambdas too). API design is more my area of strength; optimisation-wise I don’t have enough experience to really know for sure.

A pro is that code using the API clearly expresses intent, which may make it easier for a runtime compiler to optimise.

A con is that such code is hardcoded to specific widths. A more general solution should ideally let the runtime compiler dynamically choose the most appropriate width for the platform and unit type when operating over non-fixed lengths.
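
To make that concrete, the first method below is tied to a 512-bit shape, while a shape-generic variant (sketched with an unconstrained type variable; the straw man may bound it differently) leaves the choice of species, and hence width, to the caller or the runtime:

  // Hardcoded to a 512-bit shape
  static Vector<Byte, Shapes.S512Bit> broadcast512(Vector.Species<Byte, Shapes.S512Bit> s, byte e) {
      return s.fromElement(e);
  }

  // Shape-generic: the species chosen by the caller (or, ideally, the runtime) fixes the width.
  // The unbounded type variable S is an assumption; the straw man may constrain it.
  static <S> Vector<Byte, S> broadcast(Vector.Species<Byte, S> s, byte e) {
      return s.fromElement(e);
  }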

Still, at this stage it may be worth exploring a number of approaches based on lower-level building blocks.
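
As one example of such a building block, the VectorOp idea John sketches in [3] below could be approximated by a kernel lambda that also implements a small metadata interface reporting what it is, so the runtime need not rely on inlining to recognise it. The sketch below is illustrative only; the names are mine, not those in VectorOp.java:

  // Illustrative only: a scalar kernel that voluntarily identifies itself
  interface KnownKernel extends java.util.function.IntBinaryOperator {
      String kernelName();  // e.g. "iadd"
  }

  // A constant the Vector API could special-case, in the spirit of VectorOps
  static final KnownKernel IADD = new KnownKernel() {
      public int applyAsInt(int x, int y) { return x + y; }
      public String kernelName() { return "iadd"; }
  };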

Paul.

[1] http://cr.openjdk.java.net/~psandoz/panama/vector.zip

> On 22 Jan 2016, at 01:31, John Rose <john.r.rose at oracle.com> wrote:
> 
> Several people have been looking at refining and implementing the straw-man Vector API I proposed[1].  The exchange of December 4 [2] shows some of this work but there's been more!
> 
> [1]: http://cr.openjdk.java.net/~jrose/arrays/vector/Vector.java
> [2]: http://mail.openjdk.java.net/pipermail/panama-dev/2015-December/thread.html#225
> 
> First, background:  Vector types for Java are an important aspect for Panama, because Panama is about binding Java more closely to hardware and system APIs.  One big reason, today, for binding closely to hardware is to process wider data flows in SIMD modes.  (And IMO this is a long-term trend towards right-sizing data channel widths, as hardware grows wider in various ways.)  Vector.java and AVX bindings are where we are experimenting, today, with busting Java's 64-bit limit.  (In the future this will support a variety of SIMD CPUs, and converge smoothly with Valhalla value types and parametric polymorphism, to create a natural looking set of primitive-like value types for vectors of all types, on all platforms.)
> 
> So, recently, Paul Sandoz has created a skeletal web of concrete types that "bottoms out" in a simple byte-based container.  And Ian Graves (of Intel) has whipped up a single-class reference implementation, that uses extra metadata fields to steer around the shapes and element types.  Michael Berg (Intel) is our expert in compiling loops through the HotSpot SuperWord optimizations into the latest and greatest Intel vector instructions. Finally, Vladimir has been working on a strategy for creating quick punches all the way through to the CPU, in a way that should support quick prototyping of vector intrinsics, on all of our platforms.  (Quick prototyping does not mean the feature is of temporary use:  It seems there are always new instructions to grapple with, and there may always be a place for doing so without tight integration with the JIT.  It's the case now, anyway.  In a future Graal-based world, adding plugins for new instructions will be much easier; maybe that will become the best way to prototype with new instructions.)
> 
> One perhaps surprising aspect of the Vector API is that it factors the complexity of the API by lambda-parameterizing families of methods; the lambdas supply elementwise or reduction operations (as v.map(fn) or v.reduce(fn) instead of v.negate(), v.sum(), etc.).  This is inspired by JDK 8 streams, of course.  IMO, it is an excellent bit of usability and future-proofing, but it has a cost:  Instruction selection for the vector operation requires that the lambda be "cracked" to examine the part of the loop kernel it carries.  The standard way to do this is to wait for the JIT to inline everything (and hope it doesn't get tired of inlining halfway through).  When this works, it is spectacularly successful.  It can fail silently.  And it requires special powers and privileges to do the cracking, which are not trivial to define (though .NET has done it successfully).
> 
> I want to keep the lambdas, but Ian has pointed out that we need to start by special-casing some of the methods (e.g., lane-wise or cross-lane addition) for demonstration purposes.  That's fine; we did things like that with streams.  To make the lambdas carry the weight, we want (IMO) two things, one short-term and one long-term.  In the long-term we want a lambda cracker, something like our method-handle cracker.  It should be able to completely analyze at least simple lambdas like `(x,y)->(x+y)`, which are really just wrapped `iadd` instructions.  The JVMCI should also provide a robust way to crack them all, but we have to be careful how we use the high privileges (Unsafe-level) required to use the JVMCI.
> 
> In the short term, I propose creating a VectorOp interface (and/or VectorOps class) which contains a large number of factories and/or static finals which give alternate notations for the loop kernel components (like `iadd`) required by the Vector API.  It's an under-appreciated fact that lambdas interoperate with interfaces, and a lambda-like value can implement not only a primary interface, but additional secondary ones.  In the case of VectorOps, the IntOperator implementing scalar 32-bit `iadd` should also implement whatever additional interfaces are necessary for the Vector API to properly categorize it as part of a loop.  That means one or both of the following:  1. A metadata API which voluntarily reports the operation.  As, "I am really just iadd", or maybe "I am the following expression tree".  2. An API lifted to vectors, either "I am willing to take 4 sets of inputs at once", or "I am willing to produce a method handle to work on right-sized vectors".  I've posted an example of what I mean by this [3].
> 
> [3]: http://cr.openjdk.java.net/~jrose/arrays/vector/VectorOp.java
> 
> Working with crackable lambdas might give us insights into helping JDK 8 streams also, in ways that do not require always-lucky inlining.  Of course we don't want to require users, in general, to use lambdas from a fixed menu (like VectorOps) but working with the side handshakes will help us understand what a general lambda cracker should look like, if it must help vectorize loops.
> 
> Now for some thoughts about the vectors themselves.  Basically, I think we should do whatever it takes to make the concrete representation (whether hidden or not) of a Vector to be as simple as possible, even if this requires complicated class structure.  Each individual concrete class must, of course, support full box-elision optimization so aggressive that it can laugh at Lady Luck.  (We'll get there in Valhalla, where boxing is a rare, relatively explicit event, but currently we need to work harder to lift our vectors out of Java's data model, which requires a heap node for every value.)
> 
> In the end, I think we will prefer to factor every property about a vector, except for the actual payload bits, into an auxiliary object (shared by all vectors of the same "species"), and use a class-per-species implementation strategy.  This will lead to MxN classes, lazily generated under the covers; Valhalla parametric polymorphism will reduce one or both factors.
> 
> (Rationale:  When they become Valhalla value types, they will be pretty large, and it will be best if they fit cleanly into a vector register without extra fields.  Also, for now, class-based data constant-folds more reliably than field-based constant data, the notorious trust-final-fields problem.  In any case, the JVM has lots of class inference, profiling, and speculation logic; the more info. we derive from the concrete class, the more we leverage that mature technology. There's going to be an object header anyway, so let's put it to work!)
> 
> Perhaps fast type-specific paths can be mediated by boxed generic entry points (which as a downside rely on always-lucky box elision).  Or, following Paul's VarHandles, they can use method handles for fast type-specific entry points, in which case a species object contains a bundle of preset method handles, and looks very much like a v-table.  That sounds better to me, since MHs require less luck to choose efficient calling sequences.  Valhalla parameter instances will support species more directly (N species per class) but for now we probably want a 1-1 species/class ratio.
> 
> The various concrete classes would then extend a small number of intrinsically magic, maximally aligned, pure-data container types (Long2/4/8), adding various parameterized interfaces with default methods, plus (as absolutely necessary) mechanically specialized methods.  (BTW, some of those interfaces, like Vector itself, are public.  Some, like the lambda cracking APIs mentioned above, would be only for internal handshaking—modules give us more options for hiding public names, thus encapsulating interface views despite publicity of methods.)
> 
> For maximum reuse and cleanliness, it should be the case that nearly all of the code is either generic (some of it packaged in interface default methods), or else is generated on the fly via indy instructions (in the concrete classes) or via MH-based handshakes originating in the generic code.  Big extra bonus points if we don't resort to ASM, but that's likely only in the Valhalla world, since concrete signatures may need to be adjusted (which is a Valhalla super-power).
> 
> Depending on how we use MHs, the vectors might act as if they are coded in a dynamic language designed for expanding loop kernels.  On a Graal-based stack we might experiment with Truffle to do stuff like that.  For now we need to build on JDK 8 technology for this project, which means closures and indy.
> 
> — John


