implementation of Vector API and associated lambdas
Vitaly Davidovich
vitalyd at gmail.com
Fri Jan 22 14:59:23 UTC 2016
Lots of good thoughts here John and I'm excited you guys are exploring this.
I'd like to add that cracking lambdas would be beneficial in general.
Right now, for precisely the reasons you mentioned, one has to avoid them
in hot paths even when they make the code and API cleaner, which is very
unfortunate.
On Thursday, January 21, 2016, John Rose <john.r.rose at oracle.com> wrote:
> Several people have been looking at refining and implementing the
> straw-man Vector API I proposed[1]. The exchange of December 4 [2] shows
> some of this work but there's been more!
>
> [1]: http://cr.openjdk.java.net/~jrose/arrays/vector/Vector.java
> [2]: http://mail.openjdk.java.net/pipermail/panama-dev/2015-December/thread.html#225
>
> First, background: Vector types for Java are an important aspect for
> Panama, because Panama is about binding Java more closely to hardware and
> system APIs. One big reason, today, for binding closely to hardware is to
> process wider data flows in SIMD modes. (And IMO this is a long-term trend
> towards right-sizing data channel widths, as hardware grows wider in
> various ways.) Vector.java and AVX bindings are where we are
> experimenting, today, with busting Java's 64-bit limit. (In the future
> this will support a variety of SIMD CPUs, and converge smoothly with
> Valhalla value types and parametric polymorphism, to create a natural
> looking set of primitive-like value types for vectors of all types, on all
> platforms.)
>
> So, recently, Paul Sandoz has created a skeletal web of concrete types
> that "bottoms out" in a simple byte-based container. And Ian Graves (of
> Intel) has whipped up a single-class reference implementation, that uses
> extra metadata fields to steer around the shapes and element types.
> Michael Berg (Intel) is our expert in compiling loops through the HotSpot
> SuperWord optimizations into the latest and greatest Intel vector
> instructions. Finally, Vladimir has been working on a strategy for creating
> quick punches all the way through to the CPU, in a way that should support
> quick prototyping of vector intrinsics, on all of our platforms. (Quick
> prototyping does not mean the feature is of temporary use: It seems there
> are always new instructions to grapple with, and there may always be a
> place for doing so without tight integration with the JIT. It's the case
> now, anyway. In a future Graal-based world, adding plugins for new
> instructions will be much easier; maybe that will become the best way to
> prototype with new instructions.)
>
> One perhaps surprising aspect of the Vector API is that it factors the
> complexity of the API by lambda-parameterizing families of methods; the
> lambdas supply elementwise or reduction operations (as v.map(fn) or
> v.reduce(fn) instead of v.negate(), v.sum(), etc.). This is inspired by
> JDK 8 streams, of course. IMO, it is an excellent bit of usability and
> future-proofing, but it has a cost: Instruction selection for the vector
> operation requires that the lambda be "cracked" to examine the part of the
> loop kernel it carries. The standard way to do this is to wait for the JIT to
> inline everything (and hope it doesn't get tired of inlining halfway
> through). When this works, it is spectacularly successful. It can fail
> silently. And it requires special powers and privileges to do the
> cracking, which are not trivial to define (though .NET has done it
> successfully).
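To make the trade-off concrete, the lambda-parameterized style can be sketched in plain Java (a minimal sketch; `IntVectorSketch` and its method names are illustrative stand-ins, not the proposed API):

```java
// Sketch only: IntVectorSketch is a hypothetical stand-in for the Vector API.
import java.util.function.IntBinaryOperator;
import java.util.function.IntUnaryOperator;

public final class IntVectorSketch {
    private final int[] lanes;

    public IntVectorSketch(int... lanes) { this.lanes = lanes.clone(); }

    public int lane(int i) { return lanes[i]; }

    // Elementwise: v.map(fn) instead of a fixed menu like v.negate().
    public IntVectorSketch map(IntUnaryOperator fn) {
        int[] r = new int[lanes.length];
        for (int i = 0; i < lanes.length; i++) r[i] = fn.applyAsInt(lanes[i]);
        return new IntVectorSketch(r);
    }

    // Cross-lane reduction: v.reduce(0, fn) instead of v.sum().
    public int reduce(int identity, IntBinaryOperator fn) {
        int acc = identity;
        for (int lane : lanes) acc = fn.applyAsInt(acc, lane);
        return acc;
    }

    public static void main(String[] args) {
        IntVectorSketch v = new IntVectorSketch(1, 2, 3, 4);
        System.out.println(v.map(x -> -x).lane(0));        // -1
        System.out.println(v.reduce(0, (x, y) -> x + y));  // 10
    }
}
```

Without cracking, each `fn.applyAsInt` call is an opaque per-lane invocation; the point of cracking is to let the runtime see that `(x, y) -> x + y` is just `iadd` and select one SIMD instruction instead.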
>
> I want to keep the lambdas, but Ian has pointed out that we need to start
> by special-casing some of the methods (e.g., lane-wise or cross-lane
> addition) for demonstration purposes. That's fine; we did things like that
> with streams. To make the lambdas carry the weight, we want (IMO) two
> things, one short-term and one long-term. In the long-term we want a
> lambda cracker, something like our method-handle cracker. It should be
> able to completely analyze at least simple lambdas like `(x,y)->(x+y)`,
> which are really just wrapped `iadd` instructions. The JVMCI should also
> provide a robust way to crack them all, but we have to be careful how we
> use the high privileges (Unsafe-level) required to use the JVMCI.
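For reference, the method-handle cracker alluded to here already exists as `MethodHandles.Lookup.revealDirect` (JDK 8); a small demo shows the kind of symbolic recovery a lambda cracker would need to do for `(x,y)->(x+y)`:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandleInfo;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class CrackDemo {
    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        // A direct handle for Integer.sum(int, int): a wrapped scalar `iadd`.
        MethodHandle iadd = lookup.findStatic(
            Integer.class, "sum",
            MethodType.methodType(int.class, int.class, int.class));
        // The existing method-handle cracker: recover the symbolic reference.
        MethodHandleInfo info = lookup.revealDirect(iadd);
        System.out.println(info.getDeclaringClass().getName() + "." + info.getName());
        // prints: java.lang.Integer.sum
    }
}
```

A lambda produced by `invokedynamic` offers no such hook today, which is why cracking it currently means either JIT inlining or JVMCI-level access.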
>
> In the short term, I propose creating a VectorOp interface (and/or
> VectorOps class) which contains a large number of factories and/or static
> finals which give alternate notations for the loop kernel components (like
> `iadd`) required by the Vector API. It's an under-appreciated fact that
> lambdas interoperate with interfaces, and a lambda-like value can implement
> not only a primary interface, but additional secondary ones. In the case
> of VectorOps, the IntOperator implementing scalar 32-bit `iadd` should also
> implement whatever additional interfaces are necessary for the Vector API
> to properly categorize it as part of a loop. That means one or both of
> the following:
>
> 1. A metadata API which voluntarily reports the operation, as in "I am
> really just iadd", or maybe "I am the following expression tree".
>
> 2. An API lifted to vectors, either "I am willing to take 4 sets of
> inputs at once", or "I am willing to produce a method handle to work on
> right-sized vectors".
>
> I've posted an example of what I mean by this [3].
>
> [3]: http://cr.openjdk.java.net/~jrose/arrays/vector/VectorOp.java
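One possible shape for this handshake (a sketch under invented names; `KnownOp` and `VectorOpsSketch` are illustrations, not taken from [3]): the `iadd` constant implements both the functional interface the API consumes and a secondary metadata interface that voluntarily reports the operation.

```java
import java.util.function.IntBinaryOperator;

// Hypothetical metadata interface: the voluntary "I am really just iadd" report.
interface KnownOp {
    String opName();
}

public final class VectorOpsSketch {
    // One lambda-like value, two interfaces: the primary functional interface
    // plus the secondary metadata interface used to categorize it in a loop.
    interface CrackableIntOp extends IntBinaryOperator, KnownOp {}

    public static final CrackableIntOp IADD = new CrackableIntOp() {
        public int applyAsInt(int x, int y) { return x + y; }
        public String opName() { return "iadd"; }
    };

    public static void main(String[] args) {
        IntBinaryOperator fn = IADD;
        // A vector implementation can categorize the op without JIT inlining:
        if (fn instanceof KnownOp) {
            System.out.println(((KnownOp) fn).opName());  // iadd
        }
        System.out.println(fn.applyAsInt(2, 3));          // 5
    }
}
```

A vector's `map`/`reduce` could test `fn instanceof KnownOp` to select an instruction, falling back to per-lane calls for arbitrary user lambdas.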
>
> Working with crackable lambdas might give us insights into helping JDK 8
> streams also, in ways that do not require always-lucky inlining. Of course
> we don't want to require users, in general, to use lambdas from a fixed
> menu (like VectorOps) but working with the side handshakes will help us
> understand what a general lambda cracker should look like, if it must help
> vectorize loops.
>
> Now for some thoughts about the vectors themselves. Basically, I think we
> should do whatever it takes to make the concrete representation (whether
> hidden or not) of a Vector to be as simple as possible, even if this
> requires complicated class structure. Each individual concrete class must,
> of course, support full box-elision optimization so aggressive that it can
> laugh at Lady Luck. (We'll get there in Valhalla, where boxing is a rare,
> relatively explicit event, but currently we need to work harder to lift our
> vectors out of Java's data model, which requires a heap node for every
> value.)
>
> In the end, I think we will prefer to factor every property about a
> vector, except for the actual payload bits, into an auxiliary object
> (shared by all vectors of the same "species"), and use a class-per-species
> implementation strategy. This will lead to MxN classes, lazily generated
> under the covers; Valhalla parametric polymorphism will reduce one or both
> factors.
>
> (Rationale: When they become Valhalla value types, they will be pretty
> large, and it will be best if they fit cleanly into a vector register
> without extra fields. Also, for now, class-based data constant-folds more
> reliably than field-based constant data, the notorious trust-final-fields
> problem. In any case, the JVM has lots of class inference, profiling, and
> speculation logic; the more information we derive from the concrete class, the
> more we leverage that mature technology. There's going to be an object
> header anyway, so let's put it to work!)
>
> Perhaps fast type-specific paths can be mediated by boxed generic entry
> points (which as a downside rely on always-lucky box elision). Or,
> following Paul's VarHandles, they can use method handles for fast
> type-specific entry points, in which case a species object contains a
> bundle of preset method handles, and looks very much like a v-table. That
> sounds better to me, since MHs require less luck to choose efficient
> calling sequences. Valhalla parameter instances will support species more
> directly (N species per class) but for now we probably want a 1-1
> species/class ratio.
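A species-as-v-table might look roughly like this (sketch only; the names and the scalar kernel are illustrative, and a real species would bind its handles to specialized or intrinsified code):

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical: one shared species object per vector shape/element type,
// holding preset method handles as type-specific entry points (a v-table).
public final class SpeciesSketch {
    final int laneCount;
    final MethodHandle add;  // (int[], int[]) -> int[]

    private SpeciesSketch(int laneCount, MethodHandle add) {
        this.laneCount = laneCount;
        this.add = add;
    }

    // The scalar kernel the handle points at, for illustration only.
    static int[] addLanes(int[] a, int[] b) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) r[i] = a[i] + b[i];
        return r;
    }

    public static SpeciesSketch int128() throws ReflectiveOperationException {
        MethodHandle add = MethodHandles.lookup().findStatic(
            SpeciesSketch.class, "addLanes",
            MethodType.methodType(int[].class, int[].class, int[].class));
        return new SpeciesSketch(4, add);  // 4 x 32-bit int = 128 bits
    }

    public static void main(String[] args) throws Throwable {
        SpeciesSketch s = int128();
        int[] r = (int[]) s.add.invokeExact(
            new int[]{1, 2, 3, 4}, new int[]{10, 20, 30, 40});
        System.out.println(r[3]);  // 44
    }
}
```

Because the handles are preset constants on a shared species object, the JIT can constant-fold through them without relying on lucky box elision.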
>
> The various concrete classes would then extend a small number of
> intrinsically magic, maximally aligned, pure-data container types
> (Long2/4/8), adding various parameterized interfaces with default methods,
> plus (as absolutely necessary) mechanically specialized methods. (BTW,
> some of those interfaces, like Vector itself, are public. Some, like the
> lambda cracking APIs mentioned above, would be only for internal
> handshaking—modules give us more options for hiding public names, thus
> encapsulating interface views despite publicity of methods.)
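In sketch form, with `Long2Sketch` standing in for the magic 128-bit container and all names invented for illustration:

```java
// Hypothetical pure-data container standing in for the magic Long2 type:
// 128 bits of payload and nothing else.
class Long2Sketch {
    final long lo, hi;
    Long2Sketch(long lo, long hi) { this.lo = lo; this.hi = hi; }
}

// A parameterized interface whose default method packages the generic code;
// concrete classes only supply the payload accessors.
interface VectorView<E> {
    int length();
    E lane(int i);

    default String describe() {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < length(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(lane(i));
        }
        return sb.append(']').toString();
    }
}

// One concrete class: 4 x int packed into the 128-bit container.
public final class IntVector128 extends Long2Sketch implements VectorView<Integer> {
    IntVector128(int a, int b, int c, int d) {
        super(pack(a, b), pack(c, d));
    }

    static long pack(int low, int high) {
        return (low & 0xFFFF_FFFFL) | ((long) high << 32);
    }

    public int length() { return 4; }

    public Integer lane(int i) {
        long word = (i < 2) ? lo : hi;
        return (int) (word >>> ((i & 1) * 32));
    }

    public static void main(String[] args) {
        System.out.println(new IntVector128(1, 2, 3, 4).describe());
        // prints: [1, 2, 3, 4]
    }
}
```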
>
> For maximum reuse and cleanliness, it should be the case that nearly all
> of the code is either generic (some of it packaged in interface default
> methods), or else is generated on the fly via indy instructions (in the
> concrete classes) or via MH-based handshakes originating in the generic
> code. Big extra bonus points if we don't resort to ASM, but that's likely
> only in the Valhalla world, since concrete signatures may need to be
> adjusted (which is a Valhalla super-power).
>
> Depending on how we use MHs, the vectors might act as if they were coded
> in a dynamic language designed for expanding loop kernels. On a
> Graal-based stack we might experiment with Truffle to do stuff like that.
> For now we need to build on JDK 8 technology for this project, which means
> closures and indy.
>
> — John
--
Sent from my phone