implementation of Vector API and associated lambdas
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Wed Jan 27 17:48:45 UTC 2016
On the platform level, along with the machine code snippets prototype, three
value-based wrappers for 128-, 256-, and 512-bit vector values were
added (j.l.Long2 [1], j.l.Long4 [2], and j.l.Long8 [3]).
They are used to represent vector values, and the JVM handles them
specially when passing them into code snippets (it tries to place the
values in vector registers instead of on the heap). Value type
experiments are part of Project Valhalla, but as a stop-gap solution I
decided to introduce custom optimizations specifically for
Long2/Long4/Long8. Once the JVM understands value types, the vector
value wrappers should be marked as such.
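To make the idea concrete, here is a minimal sketch of what such a 128-bit value-based wrapper looks like. The real j.l.Long2 is JVM-intrinsic and its internals differ; the field and method names below are purely illustrative.

```java
// Illustrative sketch of a 128-bit value-based wrapper in the spirit of
// j.l.Long2. The real class is JVM-intrinsic; names here are invented.
final class Long2Sketch {
    private final long lo, hi;   // two 64-bit lanes = 128 bits of payload

    Long2Sketch(long lo, long hi) { this.lo = lo; this.hi = hi; }

    long extract(int lane) { return lane == 0 ? lo : hi; }

    // Value-based: equality is defined by the payload, not by identity.
    @Override public boolean equals(Object o) {
        if (!(o instanceof Long2Sketch)) return false;
        Long2Sketch v = (Long2Sketch) o;
        return v.lo == lo && v.hi == hi;
    }
    @Override public int hashCode() {
        return Long.hashCode(lo) * 31 + Long.hashCode(hi);
    }
}
```

The value-based contract (identity-free equality, immutability) is what would let the wrapper be retroactively marked as a value type once Valhalla lands.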
For a while, I have been playing with binding the Vector API to vector
instructions using machine code snippets. The idea is to take a
primitive specialization of Vector (IntVector), map j.l.Long2/4/8 to
Vector.Shape's, and bind some methods to the corresponding vector
instructions on x86. My main motivation is to see how it all shapes up,
from the abstract API down to the machine level, with a focus on the
quality of the generated code. Providing a complete implementation is
out of scope for now (it requires some form of lambda decomposition
John already wrote about).
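As a rough illustration of the mapping, here is what a 256-bit int specialization could look like. This is a sketch only: in the prototype the payload would be a j.l.Long4 and add() would bind to a vector-instruction snippet; here a scalar loop stands in, and all names are invented.

```java
// Sketch: an IntVector-style specialization over a 256-bit shape.
// In the prototype the payload would be j.l.Long4 and add() would bind
// to a VPADDD machine code snippet; a scalar loop stands in here.
final class Int256 {
    static final int LANES = 8;        // 256 bits / 32-bit lanes
    final int[] lanes;                 // stand-in for the Long4 payload

    Int256(int[] lanes) { this.lanes = lanes.clone(); }

    Int256 add(Int256 other) {         // would be a single VPADDD snippet
        int[] r = new int[LANES];
        for (int i = 0; i < LANES; i++) r[i] = lanes[i] + other.lanes[i];
        return new Int256(r);
    }
}
```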
C2 support for code snippets provides a nice way to embed arbitrary
machine code into the code the JVM generates on the fly. But the final
generated code is far from optimal when snippet-rich code is compiled:
there are too many spills of vector values when they are passed around.
Such a code shape is usually observed when many 1-2 instruction
snippets are composed (which is the case for the Vector API).
It is a direct consequence of how code snippets are currently
represented on the C2 IR level (as native calls). The registers for
input and output values are fixed, so there is high contention on them
when multiple snippets are in play.
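To illustrate the contention, consider composing two add snippets to compute (a + b) + (c + d), where each snippet fixes its inputs in xmm0/xmm1 and its output in xmm0 (register choices here are illustrative, not what the prototype emits):

```
; illustrative only: every snippet wants inputs in xmm0/xmm1, output in xmm0
vpaddd  xmm0, xmm0, xmm1     ; t1 = a + b       (a in xmm0, b in xmm1)
vmovdqu [rsp+0], xmm0        ; spill t1: xmm0/xmm1 are needed again
vpaddd  xmm0, xmm0, xmm1     ; t2 = c + d       (after moving c, d into place)
vmovdqu xmm1, [rsp+0]        ; reload t1
vpaddd  xmm0, xmm0, xmm1     ; result = t1 + t2
```

With renamable registers, the same computation needs no spills at all: the RA could simply keep t1 in a free register such as xmm2.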
In order to produce more efficient code, the C2 register allocator (RA)
should be able to rename the registers used in the snippets being inlined.
From the implementation perspective, the JVM Compiler Interface (JVMCI
[4]) introduced in JDK 9 is a big help here, since it already contains
an ISA description [5]. It can be used as a foundation for an API (akin
to the C2 Architecture Description Language [6] [7]) to describe the
input/output register masks and other effects of a code snippet.
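A snippet descriptor built on such an API might carry something like the following. This is a hypothetical sketch; no such class exists in the prototype, and every name below is invented for illustration (the real thing would reference jdk.vm.ci register objects rather than strings).

```java
// Hypothetical descriptor a snippet could hand to the RA. All names are
// invented; a real version would use JVMCI register objects, not strings.
final class SnippetDescriptor {
    final String name;
    final String[] inputMask;    // registers the snippet accepts inputs in
    final String[] outputMask;   // registers the result may land in
    final boolean killsFlags;    // side effects the RA must account for

    SnippetDescriptor(String name, String[] inputMask,
                      String[] outputMask, boolean killsFlags) {
        this.name = name;
        this.inputMask = inputMask;
        this.outputMask = outputMask;
        this.killsFlags = killsFlags;
    }
}
```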
I won't dive into the implementation details of how snippet code
patching (according to RA decisions) can be done, but I envision two
ways: from the JVM (requires machine code parsing support) or on the
user side (an upcall with bound registers which returns the final
machine code for a snippet [9]). It will definitely complicate the
prototype, but it doesn't look infeasible.
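The user-side option could take roughly the following shape: the JVM upcalls into the snippet provider with the registers the RA chose, and the provider returns the final bytes. This is a hypothetical sketch; the interface name and the toy encoding are invented (real x86 VEX encoding is considerably more involved).

```java
// Hypothetical shape of the "user side" patching option. The JVM would
// upcall with the RA-assigned register encodings; the provider returns
// the finished machine code. All names here are invented.
interface PatchableSnippet {
    byte[] codeFor(int[] registerEncodings);
}

class PatchDemo {
    // Toy provider: pretend the opcode template is fixed and only one
    // register byte varies. Real VEX encoding is more involved.
    static final PatchableSnippet ADD = regs ->
        new byte[] { (byte) 0xC5, (byte) 0xF5, (byte) 0xFE, (byte) regs[0] };
}
```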
Let me know if anybody wants to experiment with it :-)
Best regards,
Vladimir Ivanov
[1] http://hg.openjdk.java.net/panama/panama/jdk/file/tip/src/java.base/share/classes/java/lang/Long2.java
[2] http://hg.openjdk.java.net/panama/panama/jdk/file/tip/src/java.base/share/classes/java/lang/Long4.java
[3] http://hg.openjdk.java.net/panama/panama/jdk/file/tip/src/java.base/share/classes/java/lang/Long8.java
[4] http://hg.openjdk.java.net/panama/panama/hotspot/file/tip/src/jdk.vm.ci/share/classes/
[5] http://hg.openjdk.java.net/panama/panama/hotspot/file/tip/src/jdk.vm.ci/share/classes/jdk.vm.ci.amd64/src/jdk/vm/ci/amd64/AMD64.java
[6] http://hg.openjdk.java.net/panama/panama/hotspot/file/tip/src/share/vm/adlc/Doc/Syntax.doc
[7] http://hg.openjdk.java.net/panama/panama/hotspot/file/tip/src/cpu/x86/vm/x86_64.ad
[8] http://hg.openjdk.java.net/panama/panama/hotspot/file/e01d58f01f33/src/cpu/x86/vm/x86_64.ad#l11885
[9] http://hg.openjdk.java.net/panama/panama/hotspot/file/e01d58f01f33/src/cpu/x86/vm/x86_64.ad#l11885
On 1/22/16 3:31 AM, John Rose wrote:
> Several people have been looking at refining and implementing the straw-man Vector API I proposed[1]. The exchange of December 4 [2] shows some of this work but there's been more!
>
> [1]: http://cr.openjdk.java.net/~jrose/arrays/vector/Vector.java
> [2]: http://mail.openjdk.java.net/pipermail/panama-dev/2015-December/thread.html#225
>
> First, background: Vector types for Java are an important aspect for Panama, because Panama is about binding Java more closely to hardware and system APIs. One big reason, today, for binding closely to hardware is to process wider data flows in SIMD modes. (And IMO this is a long-term trend towards right-sizing data channel widths, as hardware grows wider in various ways.) Vector.java and AVX bindings are where we are experimenting, today, with busting Java's 64-bit limit. (In the future this will support a variety of SIMD CPUs, and converge smoothly with Valhalla value types and parametric polymorphism, to create a natural looking set of primitive-like value types for vectors of all types, on all platforms.)
>
> So, recently, Paul Sandoz has created a skeletal web of concrete types that "bottoms out" in a simple byte-based container. And Ian Graves (of Intel) has whipped up a single-class reference implementation, that uses extra metadata fields to steer around the shapes and element types. Michael Berg (Intel) is our expert in compiling loops through the HotSpot SuperWord optimizations into the latest and greatest Intel vector instructions. Finally, Vladimir has been working on a strategy for creating quick punches all the way through to the CPU, in a way that should support quick prototyping of vector intrinsics, on all of our platforms. (Quick prototyping does not mean the feature is of temporary use: It seems there are always new instructions to grapple with, and there may always be a place for doing so without tight integration with the JIT. It's the case now, anyway. In a future Graal-based world, adding plugins for new instructions will be much easier; maybe that will become the best way to prototype with new instructions.)
>
> One perhaps surprising aspect of the Vector API is that it factors the complexity of the API by lambda-parameterizing families of methods; the lambdas supply elementwise or reduction operations (as v.map(fn) or v.reduce(fn) instead of v.negate(), v.sum(), etc.). This is inspired by JDK 8 streams, of course. IMO, it is an excellent bit of usability and future-proofing, but it has a cost: Instruction selection for the vector operation requires that the lambda be "cracked" to examine the part of the loop kernel it carries. The standard way to do this is to wait for the JIT to inline everything (and hope it doesn't get tired of inlining halfway through). When this works, it is spectacularly successful. It can fail silently. And it requires special powers and privileges to do the cracking, which are not trivial to define (though .NET has done it successfully).
>
> I want to keep the lambdas, but Ian has pointed out that we need to start by special-casing some of the methods (e.g., lane-wise or cross-lane addition) for demonstration purposes. That's fine; we did things like that with streams. To make the lambdas carry the weight, we want (IMO) two things, one short-term and one long-term. In the long-term we want a lambda cracker, something like our method-handle cracker. It should be able to completely analyze at least simple lambdas like `(x,y)->(x+y)`, which are really just wrapped `iadd` instructions. The JVMCI should also provide a robust way to crack them all, but we have to be careful how we use the high privileges (Unsafe-level) required to use the JVMCI.
>
> In the short term, I propose creating a VectorOp interface (and/or VectorOps class) which contains a large number of factories and/or static finals which give alternate notations for the loop kernel components (like `iadd`) required by the Vector API. It's an under-appreciated fact that lambdas interoperate with interfaces, and a lambda-like value can implement not only a primary interface, but additional secondary ones. In the case of VectorOps, the IntOperator implementing scalar 32-bit `iadd` should also implement whatever additional interfaces are necessary for the Vector API to properly categorize it as part of a loop. That means one or both of the following: 1. A metadata API which voluntarily reports the operation. As, "I am really just iadd", or maybe "I am the following expression tree". 2. An API lifted to vectors, either "I am willing to take 4 sets of inputs at once", or "I am willing to produce a method handle to work on right-sized vectors". I've posted an example of what I mean by this [3].
>
> [3]: http://cr.openjdk.java.net/~jrose/arrays/vector/VectorOp.java
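The secondary-interface handshake John describes can be sketched in a few lines. This is an illustration only, not John's VectorOp code: the KnownOp interface and all other names below are invented.

```java
import java.util.function.IntBinaryOperator;

// Sketch of the "secondary interface" handshake: an operator value that
// implements both the primary functional interface and an invented
// metadata interface voluntarily reporting which scalar op it really is.
interface KnownOp { String opcode(); }

class VectorOpsSketch {
    // A concrete class (unlike a plain lambda) can implement both.
    static final class IAdd implements IntBinaryOperator, KnownOp {
        public int applyAsInt(int x, int y) { return x + y; }
        public String opcode() { return "iadd"; }  // "I am really just iadd"
    }
    static final IntBinaryOperator IADD = new IAdd();

    // The Vector API can check for the handshake before falling back to
    // generic (inlining-dependent) treatment of an arbitrary lambda.
    static String categorize(IntBinaryOperator op) {
        return (op instanceof KnownOp) ? ((KnownOp) op).opcode() : "unknown";
    }
}
```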
>
> Working with crackable lambdas might give us insights into helping JDK 8 streams also, in ways that do not require always-lucky inlining. Of course we don't want to require users, in general, to use lambdas from a fixed menu (like VectorOps) but working with the side handshakes will help us understand what a general lambda cracker should look like, if it must help vectorize loops.
>
> Now for some thoughts about the vectors themselves. Basically, I think we should do whatever it takes to make the concrete representation (whether hidden or not) of a Vector to be as simple as possible, even if this requires complicated class structure. Each individual concrete class must, of course, support full box-elision optimization so aggressive that it can laugh at Lady Luck. (We'll get there in Valhalla, where boxing is a rare, relatively explicit event, but currently we need to work harder to lift our vectors out of Java's data model, which requires a heap node for every value.)
>
> In the end, I think we will prefer to factor every property about a vector, except for the actual payload bits, into an auxiliary object (shared by all vectors of the same "species"), and use a class-per-species implementation strategy. This will lead to MxN classes, lazily generated under the covers; Valhalla parametric polymorphism will reduce one or both factors.
>
> (Rationale: When they become Valhalla value types, they will be pretty large, and it will be best if they fit cleanly into a vector register without extra fields. Also, for now, class-based data constant-folds more reliably than field-based constant data, the notorious trust-final-fields problem. In any case, the JVM has lots of class inference, profiling, and speculation logic; the more info. we derive from the concrete class, the more we leverage that mature technology. There's going to be an object header anyway, so let's put it to work!)
>
> Perhaps fast type-specific paths can be mediated by boxed generic entry points (which as a downside rely on always-lucky box elision). Or, following Paul's VarHandles, they can use method handles for fast type-specific entry points, in which case a species object contains a bundle of preset method handles, and looks very much like a v-table. That sounds better to me, since MHs require less luck to choose efficient calling sequences. Valhalla parameter instances will support species more directly (N species per class) but for now we probably want a 1-1 species/class ratio.
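The species-as-v-table idea can be sketched with plain method handles. Again, a hedged illustration under invented names, not the actual design: the species object just bundles preset handles for its element type and shape.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Sketch of a "species" object holding preset method handles, acting as a
// v-table for one element-type/shape combination. All names are invented.
final class SpeciesSketch {
    final Class<?> elementType;
    final int bitSize;
    final MethodHandle lanewiseAdd;  // preset type-specific entry point

    SpeciesSketch(Class<?> elementType, int bitSize, MethodHandle lanewiseAdd) {
        this.elementType = elementType;
        this.bitSize = bitSize;
        this.lanewiseAdd = lanewiseAdd;
    }

    int addLanes(int x, int y) {     // dispatch through the "v-table" slot
        try { return (int) lanewiseAdd.invokeExact(x, y); }
        catch (Throwable t) { throw new AssertionError(t); }
    }

    static SpeciesSketch intSpecies256() {
        try {
            MethodHandle add = MethodHandles.lookup().findStatic(
                Integer.class, "sum",
                MethodType.methodType(int.class, int.class, int.class));
            return new SpeciesSketch(int.class, 256, add);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }
}
```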
>
> The various concrete classes would then extend a small number of intrinsically magic, maximally aligned, pure-data container types (Long2/4/8), adding various parameterized interfaces with default methods, plus (as absolutely necessary) mechanically specialized methods. (BTW, some of those interfaces, like Vector itself, are public. Some, like the lambda cracking APIs mentioned above, would be only for internal handshaking—modules give us more options for hiding public names, thus encapsulating interface views despite publicity of methods.)
>
> For maximum reuse and cleanliness, it should be the case that nearly all of the code is either generic (some of it packaged in interface default methods), or else is generated on the fly via indy instructions (in the concrete classes) or via MH-based handshakes originating in the generic code. Big extra bonus points if we don't resort to ASM, but that's likely only in the Valhalla world, since concrete signatures may need to be adjusted (which is a Valhalla super-power).
>
> Depending on how we use MHs, the vectors might act as if they are coded in a dynamic language designed for expanding loop kernels. On a Graal-based stack we might experiment with Truffle to do stuff like that. For now we need to build on JDK 8 technology for this project, which means closures and indy.
>
> — John
>