implementation of Vector API and associated lambdas

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Tue Feb 2 13:20:52 UTC 2016


More thoughts about the Vector API in particular.

> Like John said in the intro email, we’ve both been doing work in sketching out an API based on the straw man proposal he laid out in late 2015.  My initial sketch is a single class that constructed Vectors given some type/class-value arguments (in lieu of species).  After taking this route, it became pretty clear that A.) Vectors need to remember their species to facilitate functionality such as element-wise getting and putting and B.) There are more efficient ways of encoding this “knowledge of self” than in a field.  Paul's approach that employs a factory model seems appealing in this regard, but I wonder about the potential NxM problem that occurs if one requires a concrete class for every pair of element types and vector sizes.  Given we know the number of supported numeric types (I,J,F,D) and vector sizes (for now [128,256,512]), the upper bound for species is 12.  Perhaps this is an acceptable number of classes to maintain right now?
It shouldn't be a problem.

Bytecode specialization at runtime is a recognized technique for 
getting a performant implementation.

The JVM has to work hard to produce optimal code for a generalized API. 
Unfortunately, this sometimes leads to fragile or missed optimizations 
(due to profile pollution or heuristics quirks). Proper type 
specialization eliminates profile pollution, so the JVM can see the 
whole picture and act accordingly.

Performing specialization at runtime allows it to be done lazily, lets 
the strategy be chosen according to the mechanisms available, and lets 
the implementation evolve as the JVM grows more powerful and gains new 
capabilities.

Even if we find out later that there's a need to specialize for other 
aspects (well beyond 12 species), lazy specialization significantly 
reduces the actual number of species present at runtime.
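The 4-types-by-3-sizes bound (at most 12 species) and the lazy-materialization point can be sketched as follows. This is a minimal illustration, not the prototype's actual code; `Species` and `SpeciesCache` are hypothetical names:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: species (element type + vector size) are
// materialized lazily, so only the combinations actually used at
// runtime ever exist, even though the bound is 4 types x 3 sizes = 12.
final class Species {
    final Class<?> elementType;   // int, long, float or double
    final int bitSize;            // 128, 256 or 512
    Species(Class<?> elementType, int bitSize) {
        this.elementType = elementType;
        this.bitSize = bitSize;
    }
    int laneCount() {
        // lanes = vector bits / element bits (64 for long/double, 32 otherwise)
        int elemBits = (elementType == long.class || elementType == double.class) ? 64 : 32;
        return bitSize / elemBits;
    }
}

final class SpeciesCache {
    private static final Map<String, Species> CACHE = new ConcurrentHashMap<>();

    // Returns the canonical (cached) species for a type/size pair,
    // creating it on first use.
    static Species of(Class<?> elementType, int bitSize) {
        return CACHE.computeIfAbsent(elementType.getName() + "/" + bitSize,
                                     k -> new Species(elementType, bitSize));
    }

    static int liveCount() { return CACHE.size(); }
}
```

Repeated requests for the same species return the same canonical object, so the runtime population stays at exactly the set of species actually touched.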

Such tricks are common in the java.lang.invoke framework, which relies 
heavily on the JVM to compile MethodHandle usages efficiently.

For the Vector API it is important to specialize for element type and 
vector size in order to map operations to vector instructions.

The API clearly separates typed & untyped vectors:
   Vector.Shape == j.l.i.Long2/4/8
   Vector.Species == element type + vector size
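A minimal sketch of that typed/untyped split, with `RawShape` standing in for the j.l.i.Long2/4/8 containers; all names here are illustrative, not the prototype's:

```java
// Hypothetical sketch: a Shape is an opaque bit container of a fixed
// width (standing in for Long2/Long4/Long8), while a Species layers an
// element-type interpretation on top of it.
enum RawShape {
    S128(128), S256(256), S512(512);   // Long2, Long4, Long8

    final int bits;
    RawShape(int bits) { this.bits = bits; }
}

final class TypedSpecies {
    final Class<?> elementType;   // the typed part
    final RawShape shape;         // the untyped part
    TypedSpecies(Class<?> elementType, RawShape shape) {
        this.elementType = elementType;
        this.shape = shape;
    }
    @Override public String toString() {
        return elementType.getSimpleName() + "x" + shape.bits;
    }
}
```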

So, the question is how to map operations on elements to 
platform-specific instructions. VectorOp seems a viable solution to 
reliably detect the intended behavior and use appropriate instruction 
from the available repertoire.

But I haven't thought much about how to implement it. Either introduce 
special API entry points which consume VectorOp.Op (simplest?) or 
produce specialized lambdas for VectorOp.Op on a per-species basis 
(more optimizable?).
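The first option (entry points that consume a known op constant) could look roughly like this. `IntOp` and `IntVec` are made-up stand-ins for VectorOp.Op and a concrete vector class, not the actual API:

```java
import java.util.function.IntBinaryOperator;

// Hypothetical sketch of the "special entry points" option: map takes a
// known op constant, so the runtime can recognize the intent (ADD, MUL)
// and select a vector instruction, instead of cracking an opaque lambda.
enum IntOp implements IntBinaryOperator {
    ADD((a, b) -> a + b),
    MUL((a, b) -> a * b);

    private final IntBinaryOperator fn;
    IntOp(IntBinaryOperator fn) { this.fn = fn; }
    public int applyAsInt(int a, int b) { return fn.applyAsInt(a, b); }
}

final class IntVec {
    final int[] lanes;
    IntVec(int... lanes) { this.lanes = lanes; }

    IntVec map(IntOp op, IntVec other) {
        // A real implementation would dispatch ADD/MUL to a machine code
        // snippet; this sketch applies the op lane-wise in Java.
        int[] r = new int[lanes.length];
        for (int i = 0; i < lanes.length; i++)
            r[i] = op.applyAsInt(lanes[i], other.lanes[i]);
        return new IntVec(r);
    }
}
```

Because the op argument is a constant from a closed set, detecting the intended behavior is a table lookup rather than lambda cracking.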

Machine code snippets are intended to be used at the lowest level: once 
the mapping is set, execute the appropriate platform-specific 
instruction by invoking the corresponding snippet (represented as a 
MethodHandle).

With the proper amount of specialization and the ability to crack 
lambdas, the JVM should be able to reliably detect which snippets a 
Vector API pipeline uses and replace them with the corresponding 
instructions in the generated code.
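As an illustration of the MethodHandle-based shape, with a plain Java method standing in for generated machine code (the names are hypothetical):

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical sketch: a "snippet" exposed as a MethodHandle. Here the
// handle points at a plain Java fallback; in the prototype it would be
// bound to generated machine code (e.g. a vector add instruction).
final class Snippet {
    static int addLanes(int a, int b) { return a + b; }  // scalar stand-in

    static final MethodHandle ADD;
    static {
        try {
            ADD = MethodHandles.lookup().findStatic(
                Snippet.class, "addLanes",
                MethodType.methodType(int.class, int.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static int invokeAdd(int a, int b) {
        try {
            // invokeExact requires the call-site signature to match the
            // handle's type exactly: (int, int) -> int.
            return (int) ADD.invokeExact(a, b);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }
}
```

Once the JIT inlines such a constant handle, replacing its target with a machine instruction is a local transformation.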

> Another issue we’re considering is how to structure an API that maximizes the composability of the primitives that we support.  The work on Codesnippets is definitely something we should build on, but we have to give some concern to the overhead that it requires (un/boxing arguments, support for calling convention, etc.)  I think that there’s room to alleviate these issues at the compilation level, but I’m also curious if we can direct an API to support composition in such a way that allows us to generate platform-level instructions without the boxing overhead (if it’s necessary).  I’m curious about your thoughts on promoting compositionality like this from an API standpoint.  I know there’s some precedent for it with the Stream APIs, which promote a declarative style of function/method composition.  Perhaps there’s something here we can use?
Specialization for vector species (element type + vector size) should 
provide a path to optimal code using platform-specific vector 
instructions.

Unfortunately, there's no better way to represent vectors at the 
bytecode level than the Long2/4/8 value-based wrappers. The JVM simply 
has no primitive equivalents of int128_t/int256_t/int512_t, so some 
amount of boxing is inevitable for now. But C2 support for eliminating 
redundant boxing/unboxing should alleviate most of the performance 
problems, and the larger Vector API pipelines become, the less 
boxing/unboxing should remain in the optimized code.

Vector with per-element operations (map*, test, add, neg, ...) plus 
VectorOp (for lambda introspection) seems like a good fit for a 
composable API.
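A usage-level sketch of that composable style, using a toy 4-lane int vector (all names illustrative):

```java
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of the composable per-element style: every
// operation consumes and produces a whole vector, so a pipeline like
// v.add(w).neg().map(...) is a chain of whole-vector steps, each of
// which could map to a single wide instruction.
final class Int4 {
    final int[] lanes;  // 4 lanes, standing in for a 128-bit int vector
    Int4(int... lanes) { this.lanes = lanes; }

    Int4 add(Int4 o) {
        int[] r = new int[4];
        for (int i = 0; i < 4; i++) r[i] = lanes[i] + o.lanes[i];
        return new Int4(r);
    }
    Int4 neg() {
        int[] r = new int[4];
        for (int i = 0; i < 4; i++) r[i] = -lanes[i];
        return new Int4(r);
    }
    Int4 map(IntUnaryOperator op) {
        int[] r = new int[4];
        for (int i = 0; i < 4; i++) r[i] = op.applyAsInt(lanes[i]);
        return new Int4(r);
    }
    int reduce() {  // horizontal add across all lanes
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }
}
```

A pipeline like `new Int4(1, 2, 3, 4).add(new Int4(1, 1, 1, 1)).neg().reduce()` runs each step over all four lanes before starting the next one.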

Regarding the Stream API, it doesn't look too attractive except for 
interoperability. Stream fusion (performing the whole pipeline in a 
single pass) doesn't help vectors: it is better to apply the pipeline 
step by step to the whole vector than to split the vector into pieces.

What does look viable is adding Vector API-directed optimizations to 
the Stream API implementation: take an {Int,Long,Double}Stream with 
VectorOp-based operations and, using a vector-aware spliterator, 
translate the pipeline on primitives into a pipeline on vectors.
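A minimal sketch of such a translation, processing 8-lane chunks with a scalar tail; the lane width and all names are assumptions, not the prototype's:

```java
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of a vector-aware translation: instead of a
// per-element stream pipeline, process the array in vector-sized
// chunks and fall back to scalar code for the remaining tail.
final class VectorizedMap {
    static final int LANES = 8;  // e.g. 8 ints per 256-bit vector

    static int[] map(int[] src, IntUnaryOperator op) {
        int[] dst = new int[src.length];
        int i = 0;
        // Vector loop: a real implementation would load one vector,
        // apply a recognized VectorOp, and store it back per iteration.
        for (; i + LANES <= src.length; i += LANES)
            for (int j = 0; j < LANES; j++)  // stands in for one vector op
                dst[i + j] = op.applyAsInt(src[i + j]);
        // Scalar tail for the final (src.length % LANES) elements.
        for (; i < src.length; i++)
            dst[i] = op.applyAsInt(src[i]);
        return dst;
    }
}
```

A vector-aware spliterator would hand out exactly such LANES-sized chunks, so each stream stage becomes one whole-vector operation.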

Best regards,
Vladimir Ivanov

>
> --Ian
>
> -----Original Message-----
> From: panama-dev [mailto:panama-dev-bounces at openjdk.java.net] On Behalf Of Vladimir Ivanov
> Sent: Wednesday, January 27, 2016 9:49 AM
> To: panama-dev at openjdk.java.net
> Subject: Re: implementation of Vector API and associated lambdas
>
> On the platform level, along with the machine code snippets prototype, three value-based wrappers for 128-, 256-, and 512-bit vector values were added (j.l.Long2 [1], j.l.Long4 [2], and j.l.Long8 [3]).
>
> They are used to represent vector values and the JVM does special handling when passing them into code snippets (it tries to place the values in vector registers instead of on the heap). Value type experiments are part of Project Valhalla, but as a stopgap solution I decided to introduce custom optimizations specifically for Long2/Long4/Long8. Once the JVM understands value types, the vector value wrappers should be marked as such.
>
> For a while, I have been playing with binding the Vector API to vector instructions using machine code snippets. The idea is to take a primitive specialization of Vector (IntVector), map j.l.Long2/4/8 to Vector.Shapes, and bind some methods to the corresponding vector instructions on x86. My main motivation is to see how it all shapes up from the abstract API down to the machine level, with a focus on the quality of the generated code. Providing a complete implementation is out of scope for now (it requires some form of lambda decomposition John already wrote about).
>
> C2 support for code snippets provides a nice way to embed arbitrary machine code into the code the JVM generates on the fly. But the final generated code is far from optimal when snippet-rich code is compiled:
> there are too many spills of vector values when they are passed around.
> Such a code shape is usually observed when many 1/2-instruction snippets are composed (which is the case for the Vector API).
>
> It is a direct consequence of the current code snippet representation at the C2 IR level (as a native call). The registers for input and output values are fixed, so there's high contention on them when multiple snippets are in play.
>
> In order to produce more efficient code, the C2 register allocator (RA) should be able to rename the registers used in the snippets being inlined.
>
>   From the implementation perspective, the JVM Compiler Interface (JVMCI [4]) introduced in JDK 9 is a big help here, since it already contains an ISA description [5]. It can be used as a foundation for an API (akin to the C2 Architecture Description Language [6] [7]) to describe the input/output register masks and other effects of a code snippet.
>
> I won't dive into the implementation details of how snippet code patching (according to RA decisions) could be implemented, but I envision 2 ways to do it: from the JVM (requires machine code parsing support) or on the user side (an upcall with bound registers which returns the final machine code for a snippet [9]). It will definitely complicate the prototype, but it doesn't look infeasible.
>
> Let me know if anybody wants to experiment with it :-)
>
> Best regards,
> Vladimir Ivanov
>
> [1]
> http://hg.openjdk.java.net/panama/panama/jdk/file/tip/src/java.base/share/classes/java/lang/Long2.java
>
> [2]
> http://hg.openjdk.java.net/panama/panama/jdk/file/tip/src/java.base/share/classes/java/lang/Long4.java
>
> [3]
> http://hg.openjdk.java.net/panama/panama/jdk/file/tip/src/java.base/share/classes/java/lang/Long8.java
>
> [4]
> http://hg.openjdk.java.net/panama/panama/hotspot/file/tip/src/jdk.vm.ci/share/classes/
>
> [5]
> http://hg.openjdk.java.net/panama/panama/hotspot/file/tip/src/jdk.vm.ci/share/classes/jdk.vm.ci.amd64/src/jdk/vm/ci/amd64/AMD64.java
>
> [6]
> http://hg.openjdk.java.net/panama/panama/hotspot/file/tip/src/share/vm/adlc/Doc/Syntax.doc
>
> [7]
> http://hg.openjdk.java.net/panama/panama/hotspot/file/tip/src/cpu/x86/vm/x86_64.ad
>
> [8]
> http://hg.openjdk.java.net/panama/panama/hotspot/file/e01d58f01f33/src/cpu/x86/vm/x86_64.ad#l11885
>
> [9]
> http://hg.openjdk.java.net/panama/panama/hotspot/file/e01d58f01f33/src/cpu/x86/vm/x86_64.ad#l11885
>
> On 1/22/16 3:31 AM, John Rose wrote:
>> Several people have been looking at refining and implementing the straw-man Vector API I proposed[1].  The exchange of December 4 [2] shows some of this work but there's been more!
>>
>> [1]: http://cr.openjdk.java.net/~jrose/arrays/vector/Vector.java
>> <http://cr.openjdk.java.net/~jrose/arrays/vector/Vector.java>
>> [2]:
>> http://mail.openjdk.java.net/pipermail/panama-dev/2015-December/thread
>> .html#225
>> <http://mail.openjdk.java.net/pipermail/panama-dev/2015-December/threa
>> d.html#225>
>>
>> First, background:  Vector types for Java are an important aspect for
>> Panama, because Panama is about binding Java more closely to hardware
>> and system APIs.  One big reason, today, for binding closely to
>> hardware is to process wider data flows in SIMD modes.  (And IMO this
>> is a long-term trend towards right-sizing data channel widths, as
>> hardware grows wider in various ways.)  Vector.java and AVX bindings
>> are where we are experimenting, today, with busting Java's 64-bit
>> limit.  (In the future this will support a variety of SIMD CPUs, and
>> converge smoothly with Valhalla value types and parametric
>> polymorphism, to create a natural looking set of primitive-like value
>> types for vectors of all types, on all platforms.)
>>
>> So, recently, Paul Sandoz has created a skeletal web of concrete types that "bottoms out" in a simple byte-based container.  And Ian Graves (of Intel) has whipped up a single-class reference implementation, that uses extra metadata fields to steer around the shapes and element types.  Michael Berg (Intel) is our expert in compiling loops through the HotSpot SuperWord optimizations into the latest and greatest Intel vector instructions. Finally, Vladimir has been working on a strategy for creating quick punches all the way through to the CPU, in a way that should support quick prototyping of vector intrinsics, on all of our platforms.  (Quick prototyping does not mean the feature is of temporary use:  It seems there are always new instructions to grapple with, and there may always be a place for doing so without tight integration with the JIT.  It's the case now, anyway.  In a future Graal-based world, adding plugins for new instructions will be much easier; maybe that will become the best way to prototype with new instructions.)
>>
>> One perhaps surprising aspect of the Vector API is that it factors the complexity of the API by lambda-parameterizing families of methods; the lambdas supply elementwise or reduction operations (as v.map(fn) or v.reduce(fn) instead of v.negate(), v.sum(), etc.).  This is inspired by JDK 8 streams, of course.  IMO, it is an excellent bit of usability and future-proofing, but it has a cost:  Instruction selection for the vector operation requires that the lambda be "cracked" to examine the part of the loop kernel it carries.  The standard way to do this is wait for the JIT to inline everything (and hope it doesn't get tired of inlining halfway through).  When this works, it is spectacularly successful.  It can fail silently.  And it requires special powers and privileges to do the cracking, which are not trivial to define (though .NET has done it successfully).
>>
>> I want to keep the lambdas, but Ian has pointed out that we need to start by special-casing some of the methods (e.g., lane-wise or cross-lane addition) for demonstration purposes.  That's fine; we did things like that with streams.  To make the lambdas carry the weight, we want (IMO) two things, one short-term and one long-term.  In the long-term we want a lambda cracker, something like our method-handle cracker.  It should be able to completely analyze at least simple lambdas like `(x,y)->(x+y)`, which are really just wrapped `iadd` instructions.  The JVMCI should also provide a robust way to crack them all, but we have to be careful how we use the high privileges (Unsafe-level) required to use the JVMCI.
>>
>> In the short term, I propose creating a VectorOp interface (and/or VectorOps class) which contains a large number of factories and/or static finals which give alternate notations for the loop kernel components (like `iadd`) required by the Vector API.  It's an under-appreciated fact that lambdas interoperate with interfaces, and a lambda-like value can implement not only a primary interface, but additional secondary ones.  In the case of VectorOps, the IntOperator implementing scalar 32-bit `iadd` should also implement whatever additional interfaces are necessary for the Vector API to properly categorize it as part of a loop.  That means one or both of the following:  1. A metadata API which voluntarily reports the operation.  As, "I am really just iadd", or maybe "I am the following expression tree".  2. An API lifted to vectors, either "I am willing to take 4 sets of inputs at once", or "I am willing to produce a method handle to work on right-sized vectors".  I've posted an example of what I mean by this [3].
>>
>> [3]: http://cr.openjdk.java.net/~jrose/arrays/vector/VectorOp.java
>>
>> Working with crackable lambdas might give us insights into helping JDK 8 streams also, in ways that do not require always-lucky inlining.  Of course we don't want to require users, in general, to use lambdas from a fixed menu (like VectorOps) but working with the side handshakes will help us understand what a general lambda cracker should look like, if it must help vectorize loops.
>>
>> Now for some thoughts about the vectors themselves.  Basically, I
>> think we should do whatever it takes to make the concrete
>> representation (whether hidden or not) of a Vector to be as simple as
>> possible, even if this requires complicated class structure.  Each
>> individual concrete class must, of course, support full box-elision
>> optimization so aggressive that it can laugh at Lady Luck.  (We'll get
>> there in Valhalla, where boxing is a rare, relatively explicit event,
>> but currently we need to work harder to lift our vectors out of Java's
>> data model, which requires a heap node for every value.)
>>
>> In the end, I think we will prefer to factor every property about a vector, except for the actual payload bits, into an auxiliary object (shared by all vectors of the same "species"), and use a class-per-species implementation strategy.  This will lead to MxN classes, lazily generated under the covers; Valhalla parametric polymorphism will reduce one or both factors.
>>
>> (Rationale:  When they become Valhalla value types, they will be
>> pretty large, and it will be best if they fit cleanly into a vector
>> register without extra fields.  Also, for now, class-based data
>> constant-folds more reliably than field-based constant data, the
>> notorious trust-final-fields problem.  In any case, the JVM has lots
>> of class inference, profiling, and speculation logic; the more info.
>> we derive from the concrete class, the more we leverage that mature
>> technology. There's going to be an object header anyway, so let's put
>> it to work!)
>>
>> Perhaps fast type-specific paths can be mediated by boxed generic entry points (which as a downside rely on always-lucky box elision).  Or, following Paul's VarHandles, they can use method handles for fast type-specific entry points, in which case a species object contains a bundle of preset method handles, and looks very much like a v-table.  That sounds better to me, since MHs require less luck to choose efficient calling sequences.  Valhalla parameter instances will support species more directly (N species per class) but for now we probably want a 1-1 species/class ratio.
>>
>> The various concrete classes would then extend a small number of
>> intrinsically magic, maximally aligned, pure-data container types
>> (Long2/4/8), adding various parameterized interfaces with default
>> methods, plus (as absolutely necessary) mechanically specialized
>> methods.  (BTW, some of those interfaces, like Vector itself, are
>> public.  Some, like the lambda cracking APIs mentioned above, would be
>> only for internal handshaking—modules give us more options for hiding
>> public names, thus encapsulating interface views despite publicity of
>> methods.)
>>
>> For maximum reuse and cleanliness, it should be the case that nearly all of the code is either generic (some of it packaged in interface default methods), or else is generated on the fly via indy instructions (in the concrete classes) or via MH-based handshakes originating in the generic code.  Big extra bonus points if we don't resort to ASM, but that's likely only in the Valhalla world, since concrete signatures may need to be adjusted (which is a Valhalla super-power).
>>
>> Depending on how we use MHs, the vectors might act as if they are coded in a dynamic language designed for expanding loop kernels.  On a Graal-based stack we might experiment with Truffle to do stuff like that.  For now we need to build on JDK 8 technology for this project, which means closures and indy.
>>
>> — John
>>

