implementation of Vector API and associated lambdas

Fri Jan 29 19:05:33 UTC 2016

Ian,

On 1/29/16 4:12 AM, Graves, Ian L wrote:
> The Codesnippets work is really cool.  I'm really excited about what we can build on it.
Thanks! Glad to hear that :-)

> Another issue we’re considering is how to structure an API that maximizes the composability of the primitives that we support.  The work on Codesnippets is definitely something we should build on, but we have to give some concern to the overhead that it requires (un/boxing arguments, support for calling convention, etc.)  I think that there’s room to alleviate these issues at the compilation level, but I’m also curious if we can direct an API to support composition in such a way that allows us to generate platform-level instructions without the boxing overhead (if it’s necessary).  I’m curious about your thoughts on promoting compositionality like this from an API standpoint.  I know there’s some precedent for it with the Stream APIs, which promote a declarative style of function/method composition.  Perhaps there’s something here we can use?
j.l.Long{2,4,8} are designed to eventually become true values and live 
either in registers of appropriate size or on the heap, but without 
object header, like primitives.

But so far, only C2 respects their originality. Neither interpreter, nor 
C1 don't know about anything larger than 64-bit. So, in order to 
preserve them intact, they have to be boxed. It happens whenever the 
compilation boundary is crossed or safepoint is reached.

Eventually (with the rise of user-defined value types, which are 
explored in Project Valhalla) JVM should gain more support and there 
will be much fewer cases when boxing is needed (e.g. interface virtual 
call still needs a receiver object).

So, IMO the lowest level we need to think about in Java code is 
Long{2,4,8} classes and push the JVM to optimize them. I haven't 
invested much time into optimizations for value-based classes. I hope we 
can piggyback on Project Valhalla explorations in the area of VM support 
of value types.

Regarding calling conventions for code snippets, it was just an 
simplifying assumption for the first prototype. I shared some ideas how 
to relax fixed input/output registers in the previous letter.

I did a quick prototype of user-assisted register allocation (regenerate 
the snippet once RA is over) and it works [1] :-)

API:

public static MethodHandle make(
	String name,
	MethodType mt,
	boolean isSupported,
         Register[][] regMasks,
         Function<Register[],int[]> snippetGenerator)

The idea is that a user doesn't provide a raw machine code, but a 
function which produces snippet machine code. Also, it is possible to 
provide register masks

Under the hood, the framework produces a stand-alone version by feeding 
in registers for standard calling convention. And when C2 kicks in and 
tries to inline the snippet, it uses input/output register masks during 
RA and requests the generator a custom version of the snippet which 
satisfies that particular call site.

Example:

   MethodHandle MHm128_add_epi32 = PatchableCodeSnippet.make(
             "m128_add_epi32", MT_L2_BINARY, requires(AVX),
             new Register[][] { xmmRegistersSSE, // out
                                xmmRegistersSSE, // in1
                                xmmRegistersSSE}, // in2
             (Register[] regs) -> {
                 // PADDD xmm1, xmm2, xmm3/m128
                 Register out = regs[0];
                 Register in1 = regs[1];
                 Register in2 = regs[2];

                 int[] vex = vex2(1, (in1 != out ? out.encoding() : 0), 
0, 1);
                 return new int[] {
                         vex[0], vex[1],
                         0xFE,
                         modRM(in1, in2)};
             });

The API is very raw, and doesn't take into account other effects (e.g., 
memory effects, register kills). But it demonstrates the opportunities 
for future improvements.

> Some of the discussions we've had at Intel and with John have suggested compilation-level solutions to this problem that are in line with what Vladimir has suggested.  I like that we're coalescing on similar approaches.  I think between a good API-level approach and support on the back end we can come up with something that's performant and usable.
>
> There has been some on and off mention of Graal and/or JVMCI for prototyping this API.  What are the thoughts on that at this point?
I haven't though much about Graal, but it looks very promising. It would 
definitely simplify JVM <-> Java interactions in the prototype.

Regarding JVMCI, it looks like a perfect fit for our purposes. It 
provides a platform-specific API (e.g., rich CPU-specific info) and a 
number of very useful VM entry points (ability to install generated 
code). Register-allocatable snippets provide a nice example of API 
simplification. The only problem I see is that JVMCI API is hidden - the 
classes live in their own class loader which is isolated from the rest 
of the world and is guarded by -XX:+UseJVMCI flag.

Best regards,
Vladimir Ivanov

[1] 
http://cr.openjdk.java.net/~vlivanov/panama/code_snippets/ra.v00/PatchableVecUtils.java

> -----Original Message-----
> From: panama-dev [mailto:panama-dev-bounces at openjdk.java.net] On Behalf Of Vladimir Ivanov
> Sent: Wednesday, January 27, 2016 9:49 AM
> To: panama-dev at openjdk.java.net
> Subject: Re: implementation of Vector API and associated lambdas
>
> On the platform level, along with machine code snippets prototype, 3 value-based wrappers for 128-, 256-, and 512-bit vector values were added (j.l.Long2 [1], j.l.Long4 [2], and j.l.Long8 [3]).
>
> They are used to represent vector values and JVM does special handling when passing them into code snippets (tries to place the values in vector registers instead of on the heap). Value type experiments are part of Project Valhalla, but as a stop-the-gap solution I decided to introduce custom optimizations specifically for Long2/Long4/Long8. Once JVM understands value types, the vector value wrappers should be marked as such.
>
> For a while, I have been playing with binding Vector API to vector instructions using machine code snippets. The idea is to take a primitive specialization of Vector (IntVector), map j.l.Long2/4/8 to Vector.Shape's, and bind some methods to corresponding vector instructions on x86. My main motivation is to see how it all shapes from abstract API down to machine level with the focus on the quality of generated code. Providing complete implementation is out of scope for now (it requires some form of lambda decomposition John already wrote about).
>
> C2 support for code snippets provides a nice way to embed arbitrary machine code into the code JVM generates on the fly. But the final generated code is far from optimal when snippet-rich code is compiled:
> there are too many spills of vector values when they are passed around.
> Such code shape is usually observed when there are many 1/2-instruction snippets composed (which is the case of Vector API).
>
> It is a direct consequence of current code snippet representation on C2 IR level (as a native call). The registers for input and output values are fixed, so there's high contention on them when multiple snippets are in play.
>
> In order to produce more efficient code, C2 register allocator (RA) should be able to rename the registers used in the snippets being inlined.
>
>   From the implementation perspective, JVM Compiler Interface (JVMCI [4]) introduced in JDK 9 is a big help here, since it already contains ISA description [4]. It can be used as a foundation for an API (akin to C2 Architecture Description Language [6] [7]) to describe input/output register masks and different effects of a code snippet.
>
> I won't dive into implementation details of how snippet code patching (according to RA decisions) can be implemented, but I envision 2 ways to do that: from JVM (requires machine code parsing support) or on the user side (upcall with bound registers which returns final machine code for a snippet [9]). It will definitely complicate the prototype, but doesn't look as something unfeasible.
>
> Let me know if anybody wants to experiment with it :-)
>
> Best regards,
> Vladimir Ivanov
>
> [1]
> http://hg.openjdk.java.net/panama/panama/jdk/file/tip/src/java.base/share/classes/java/lang/Long2.java
>
> [2]
> http://hg.openjdk.java.net/panama/panama/jdk/file/tip/src/java.base/share/classes/java/lang/Long4.java
>
> [3]
> http://hg.openjdk.java.net/panama/panama/jdk/file/tip/src/java.base/share/classes/java/lang/Long8.java
>
> [4]
> http://hg.openjdk.java.net/panama/panama/hotspot/file/tip/src/jdk.vm.ci/share/classes/
>
> [5]
> http://hg.openjdk.java.net/panama/panama/hotspot/file/tip/src/jdk.vm.ci/share/classes/jdk.vm.ci.amd64/src/jdk/vm/ci/amd64/AMD64.java
>
> [6]
> http://hg.openjdk.java.net/panama/panama/hotspot/file/tip/src/share/vm/adlc/Doc/Syntax.doc
>
> [7]
> http://hg.openjdk.java.net/panama/panama/hotspot/file/tip/src/cpu/x86/vm/x86_64.ad
>
> [8]
> http://hg.openjdk.java.net/panama/panama/hotspot/file/e01d58f01f33/src/cpu/x86/vm/x86_64.ad#l11885
>
> [9]
> http://hg.openjdk.java.net/panama/panama/hotspot/file/e01d58f01f33/src/cpu/x86/vm/x86_64.ad#l11885
>
> On 1/22/16 3:31 AM, John Rose wrote:
>> Several people have been looking at refining and implementing the straw-man Vector API I proposed[1].  The exchange of December 4 [2] shows some of this work but there's been more!
>>
>> [1]: http://cr.openjdk.java.net/~jrose/arrays/vector/Vector.java
>> <http://cr.openjdk.java.net/~jrose/arrays/vector/Vector.java>
>> [2]:
>> http://mail.openjdk.java.net/pipermail/panama-dev/2015-December/thread
>> .html#225
>> <http://mail.openjdk.java.net/pipermail/panama-dev/2015-December/threa
>> d.html#225>
>>
>> First, background:  Vector types for Java are an important aspect for
>> Panama, because Panama is about binding Java more closely to hardware
>> and system APIs.  One big reason, today, for binding closely to
>> hardware is to process wider data flows in SIMD modes.  (And IMO this
>> is a long-term trend towards right-sizing data channel widths, as
>> hardware grows wider in various ways.)  Vector.java and AVX bindings
>> are where we are experimenting, today, with busting Java's 64-bit
>> limit.  (In the future this will support a variety of SIMD CPUs, and
>> converge smoothly with Valhalla value types and parametric
>> polymorphism, to create a natural looking set of primitive-like value
>> types for vectors of all types, on all platforms.)
>>
>> So, recently, Paul Sandoz has created a skeletal web of concrete types that "bottoms out" in a simple byte-based container.  And Ian Graves (of Intel) has whipped up a single-class reference implementation, that uses extra metadata fields to steer around the shapes and element types.  Michael Berg (Intel) is our expert in compiling loops through the HotSpot SuperWord optimizations into the latest and greatest Intel vector instructions. Finally, Vladimir has been working on a strategy for creating quick punches all the way through to the CPU, in a way that should support quick prototyping of vector intrinsics, on all of our platforms.  (Quick prototyping does not mean the feature is of temporary use:  It seems there are always new instructions to grapple with, and there may always be a place for doing so without tight integration with the JIT.  It's the case now, anyway.  In a future Graal-based world, adding plugins for new instructions will be much easier; maybe that wil!
 l!
>    become th
> e best way to prototype with new instructions.)
>>
>> One perhaps surprising aspect of the Vector API is that it factors the complexity of the API by lambda-parameterizing families of methods; the lambdas supply elementwise or reduction operations (as v.map(fn) or v.reduce(fn) instead of v.negate(), v.sum(), etc.).  This is inspired by JDK 8 streams, of course.  IMO, it is an excellent bit of usability and future-proofing, but it has a cost:  Instruction selection for the vector operation requires that the lambda be "cracked" to examine the part of the loop kernel it carries.  The standard way to do this is wait for the JIT to inline everything (and hope it doesn't get tired of inlining halfway through).  When this works, it is spectacularly successful.  It can fail silently.  And it requires special powers and privileges to do the cracking, which are not trivial to define (though .NET has done it successfully).
>>
>> I want to keep the lambdas, but Ian has pointed out that we need to start by special-casing some of the methods (e.g., lane-wise or cross-lane addition) for demonstration purposes.  That's fine; we did things like that with streams.  To make the lambdas carry the weight, we want (IMO) two things, one short-term and one long-term.  In the long-term we want a lambda cracker, something like our method-handle cracker.  It should be able to completely analyze at least simple lambdas like `(x,y)->(x+y)`, which are really just wrapped `iadd` instructions.  The JVMCI should also provide a robust way to crack them all, but we have to be careful how we use the high privileges (Unsafe-level) required to use the JVMCI.
>>
>> In the short term, I propose creating a VectorOp interface (and/or VectorOps class) which contains a large number of factories and/or static finals which give alternate notations for the loop kernel components (like `iadd`) required by the Vector API.  It's an under-appreciated fact that lambdas interoperate with interfaces, and a lambda-like value can implement not only a primary interface, but additional secondary ones.  In the case of VectorOps, the IntOperator implementing scalar 32-bit `iadd` should also implement whatever additional interfaces are necessary for the Vector API to properly categorize it as part of a loop.  That means one or both of the following:  1. A metadata API which voluntarily reports the operation.  As, "I am really just iadd", or maybe "I am the following expression tree".  2. An API lifted to vectors, either "I am willing to take 4 sets of inputs at once", or "I am willing to produce a method handle to work on right-sized vectors".  I've post!
 e!
>   d an examp
> le of what I mean by this [3].
>>
>> [3]: http://cr.openjdk.java.net/~jrose/arrays/vector/VectorOp.java
>>
>> Working with crackable lambdas might give us insights into helping JDK 8 streams also, in ways that do not require always-lucky inlining.  Of course we don't want to require users, in general, to use lambdas from a fixed menu (like VectorOps) but working with the side handshakes will help us understand what a general lambda cracker should look like, if it must help vectorize loops.
>>
>> Now for some thoughts about the vectors themselves.  Basically, I
>> think we should do whatever it takes to make the concrete
>> representation (whether hidden or not) of a Vector to be as simple as
>> possible, even if this requires complicated class structure.  Each
>> individual concrete class must, of course, support full box-elision
>> optimization so aggressive that it can laugh at Lady Luck.  (We'll get
>> there in Valhalla, where boxing is a rare, relatively explicit event,
>> but currently we need to work harder to lift our vectors out of Java's
>> data model, which requires a heap node for every value.)
>>
>> In the end, I think we will prefer to factor every property about a vector, except for the actual payload bits, into an auxiliary object (shared by all vectors of the same "species"), and use a class-per-species implementation strategy.  This will lead to MxN classes, lazily generated under the covers; Valhalla parametric polymorphism will reduce one or both factors.
>>
>> (Rationale:  When they become Valhalla value types, they will be
>> pretty large, and it will be best if they fit cleanly into a vector
>> register without extra fields.  Also, for now, class-based data
>> constant-folds more reliably than field-based constant data, the
>> notorious trust-final-fields problem.  In any case, the JVM has lots
>> of class inference, profiling, and speculation logic; the more info.
>> we derive from the concrete class, the more we leverage that mature
>> technology. There's going to be an object header anyway, so let's put
>> it to work!)
>>
>> Perhaps fast type-specific paths can be mediated by boxed generic entry points (which as a downside rely on always-lucky box elision).  Or, following Paul's VarHandles, they can use method handles for fast type-specific entry points, in which case a species object contains a bundle of preset method handles, and looks very much like a v-table.  That sounds better to me, since MHs require less luck to choose efficient calling sequences.  Valhalla parameter instances will support species more directly (N species per class) but for now we probably want a 1-1 species/class ratio.
>>
>> The various concrete classes would then extend a small number of
>> intrinsically magic, maximally aligned, pure-data container types
>> (Long2/4/8), adding various parameterized interfaces with default
>> methods, plus (as absolutely necessary) mechanically specialized
>> methods.  (BTW, some of those interfaces, like Vector itself, are
>> public.  Some, like the lambda cracking APIs mentioned above, would be
>> only for internal handshaking—modules give us more options for hiding
>> public names, thus encapsulating interface views despite publicity of
>> methods.)
>>
>> For maximum reuse and cleanliness, it should be the case that nearly all of the code is either generic (some of it packaged in interface default methods), or else is generated on the fly via indy instructions (in the concrete classes) or via MH-based handshakes originating in the generic code.  Big extra bonus points if we don't resort to ASM, but that's likely only in the Valhalla world, since concrete signatures may need to be adjusted (which is a Valhalla super-power).
>>
>> Depending on how we use MHs, it the vectors might act as if they are coded in a dynamic language designed for expanding loop kernels.  On a Graal-based stack we might experiment with Truffle to do stuff like that.  For now we need to build on JDK 8 technology for this project, which means closures and indy.
>>
>> — John
>>