implementation of Vector API and associated lambdas
Vitaly Davidovich
vitalyd at gmail.com
Fri Jan 22 15:02:38 UTC 2016
Also, on a slight tangent (sorry), is there a rough idea of when someone
will work on addressing the various long-standing JBS entries around
brittle/suboptimal inlining? Given how much inlining matters for Java (much
more so than say C or C++), it'd be nice if some forward progress was made
on them.
Thanks
On Friday, January 22, 2016, Vitaly Davidovich <vitalyd at gmail.com> wrote:
> Lots of good thoughts here John and I'm excited you guys are exploring
> this.
>
> I'd like to add that cracking lambdas would be beneficial in general.
> Right now, for precisely the reasons you mentioned, one has to avoid them
> in hot paths even when they make the code and API cleaner, which is very
> unfortunate.
>
> On Thursday, January 21, 2016, John Rose <john.r.rose at oracle.com> wrote:
>
>> Several people have been looking at refining and implementing the
>> straw-man Vector API I proposed[1]. The exchange of December 4 [2] shows
>> some of this work but there's been more!
>>
>> [1]: http://cr.openjdk.java.net/~jrose/arrays/vector/Vector.java
>> [2]: http://mail.openjdk.java.net/pipermail/panama-dev/2015-December/thread.html#225
>>
>> First, background: Vector types for Java are an important aspect for
>> Panama, because Panama is about binding Java more closely to hardware and
>> system APIs. One big reason, today, for binding closely to hardware is to
>> process wider data flows in SIMD modes. (And IMO this is a long-term trend
>> towards right-sizing data channel widths, as hardware grows wider in
>> various ways.) Vector.java and AVX bindings are where we are
>> experimenting, today, with busting Java's 64-bit limit. (In the future
>> this will support a variety of SIMD CPUs, and converge smoothly with
>> Valhalla value types and parametric polymorphism, to create a natural
>> looking set of primitive-like value types for vectors of all types, on all
>> platforms.)
>>
>> So, recently, Paul Sandoz has created a skeletal web of concrete types
>> that "bottoms out" in a simple byte-based container. And Ian Graves (of
>> Intel) has whipped up a single-class reference implementation that uses
>> extra metadata fields to steer around the shapes and element types.
>> Michael Berg (Intel) is our expert in compiling loops through the HotSpot
>> SuperWord optimizations into the latest and greatest Intel vector
>> instructions. Finally, Vladimir has been working on a strategy for creating
>> quick punches all the way through to the CPU, in a way that should support
>> quick prototyping of vector intrinsics, on all of our platforms. (Quick
>> prototyping does not mean the feature is of temporary use: It seems there
>> are always new instructions to grapple with, and there may always be a
>> place for doing so without tight integration with the JIT. It's the case
>> now, anyway. In a future Graal-based world, adding plugins for new
>> instructions will be much easier; maybe that will become the best way to
>> prototype with new instructions.)
>>
>> One perhaps surprising aspect of the Vector API is that it factors the
>> complexity of the API by lambda-parameterizing families of methods; the
>> lambdas supply elementwise or reduction operations (as v.map(fn) or
>> v.reduce(fn) instead of v.negate(), v.sum(), etc.). This is inspired by
>> JDK 8 streams, of course. IMO, it is an excellent bit of usability and
>> future-proofing, but it has a cost: Instruction selection for the vector
>> operation requires that the lambda be "cracked" to examine the part of the
>> loop kernel it carries. The standard way to do this is to wait for the JIT
>> to
>> inline everything (and hope it doesn't get tired of inlining halfway
>> through). When this works, it is spectacularly successful. It can fail
>> silently. And it requires special powers and privileges to do the
>> cracking, which are not trivial to define (though .NET has done it
>> successfully).
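The shape of such a lambda-parameterized API can be sketched in plain Java. This is only an illustration of the `v.map(fn)` / `v.reduce(fn)` style described above; the class and method names here are invented and are not the actual strawman Vector.java signatures.

```java
import java.util.function.IntBinaryOperator;
import java.util.function.IntUnaryOperator;

// Illustrative-only sketch: elementwise map and lane reduce replace
// named methods like negate() and sum().
final class IntVectorSketch {
    private final int[] lanes;

    IntVectorSketch(int... lanes) { this.lanes = lanes.clone(); }

    // v.map(fn): apply fn to each lane, e.g. x -> -x instead of negate()
    IntVectorSketch map(IntUnaryOperator fn) {
        int[] r = new int[lanes.length];
        for (int i = 0; i < lanes.length; i++) r[i] = fn.applyAsInt(lanes[i]);
        return new IntVectorSketch(r);
    }

    // v.reduce(fn): fold the lanes, e.g. (x, y) -> x + y instead of sum()
    int reduce(IntBinaryOperator fn) {
        int acc = lanes[0];
        for (int i = 1; i < lanes.length; i++) acc = fn.applyAsInt(acc, lanes[i]);
        return acc;
    }

    int lane(int i) { return lanes[i]; }
}
```

The instruction-selection problem is visible right here: the lambda passed to `reduce` is an opaque object, so the runtime cannot choose a vector add without cracking it open.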
>>
>> I want to keep the lambdas, but Ian has pointed out that we need to start
>> by special-casing some of the methods (e.g., lane-wise or cross-lane
>> addition) for demonstration purposes. That's fine; we did things like that
>> with streams. To make the lambdas carry the weight, we want (IMO) two
>> things, one short-term and one long-term. In the long-term we want a
>> lambda cracker, something like our method-handle cracker. It should be
>> able to completely analyze at least simple lambdas like `(x,y)->(x+y)`,
>> which are really just wrapped `iadd` instructions. The JVMCI should also
>> provide a robust way to crack them all, but we have to be careful how we
>> use the high privileges (Unsafe-level) required to use the JVMCI.
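For context, the method-handle cracker mentioned above is the JDK 8 `MethodHandles.Lookup.revealDirect` API: given a direct method handle, it reports exactly which member the handle invokes. A lambda cracker would answer the analogous question for lambda objects. A minimal demonstration:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandleInfo;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class CrackDemo {
    // Build a direct handle to Integer.sum(int, int) and crack it,
    // recovering the symbolic reference behind the handle.
    static MethodHandleInfo crack() throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodHandle sum = lookup.findStatic(
            Integer.class, "sum",
            MethodType.methodType(int.class, int.class, int.class));
        return lookup.revealDirect(sum);
    }

    public static void main(String[] args) throws Exception {
        MethodHandleInfo info = crack();
        // prints java.lang.Integer.sum
        System.out.println(info.getDeclaringClass().getName() + "." + info.getName());
    }
}
```

Note that `revealDirect` already enforces an access check against the lookup object, which is the kind of privilege question a general lambda cracker would also have to settle.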
>>
>> In the short term, I propose creating a VectorOp interface (and/or
>> VectorOps class) which contains a large number of factories and/or static
>> finals which give alternate notations for the loop kernel components (like
>> `iadd`) required by the Vector API. It's an under-appreciated fact that
>> lambdas interoperate with interfaces, and a lambda-like value can implement
>> not only a primary interface, but additional secondary ones. In the case
>> of VectorOps, the IntOperator implementing scalar 32-bit `iadd` should also
>> implement whatever additional interfaces are necessary for the Vector API
>> to properly categorize it as part of a loop. That means one or both of the
>> following:
>> 1. A metadata API which voluntarily reports the operation, as in
>>    "I am really just iadd", or maybe "I am the following expression tree".
>> 2. An API lifted to vectors, either "I am willing to take 4 sets of inputs
>>    at once" or "I am willing to produce a method handle to work on
>>    right-sized vectors".
>> I've posted an example of what I mean by this [3].
>>
>> [3]: http://cr.openjdk.java.net/~jrose/arrays/vector/VectorOp.java
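The first option, a voluntary metadata handshake, can be sketched in a few lines. The interface and op names below are invented for illustration; see [3] for the real example:

```java
import java.util.function.IntBinaryOperator;

class VectorOps {
    /** Hypothetical secondary handshake interface: the operation
     *  voluntarily reports what it is ("I am really just iadd"). */
    public interface KnownOp {
        String opName();
    }

    /** Canonical 32-bit add: a plain IntBinaryOperator to callers, but
     *  crackable through the KnownOp view, no inlining luck required. */
    public static final IntBinaryOperator IADD = new Iadd();

    private static final class Iadd implements IntBinaryOperator, KnownOp {
        public int applyAsInt(int x, int y) { return x + y; }
        public String opName() { return "iadd"; }
    }

    /** What a Vector implementation might do: ask the lambda-like value
     *  first, and fall back to calling it lane-by-lane if it is opaque. */
    static String categorize(IntBinaryOperator fn) {
        if (fn instanceof KnownOp) {
            return ((KnownOp) fn).opName();   // can select a vector instruction
        }
        return "opaque";                      // scalar fallback path
    }
}
```

This shows the "lambda-like value implements secondary interfaces" point: `IADD` passes anywhere an `IntBinaryOperator` is expected, yet the Vector API can categorize it with an ordinary `instanceof` test.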
>>
>> Working with crackable lambdas might give us insights into helping JDK 8
>> streams also, in ways that do not require always-lucky inlining. Of course
>> we don't want to require users, in general, to use lambdas from a fixed
>> menu (like VectorOps) but working with the side handshakes will help us
>> understand what a general lambda cracker should look like, if it must help
>> vectorize loops.
>>
>> Now for some thoughts about the vectors themselves. Basically, I think
>> we should do whatever it takes to make the concrete representation (whether
>> hidden or not) of a Vector to be as simple as possible, even if this
>> requires complicated class structure. Each individual concrete class must,
>> of course, support full box-elision optimization so aggressive that it can
>> laugh at Lady Luck. (We'll get there in Valhalla, where boxing is a rare,
>> relatively explicit event, but currently we need to work harder to lift our
>> vectors out of Java's data model, which requires a heap node for every
>> value.)
>>
>> In the end, I think we will prefer to factor every property about a
>> vector, except for the actual payload bits, into an auxiliary object
>> (shared by all vectors of the same "species"), and use a class-per-species
>> implementation strategy. This will lead to MxN classes, lazily generated
>> under the covers; Valhalla parametric polymorphism will reduce one or both
>> factors.
>>
>> (Rationale: When they become Valhalla value types, they will be pretty
>> large, and it will be best if they fit cleanly into a vector register
>> without extra fields. Also, for now, class-based data constant-folds more
>> reliably than field-based constant data, the notorious trust-final-fields
>> problem. In any case, the JVM has lots of class inference, profiling, and
>> speculation logic; the more info we derive from the concrete class, the
>> more we leverage that mature technology. There's going to be an object
>> header anyway, so let's put it to work!)
>>
>> Perhaps fast type-specific paths can be mediated by boxed generic entry
>> points (which as a downside rely on always-lucky box elision). Or,
>> following Paul's VarHandles, they can use method handles for fast
>> type-specific entry points, in which case a species object contains a
>> bundle of preset method handles, and looks very much like a v-table. That
>> sounds better to me, since MHs require less luck to choose efficient
>> calling sequences. Valhalla parameter instances will support species more
>> directly (N species per class) but for now we probably want a 1-1
>> species/class ratio.
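The species-as-v-table idea can be made concrete with a small sketch. All names here are illustrative; the point is that every per-shape property, including the preset method handles, lives in a shared species object, while the vector itself carries only payload:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical species object: shared per-shape metadata plus a bundle
// of preset method handles, much like a v-table.
final class Species {
    final int laneCount;
    final MethodHandle addKernel;   // preset handle: (int[], int[]) -> int[]

    Species(int laneCount) {
        this.laneCount = laneCount;
        try {
            this.addKernel = MethodHandles.lookup().findStatic(
                Species.class, "addLanes",
                MethodType.methodType(int[].class, int[].class, int[].class));
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    static int[] addLanes(int[] a, int[] b) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) r[i] = a[i] + b[i];
        return r;
    }
}

// The vector carries only the payload and a reference to its species.
final class IntVec {
    final Species species;
    final int[] payload;

    IntVec(Species s, int[] payload) { this.species = s; this.payload = payload; }

    IntVec add(IntVec other) {
        try {
            // Dispatch through the species' preset handle, v-table style.
            int[] r = (int[]) species.addKernel.invokeExact(payload, other.payload);
            return new IntVec(species, r);
        } catch (Throwable t) {
            throw new AssertionError(t);
        }
    }
}
```

Because the handle is a constant reachable from the (constant) species, the JIT can fold the dispatch without relying on box-elision luck, which is the advantage claimed above for MH-based entry points.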
>>
>> The various concrete classes would then extend a small number of
>> intrinsically magic, maximally aligned, pure-data container types
>> (Long2/4/8), adding various parameterized interfaces with default methods,
>> plus (as absolutely necessary) mechanically specialized methods. (BTW,
>> some of those interfaces, like Vector itself, are public. Some, like the
>> lambda cracking APIs mentioned above, would be only for internal
>> handshaking—modules give us more options for hiding public names, thus
>> encapsulating interface views despite publicity of methods.)
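The layering described above, a pure-data container plus interfaces carrying generic code in default methods, looks roughly like this. The `Long2` stand-in here is an ordinary class, not the intrinsically magic, maximally aligned container the JVM would actually provide, and all names are invented:

```java
// Generic code lives in interface default methods, shared by all
// concrete vector classes (illustrative names only).
interface VectorView<E> {
    int length();
    E lane(int i);

    default String toText() {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < length(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(lane(i));
        }
        return sb.append("]").toString();
    }
}

// Pure-data payload container: stands in for a magic Long2.
class Long2 {
    final long lo, hi;
    Long2(long lo, long hi) { this.lo = lo; this.hi = hi; }
}

// Concrete class: extends the container, mixes in the generic interface.
final class LongVec2 extends Long2 implements VectorView<Long> {
    LongVec2(long lo, long hi) { super(lo, hi); }
    public int length() { return 2; }
    public Long lane(int i) { return i == 0 ? lo : hi; }
}
```

The concrete class adds almost nothing beyond accessors; everything reusable sits in the interface, which is the "nearly all generic" property the next paragraph asks for.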
>>
>> For maximum reuse and cleanliness, it should be the case that nearly all
>> of the code is either generic (some of it packaged in interface default
>> methods), or else is generated on the fly via indy instructions (in the
>> concrete classes) or via MH-based handshakes originating in the generic
>> code. Big extra bonus points if we don't resort to ASM, but that's likely
>> only in the Valhalla world, since concrete signatures may need to be
>> adjusted (which is a Valhalla super-power).
>>
>> Depending on how we use MHs, the vectors might act as if they are
>> coded in a dynamic language designed for expanding loop kernels. On a
>> Graal-based stack we might experiment with Truffle to do stuff like that.
>> For now we need to build on JDK 8 technology for this project, which means
>> closures and indy.
>>
>> — John
>
>
>
> --
> Sent from my phone
>
--
Sent from my phone
More information about the panama-dev mailing list