decoupling interpreter buffers from the JIT's value processing

Vladimir Ivanov vladimir.x.ivanov at
Thu Jul 6 11:52:03 UTC 2017

I hope Roland & Tobias will chime in, but I'll try to comment based on 
my limited understanding and the brief explorations of MVT I've done so far.

Short summary: I believe that a special representation of buffered values 
in C2 IR plus aggressive "rebuffering" during parsing should be enough to 
provide reliable elimination of temporary buffer allocations for MVT purposes.

First, I'd separate non-blocking allocation of value buffers from the 
representation of buffer allocations in JIT IR. I believe that a proper 
representation should be enough to solve the task of temporary-allocation 
elimination. The absence of non-blocking buffer allocation complicates the 
task, but doesn't block the solution.

IMO the missing piece right now is a special representation for buffered 
values of ValueTypePtr type in C2 IR.

Taking into account that it's hard to add allocations on demand (since 
a JVM state is required) and the multi-step nature of the inlining process 
(late inlining of method handles, see Parse::do_call() [1]), C2 does 
conservative buffer allocations during parsing (see ). Unfortunately, the 
current representation doesn't differ from that of ordinary object 
allocations (see ValueTypeNode::allocate() [2]). So, in the end, the 
problem of buffer elimination quickly (& very early) turns into 
elimination of ordinary object allocations. Though the information that 
it's a VT buffer isn't completely lost (the type is still there), (1) it's 
hard to use, and (2) buffers can leak (e.g., into debug info), which 
complicates EA.

There's one crucial simplifying assumption we can make about __Value 
usages in LFs: lambda forms don't merge VT pointers of different types. 
So, it's either a merge of opaque value buffer pointers (PHI __Value* ... 
__Value* #__Value*) (when compiling stand-alone lambda forms) or a merge 
of concrete VTs of the same type (PHI VT1* ... VT1* #VT1*) (when inlining 
through method handle chains).

In other words, merges of 2 concrete VTs of different types (PHI VT1* 
... VT2* #__Value*) don't happen in practice.
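To make the assumption concrete, here is a toy model (plain Python; every 
name is invented for illustration and corresponds to nothing in actual C2 
code) of the type join a phi would compute over buffered value pointers:

```python
# Toy model of the phi type join for buffered value pointers.
# VALUE and phi_type are invented names, not real C2 entities.

VALUE = "__Value*"  # opaque buffered-value pointer type

def phi_type(operand_types):
    """Join the operand types of a phi over buffered value pointers."""
    distinct = set(operand_types)
    if len(distinct) == 1:
        return distinct.pop()   # all the same: the concrete type survives
    return VALUE                # mixed types widen to the opaque __Value*

# The two merges that actually occur in lambda forms:
assert phi_type(["VT1*", "VT1*"]) == "VT1*"   # inlined MH chain
assert phi_type([VALUE, VALUE]) == VALUE      # stand-alone lambda form

# The merge that does NOT happen in practice -- it would lose the
# concrete type and defeat the optimizations sketched below:
assert phi_type(["VT1*", "VT2*"]) == VALUE
```

The point of the assumption is that the third case never arises, so a phi 
over buffered values always carries either a single concrete VT type or 
the opaque __Value* type.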

What could the actual representation look like?

For example, ValueTypeNode::allocate() can wrap the allocated buffer in a 
new kind of node (ValueTypePtrNode):

   VTP == ValueTypePtrNode # in: (buffer, f1, ..., fn), type: VT*

It mimics ValueTypeNode, but has a different type:

   VT  == ValueTypeNode # in: (buffer, f1, ..., fn), type: VT

The new representation should significantly simplify transformations on 
buffers and, given the simplifying assumption, should match the level of 
optimization we get for concrete VT types:

   * load a field
     Load fi (VTP ... vfi ... #VT1*) => vfi

   * merging VTs through phi
     PHI (VTP1 ... VTPi) #VT1* =>
     VTP (PHI (buf1 ... bufi) ... PHI (vfn1 ... vfni)) #VT1*

   * rematerialization at safepoints
     SafePoint ( ... VTP (_ vf1 ... vfi) #VT1* ...) =>
     SafePoint ( ... SafePointScalarObjectNode (VT1 vf1 ... vfi) ... )

   * rebuffering
     VTP (old_buf vfn1 ... vfni) #VT1* => VTP (new_buf vfn1 ... vfni)
         new_buf = CheckCastPP #VT1* alloc
         alloc   = Allocate (VT1)
         init    = Initialize (alloc vfn1...vfni)

   * redundant buffer allocation elimination
     (VTP buf ... => VTP TOP ...) + subsequent EA pass
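As a sanity check of the rules above, here is a minimal executable sketch 
(plain Python; all class and function names are invented for illustration 
and do not mirror real C2 node classes) of the field-load, phi-splitting, 
and buffer-elimination transformations on a toy VTP node:

```python
# Toy IR sketch of the VTP transformations above. All names are
# invented; nothing here corresponds to actual HotSpot code.

class VTP:
    """ValueTypePtrNode stand-in: in = (buffer, f1, ..., fn), type = VT*."""
    def __init__(self, buffer, fields, vt_type):
        self.buffer = buffer      # buffer address (or "TOP" once eliminated)
        self.fields = fields      # list of field value nodes
        self.vt_type = vt_type    # e.g. "VT1*"

class Phi:
    """Phi over parallel inputs."""
    def __init__(self, inputs, phi_type):
        self.inputs = inputs
        self.phi_type = phi_type

def load_field(vtp, i):
    # Load fi (VTP ... vfi ... #VT1*) => vfi
    return vtp.fields[i]

def split_phi(phi):
    # PHI (VTP1 ... VTPi) #VT1* =>
    # VTP (PHI (buf1 ... bufi) ... PHI (vf1 ... vfi)) #VT1*
    vtps = phi.inputs
    buf_phi = Phi([v.buffer for v in vtps], phi.phi_type)
    field_phis = [Phi([v.fields[i] for v in vtps], None)
                  for i in range(len(vtps[0].fields))]
    return VTP(buf_phi, field_phis, phi.phi_type)

def drop_buffer(vtp):
    # Redundant buffer allocation elimination: VTP buf ... => VTP TOP ...
    # (a subsequent EA pass would then remove the dead allocation)
    return VTP("TOP", vtp.fields, vtp.vt_type)

# A merge of two same-typed VTPs: splitting the phi exposes per-field
# phis, so a later field load never touches the buffer phi at all.
a = VTP("buf_a", ["x0", "y0"], "VT1*")
b = VTP("buf_b", ["x1", "y1"], "VT1*")
merged = split_phi(Phi([a, b], "VT1*"))

assert load_field(merged, 0).inputs == ["x0", "x1"]  # field phi, no buffer
assert drop_buffer(merged).buffer == "TOP"
```

Once every load goes through the field phis, the buffer input's only 
remaining uses are safepoints and exits, which is exactly where the 
rematerialization and rebuffering rules above kick in.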

With aggressive "rebuffering" (or buffer "splitting", as you called it), 
I believe we can get decent & reliable results optimizing away redundant 
buffer allocations, even without non-blocking buffer allocation and the 
ability to move allocations around.

Best regards,
Vladimir Ivanov



On 7/5/17 10:31 PM, John Rose wrote:
> TL;DR I think we need non-blocking value buffers in the JIT.
> This morning we discussed some details of removing buffers from optimized code.
> The problem is that the interpreter generates buffers for values (so it can treat them
> uniformly) but the JIT wants to disregard them in order to optimize fully.  To do this,
> the JIT has to (more or less routinely) unpack the actual values from the interpreter
> buffers.  It's important to do this in order to do value numbering optimizations on
> the values themselves, rather than on the addresses of the buffers that the interpreter
> assigns to carry the values.  (One value can be carried by several buffers.)
> This manifests in the IR as parallel sets of IR nodes.  Some are buffer addresses,
> artifacts of the interpreter's decisions about buffering, and some nodes represent
> unbuffered values (plain values, values per se), which can be register allocated,
> spilled to the stack, passed as spread arguments, etc.  Let's call these two sets
> "buffered values" and "plain values".  They have different types in the IR, and
> allocate to different sorts of registers.  (A plain value allocates in general to a
> tuple of registers.)
> If there is control flow in the IR, then the phi nodes which carry control flow
> dependencies of value types are cloned into separate phis, some for buffered
> values, and some for plain values.   The basic requirement for the optimizer
> is that the buffered value phis never get in the way of optimization.  Specifically,
> even if the interpreter constructs a new buffer each time through a hot loop,
> the JIT must be free to discard the phis which represent that buffer.
> There are edge cases where buffers are still needed.  AFAIK, the only important
> edge cases are "exits" from the optimized code, either returns or deoptimizations.
> In those cases, an outgoing value will need to be put into a buffer so that the
> interpreter can receive it, either as a return value or as part of a JVM state
> loaded into an interpreter frame that is the product of a deoptimization event.
> All this is almost exactly the same as the techniques we already use to
> scalarize non-escaping POJOs.  The POJOs are reconstructed as true
> heap objects in the case of (some) deoptimizations; this is implemented
> by a special kind of value specifier which guides the deopt. transition code
> to create the POJOs from scattered components, as JVM interpreter frames
> are populated.
> A key difference between the existing EA framework and the new mechanisms
> is that our existing infrastructure is very conservative about tracking the identity
> of references.  This makes it hard to just throw away the phis that carry the
> object references, even if we have a full model of the object components in
> a parallel set of phis.
> I think a good first move for adapting our existing IR framework to the task of
> buffer elimination is to aggressively "split" buffers (just as we currently "split"
> live ranges) at all points where it makes it easier for us to cut away IR nodes
> that only do buffer tracking.  This starts with exit points:  Value returns, and
> debug info about JVM states containing values.  But perhaps every phi that
> carries a buffer should also be split, carrying a virtual buffer-copying operation.
> This latter move would make buffer-carrying phis go dead immediately,
> since each (virtual) buffer-copying operation would just pick up the value
> components from the corresponding value-tracking phis.
> It might seem that a better long-term move for adapting our IR framework
> would be to refuse to translate interpreter buffering operations into JIT IR,
> and aggressively work *only* with components.  The advantage would be
> that buffering would appear in the IR only as an edge effect:  Components
> are loaded from buffers on entry to the graph and are dumped into buffers
> (new ones, or maybe caller-specified ones?) on exit from the graph; the
> debug info case would encode this "dumping" in symbolic form, not executable.
> This strategy for never buffering will fail in some cases, notably for partially
> optimized code (first-tier, profiling code) which works on values of multiple
> types.  (This doesn't appear in the MVT time frame, but will show up I think
> with enhanced generics.)  I think we want to keep open the option that the
> IR can work with either buffered values (machine addresses to managed
> blocks) or bundles of registers containing plain values.  Do we ever need
> both?  Maybe not, but it's hard to say.
> I think it is reasonable to assume that we need something like what we
> have with the processing of compressed oops:  There is IR for both encoded
> (compressed uint32) and decoded (uncompressed, natural intptr_t) oops.
> I think our IR tends to work with the decoded versions, but there are special
> post-passes which note when a decoded version is not needed, because
> it is simply the decoding of an encoded oop which is subsequently re-encoded.
> (I.e., a graph which is E(D(E(x))) is simplified to E(x), removing real instructions.)
> Perhaps parallel logic can simplify B(U(B(v))) to B(v): a value B(v) that is never
> unbuffered (U(B(v))) except to rebuffer it need not be unbuffered in the first place.
> But all other values can be routinely unbuffered, at least as soon as we have
> a static type.  In the future we may see code which *mandates* a mix of buffered
> and unbuffered values:
>    int f(Q-Object x) { if (x instanceof IntPair) { return ((Q-IntPair)x).fst; } else return 0; }
> As soon as you know that x is a Q-IntPair you can (and should) treat it as a
> plain value.  But before that point it must stay buffered.
> A final point:  I think it is reasonable to un-pin *all* buffer allocation sites.
> Object allocation sites must execute in a semi-predictable order, to preserve
> identity semantics that depend partially on some aspects of ordering.
> (Specifically, if you allocate twice, you can't reorder that by merging the
> allocations even if they contain the same components.)  I think this is
> the main reason we have to pin object allocations.  In any case, the situation
> with buffer allocations is completely different.  The only reason we have to
> pin a buffer allocation is that it might trigger a GC, which might in turn
> trigger a deoptimization of the blocked frame (stuff happens), which in
> turn requires a complete JVM state for that frame, which in turn requires
> control flow.  (C2 can't float allocations like Graal does.)  What a pain!
> There is a way to ease the burden on buffers:  Define an IR node that
> creates a new buffer out of "thin air", without a control flow edge.
> Use this node at exit points and/or to aggressively break up phis
> of buffered values.  There is a cost to this:  The node is not allowed
> to block, and certainly not to GC.  This is implementable I think, by
> using thread-local storage (on the C heap or stack) and maybe also
> by stealing from the GC's TLAB (when it works).  Since we have
> thread-local heaps for values already, we can just allocate these
> JIT-created buffers on the thread-local heap.  As a corner case,
> some pathological value types are marked as being "only in the heap".
> We would have to add yet another corner case, of allowing the JIT
> to allocate these guys on the thread local heap, just for JIT-created
> buffers.
> So, in brief:  Let's look at giving the JIT a non-blocking hook for
> allocating a thread-local buffer.  And teach the JIT to move these
> buffering operations around to the exits only.  This will be a win
> if the JIT-created buffers are more flexible to use by the JIT than
> the buffers given by the interpreter, and than pure heap-based
> buffers.  The runtime needs to accept these JIT-allocated buffers
> as return values and state values after deoptimization.
> Anyway, that's the way things look to me, from a certain distance.
> Comments?  Corrections?
> — John
