decoupling interpreter buffers from the JIT's value processing

Frederic Parain frederic.parain at oracle.com
Wed Jul 5 20:43:52 UTC 2017


John,

A few clarifications about the thread-local buffer used by the interpreter:
  - allocations in the thread-local buffer are not guaranteed to succeed;
    the fall-back is to allocate in the Java heap (see the sketch after
    this list)
  - currently, thread-local memory recycling is performed on frame
    deactivation and sometimes on back branches (based on counters), but not
    on failed allocation (this is something I might have to change)
  - values handled by the interpreter are not always buffered; if a value
    already exists as a standalone value in the Java heap (static field,
    non-flattened field or array), the interpreter uses it directly instead
    of creating a buffered copy.
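
To make the first two points concrete, here is a minimal sketch of the
allocate-with-fallback policy. All the names (ThreadLocalValueBuffer,
alloc_value_buffer, and so on) are hypothetical stand-ins, not the actual
HotSpot code:

  // Hypothetical sketch (not the real implementation): the interpreter
  // tries a bump-pointer allocation in the thread-local buffer first
  // and falls back to the Java heap when that fails.
  #include <cstddef>
  #include <cstdint>
  #include <new>

  struct ThreadLocalValueBuffer {
    uint8_t* base;      // start of the thread-local region
    size_t   used;      // bytes already handed out
    size_t   capacity;  // fixed size, so exhaustion is possible

    // Bump-pointer allocation; returns nullptr when the region is full.
    void* try_alloc(size_t size) {
      if (used + size > capacity) return nullptr;
      void* p = base + used;
      used += size;
      return p;
    }

    // Recycling happens on frame deactivation (and sometimes on back
    // branches), not on failed allocation -- hence the fall-back below.
    void reset() { used = 0; }
  };

  // Stand-in for a real Java heap allocation; may trigger a GC.
  static void* alloc_in_java_heap(size_t size) {
    return ::operator new(size);
  }

  void* alloc_value_buffer(ThreadLocalValueBuffer& tlvb, size_t size) {
    if (void* p = tlvb.try_alloc(size)) return p;  // fast path
    return alloc_in_java_heap(size);               // slow path
  }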

This current behavior could cause issues with the usage you describe
below, for instance using the thread-local buffer to avoid Java heap
allocations (and, consequently, the GC safepoints they can trigger).

Some modifications are possible to ease the JIT's work, but others, like
guaranteed allocation, don't look feasible. More input about the JIT's
requirements would be useful for tuning the thread-local buffering.

Regards,

Fred


> On Jul 5, 2017, at 15:31, John Rose <john.r.rose at oracle.com> wrote:
> 
> TL;DR I think we need non-blocking value buffers in the JIT.
> 
> This morning we discussed some details of removing buffers from optimized code.
> The problem is that the interpreter generates buffers for values (so it can treat them
> uniformly) but the JIT wants to disregard them in order to optimize fully.  To do this,
> the JIT has to (more or less routinely) unpack the actual values from the interpreter
> buffers.  This unpacking matters because value-numbering optimizations should
> apply to the values themselves, rather than to the addresses of the buffers
> that the interpreter assigns to carry them.  (One value can be carried by
> several buffers.)
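> 
> Here is a tiny illustration of that last point (all the names are invented
> for the example): a value-number key must be built from the components of a
> value, never from the address of the buffer that happens to carry it.
> 
>   // Two buffers can carry the same value, so value numbering must
>   // compare components, not buffer addresses.  (Hypothetical sketch.)
>   #include <cstdio>
> 
>   struct Pair { int fst, snd; };
> 
>   static bool same_value(const Pair* a, const Pair* b) {
>     return a->fst == b->fst && a->snd == b->snd;
>   }
> 
>   int main() {
>     Pair b1{1, 2};    // buffer #1
>     Pair b2{1, 2};    // buffer #2: distinct address, same value
>     std::printf("same buffer: %d\n", &b1 == &b2);           // prints 0
>     std::printf("same value:  %d\n", same_value(&b1, &b2)); // prints 1
>   }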
> 
> This manifests in the IR as parallel sets of IR nodes.  Some are buffer addresses,
> artifacts of the interpreter's decisions about buffering, and some nodes represent
> unbuffered values (plain values, values per se), which can be register allocated,
> spilled to the stack, passed as spread arguments, etc.  Let's call these two sets
> "buffered values" and "plain values".  They have different types in the IR, and
> allocate to different sorts of registers.  (A plain value allocates in general to a
> tuple of registers.)
> 
> If there is control flow in the IR, then the phi nodes which carry control flow
> dependencies of value types are cloned into separate phis, some for buffered
> values, and some for plain values.  The basic requirement for the optimizer
> is that the buffered value phis never get in the way of optimization.  Specifically,
> even if the interpreter constructs a new buffer each time through a hot loop,
> the JIT must be free to discard the phis which represent that buffer.
> 
> There are edge cases where buffers are still needed.  AFAIK, the only important
> edge cases are "exits" from the optimized code, either returns or deoptimizations.
> In those cases, an outgoing value will need to be put into a buffer so that the
> interpreter can receive it, either as a return value or as part of a JVM state
> loaded into an interpreter frame that is the product of a deoptimization event.
> 
> All this is almost exactly the same as the techniques we already use to
> scalarize non-escaping POJOs.  The POJOs are reconstructed as true
> heap objects in the case of (some) deoptimizations; this is implemented
> by a special kind of value specifier which guides the deopt. transition code
> to create the POJOs from scattered components, as JVM interpreter frames
> are populated.
> 
> A key difference between the existing EA framework and the new mechanisms
> is that our existing infrastructure is very conservative about tracking the identity
> of references.  This makes it hard to just throw away the phis that carry the
> object references, even if we have a full model of the object components in
> a parallel set of phis.
> 
> I think a good first move for adapting our existing IR framework to the task of
> buffer elimination is to aggressively "split" buffers (just as we currently "split"
> live ranges) at all points where it makes it easier for us to cut away IR nodes
> that only do buffer tracking.  This starts with exit points:  Value returns, and
> debug info about JVM states containing values.  But perhaps every phi that
> carries a buffer should also be split, carrying a virtual buffer-copying operation.
> This latter move would make the buffer-carrying phis go dead immediately,
> since each (virtual) buffer-copying operation would just pick up the value
> components from the corresponding value-tracking phis.
> 
> It might seem that a better long-term move for adapting our IR framework
> would be to refuse to translate interpreter buffering operations into JIT IR,
> and aggressively work *only* with components.  The advantage would be
> that buffering would appear in the IR only as an edge effect:  Components
> are loaded from buffers on entry to the graph and are dumped into buffers
> (new ones, or maybe caller-specified ones?) on exit from the graph; the
> debug info case would encode this "dumping" in symbolic form, not executable.
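> 
> To show the shape I mean, here is a behavioral sketch of that never-buffer
> strategy (names invented): components are loaded from a buffer on entry,
> the hot loop carries only the plain components (the surviving phis), and a
> buffer is materialized again only at the exit.
> 
>   // Hypothetical sketch of buffering only at the edges: unpack on
>   // entry, compute on plain components, rebuffer only at the exit.
>   #include <cstdio>
> 
>   struct Pair { int fst, snd; };
> 
>   static Pair* buffer_at_exit(Pair v) {
>     static thread_local Pair slot;  // stand-in for a thread-local buffer
>     slot = v;
>     return &slot;
>   }
> 
>   Pair* scale(const Pair* in, int n) {
>     int fst = in->fst, snd = in->snd;  // load components on entry
>     for (int i = 0; i < n; i++) {      // loop-carried state is plain:
>       fst += 1;                        // these are the component phis;
>       snd += 2;                        // no buffer phi is needed
>     }
>     return buffer_at_exit(Pair{fst, snd});  // dump into a buffer on exit
>   }
> 
>   int main() {
>     Pair p{10, 20};
>     Pair* out = scale(&p, 3);
>     std::printf("%d %d\n", out->fst, out->snd);  // prints 13 26
>   }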
> 
> This strategy for never buffering will fail in some cases, notably for partially
> optimized code (first-tier, profiling code) which works on values of multiple
> types.  (This doesn't appear in the MVT time frame, but will show up I think
> with enhanced generics.)  I think we want to keep open the option that the
> IR can work with either buffered values (machine addresses to managed
> blocks) or bundles of registers containing plain values.  Do we ever need
> both?  Maybe not, but it's hard to say.
> 
> I think it is reasonable to assume that we need something like what we
> have with the processing of compressed oops:  There is IR for both encoded
> (compressed uint32) and decoded (uncompressed, natural intptr_t) oops.
> I think our IR tends to work with the decoded versions, but there are special
> post-passes which note when a decoded version is not needed, because
> it is simply the decoding of an encoded oop which is subsequently re-encoded.
> (I.e., a graph of the form E(D(E(x))) is simplified to E(x), removing real instructions.)
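> 
> In sketch form (with invented node names), that post-pass is a one-pattern
> peephole; I have written it so that the buffering analogue suggested in the
> next paragraph falls out of the same code.
> 
>   // Hypothetical peephole: Encode(Decode(Encode(x))) ==> Encode(x),
>   // and likewise Buffer(Unbuffer(Buffer(v))) ==> Buffer(v).
>   #include <cstdio>
>   #include <string>
> 
>   struct Node {
>     std::string op;       // "E", "D", "B", "U", or a leaf name
>     Node* in = nullptr;
>   };
> 
>   Node* simplify(Node* n) {
>     bool outer = (n->op == "E" || n->op == "B");
>     const char* inverse = (n->op == "E") ? "D" : "U";
>     if (outer && n->in && n->in->op == inverse &&
>         n->in->in && n->in->in->op == n->op) {
>       return n->in->in;   // reuse the inner E(x) / B(v); drop two nodes
>     }
>     return n;
>   }
> 
>   int main() {
>     Node x{"x"}, e1{"E", &x}, d{"D", &e1}, e2{"E", &d};
>     Node* s = simplify(&e2);
>     std::printf("%s(%s)\n", s->op.c_str(), s->in->op.c_str());  // E(x)
>   }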
> 
> Perhaps parallel logic can convert B(U(B(v))) to B(v): a value B(v) that is
> never unbuffered (U(B(v))) except to rebuffer it need never be unbuffered in
> the first place.
> But all other values can be routinely unbuffered, at least as soon as we have
> a static type.  In the future we may see code which *mandates* a mix of buffered
> and unbuffered values:
> 
>  int f(Q-Object x) { if (x instanceof IntPair) { return ((Q-IntPair)x).fst; } else return 0; }
> 
> As soon as you know that x is a Q-IntPair you can (and should) treat it as a
> plain value.  But before that point it must stay buffered.
> 
> A final point:  I think it is reasonable to un-pin *all* buffer allocation sites.
> Object allocation sites must execute in a semi-predictable order, to preserve
> identity semantics that depend partially on some aspects of ordering.
> (Specifically, if you allocate twice, you can't reorder that by merging the
> two allocations, even if they contain the same components.)  I think this is
> the main reason we have to pin object allocations.  The case of buffer
> allocations is completely different.  The only reason we have to
> pin a buffer allocation is that it might trigger a GC, which might in turn
> trigger a deoptimization of the blocked frame (stuff happens), which in
> turn requires a complete JVM state for that frame, which in turn requires
> control flow.  (C2 can't float allocations like Graal does.)  What a pain!
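> 
> The identity point is easy to see in miniature (a C++ stand-in, but the
> Java story is the same):
> 
>   // Why identity pins object allocations: two allocations with equal
>   // contents must stay distinguishable, so they cannot be merged.
>   #include <cstdio>
> 
>   struct Pojo { int x; };
> 
>   int main() {
>     Pojo* a = new Pojo{42};
>     Pojo* b = new Pojo{42};   // same components, fresh identity
>     std::printf("a == b: %d\n", a == b);  // prints 0; a merged pair
>                                           // of allocations would print 1
>     delete a;
>     delete b;
>   }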
> 
> There is a way to ease the burden on buffers:  Define an IR node that
> creates a new buffer out of "thin air", without a control flow edge.
> Use this node at exit points and/or to aggressively break up phis
> of buffered values.  There is a cost to this:  The node is not allowed
> to block, and certainly not to GC.  This is implementable, I think, by
> using thread-local storage (on the C heap or stack) and maybe also
> by stealing from the GC's TLAB (when it works).  Since we have
> thread-local heaps for values already, we can just allocate these
> JIT-created buffers on the thread-local heap.  As a corner case,
> some pathological value types are marked as being "only in the heap".
> We would have to add yet another corner case, allowing the JIT
> to allocate these guys on the thread-local heap, just for JIT-created
> buffers.
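> 
> Concretely, the contract I have in mind looks something like this sketch
> (all names invented): try the thread-local region first, fall back to the
> C heap, and never touch the Java heap, so no GC can be triggered.
> 
>   // Hypothetical non-blocking buffer allocation for the JIT: no Java
>   // heap, no safepoint, no GC -- so the node needs no control edge.
>   #include <cstddef>
>   #include <cstdint>
>   #include <cstdlib>
> 
>   struct ThreadLocalRegion {
>     uint8_t* base;       // start of the thread-local value heap
>     size_t   used;       // bytes already handed out
>     size_t   capacity;
>   };
> 
>   void* alloc_buffer_out_of_thin_air(ThreadLocalRegion& r, size_t size) {
>     if (r.used + size <= r.capacity) {  // fast path: bump allocation
>       void* p = r.base + r.used;
>       r.used += size;
>       return p;
>     }
>     return std::malloc(size);  // slow path: C heap, still GC-free
>   }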
> 
> So, in brief:  Let's look at giving the JIT a non-blocking hook for
> allocating a thread-local buffer.  And teach the JIT to move these
> buffering operations around to the exits only.  This will be a win
> if the JIT-created buffers are more flexible for the JIT to use than
> the buffers handed over by the interpreter, or than pure heap-based
> buffers.  The runtime needs to accept these JIT-allocated buffers
> as return values and as state values after deoptimization.
> 
> Anyway, that's the way things look to me, from a certain distance.
> Comments?  Corrections?
> 
> — John
> 


