decoupling interpreter buffers from the JIT's value processing

Wed Jul 5 19:31:00 UTC 2017

TL;DR I think we need non-blocking value buffers in the JIT.

This morning we discussed some details of removing buffers from optimized code.
The problem is that the interpreter generates buffers for values (so it can treat them
uniformly) but the JIT wants to disregard them in order to optimize fully.  To do this,
the JIT has to (more or less routinely) unpack the actual values from the interpreter
buffers.  It's important to do this in order to do value numbering optimizations on
the values themselves, rather than on the addresses of the buffers that the interpreter
assigns to carry the values.  (One value can be carried by several buffers.)

This manifests in the IR as parallel sets of IR nodes.  Some are buffer addresses,
artifacts of the interpreter's decisions about buffering, and some nodes represent
unbuffered values (plain values, values per se), which can be register allocated,
spilled to the stack, passed as spread arguments, etc.  Let's call these two sets
"buffered values" and "plain values".  They have different types in the IR, and
allocate to different sorts of registers.  (A plain value allocates in general to a
tuple of registers.)

If there is control flow in the IR, then the phi nodes which carry control flow
dependencies of value types are cloned into separate phis, some for buffered
values, and some for plain values.  The basic requirement for the optimizer
is that the buffered value phis never get in the way of optimization.  Specifically,
even if the interpreter constructs a new buffer each time through a hot loop,
the JIT must be free to discard the phis which represent that buffer.

There are edge cases where buffers are still needed.  AFAIK, the only important
edge cases are "exits" from the optimized code, either returns or deoptimizations.
In those cases, an outgoing value will need to be put into a buffer so that the
interpreter can receive it, either as a return value or as part of a JVM state
loaded into an interpreter frame that is the product of a deoptimization event.

All this is almost exactly the same as the techniques we already use to
scalarize non-escaping POJOs.  The POJOs are reconstructed as true
heap objects in the case of (some) deoptimizations; this is implemented
by a special kind of value specifier which guides the deopt. transition code
to create the POJOs from scattered components, as JVM interpreter frames
are populated.
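
To make the "value specifier" idea concrete, here is a minimal sketch in Java.
It is a toy model, not the real HotSpot debug-info classes: the names
ScalarizedObjectSpec and DeoptRematerializer are hypothetical, the compiled frame
is modeled as a plain long[] of slots, and the rebuilt object is modeled as a map
from field name to value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a value specifier: it records, per field, which slot
// of the compiled frame currently holds that field's value.
final class ScalarizedObjectSpec {
    final String typeName;
    final Map<String, Integer> fieldSlots = new LinkedHashMap<>();

    ScalarizedObjectSpec(String typeName) {
        this.typeName = typeName;
    }

    // Declares that field 'name' lives in compiled-frame slot 'slot'.
    ScalarizedObjectSpec field(String name, int slot) {
        fieldSlots.put(name, slot);
        return this;
    }
}

final class DeoptRematerializer {
    // Gathers the scattered components named by the spec into one object
    // (modeled here as a field-name-to-value map) for the interpreter frame.
    static Map<String, Long> rematerialize(ScalarizedObjectSpec spec, long[] frameSlots) {
        Map<String, Long> object = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : spec.fieldSlots.entrySet()) {
            object.put(e.getKey(), frameSlots[e.getValue()]);
        }
        return object;
    }
}
```

The point of the sketch is only that the spec is symbolic: nothing is allocated
in the optimized code, and the gathering step runs only if a deoptimization
actually happens.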

A key difference between the existing EA framework and the new mechanisms
is that our existing infrastructure is very conservative about tracking the identity
of references.  This makes it hard to just throw away the phis that carry the
object references, even if we have a full model of the object components in
a parallel set of phis.

I think a good first move for adapting our existing IR framework to the task of
buffer elimination is to aggressively "split" buffers (just as we currently "split"
live ranges) at all points where it makes it easier for us to cut away IR nodes
that only do buffer tracking.  This starts with exit points:  Value returns, and
debug info about JVM states containing values.  But perhaps every phi that
carries a buffer should also be split, carrying a virtual buffer-copying operation.
This latter move would make buffer-carrying phis go dead immediately,
since each (virtual) buffer-copying operation would just pick up the value
components from the corresponding value-tracking phis.
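
The phi-splitting move can be sketched on a toy graph.  This is an illustration
under loud assumptions: IrNode and PhiSplitSketch are hypothetical names, the
"graph" is just a list of nodes with input edges, and "VirtualBufferCopy" stands
in for whatever virtual buffer-copying node we would actually define.  None of
this is C2's real IR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical miniature IR node: a name plus a list of input edges.
class IrNode {
    final String name;
    final List<IrNode> inputs;

    IrNode(String name, IrNode... inputs) {
        this.name = name;
        this.inputs = new ArrayList<>(Arrays.asList(inputs));
    }
}

class PhiSplitSketch {
    // Counts the input edges in 'graph' that point at 'target'.
    static int uses(List<IrNode> graph, IrNode target) {
        int count = 0;
        for (IrNode user : graph) {
            for (IrNode in : user.inputs) {
                if (in == target) count++;
            }
        }
        return count;
    }

    // Redirects every use of bufferPhi to a virtual copy that is fed only by
    // the parallel component phis.  Returns the copy, which joins the graph.
    static IrNode splitBufferPhi(List<IrNode> graph, IrNode bufferPhi, IrNode... componentPhis) {
        IrNode copy = new IrNode("VirtualBufferCopy", componentPhis);
        for (IrNode user : graph) {
            for (int i = 0; i < user.inputs.size(); i++) {
                if (user.inputs.get(i) == bufferPhi) user.inputs.set(i, copy);
            }
        }
        graph.add(copy);
        return copy;
    }

    // Builds a small loop-shaped graph, splits its buffer phi, and reports
    // how many uses of the buffer phi remain.
    static int demoUsesAfterSplit() {
        IrNode bufIn = new IrNode("BufferFromInterpreter");
        IrNode bufLoop = new IrNode("BufferAllocatedInLoop");
        IrNode fstPhi = new IrNode("Phi:fst");
        IrNode sndPhi = new IrNode("Phi:snd");
        IrNode bufferPhi = new IrNode("Phi:buffer", bufIn, bufLoop);
        IrNode ret = new IrNode("Return", bufferPhi);
        List<IrNode> graph = new ArrayList<>(
            Arrays.asList(bufIn, bufLoop, fstPhi, sndPhi, bufferPhi, ret));
        splitBufferPhi(graph, bufferPhi, fstPhi, sndPhi);
        return uses(graph, bufferPhi);
    }

    public static void main(String[] args) {
        System.out.println("uses of buffer phi after split: " + demoUsesAfterSplit());
    }
}
```

Once the buffer phi has no remaining uses, ordinary dead-code elimination can
drop it along with the buffer nodes that fed it.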

It might seem that a better long-term move for adapting our IR framework
would be to refuse to translate interpreter buffering operations into JIT IR,
and aggressively work *only* with components.  The advantage would be
that buffering would appear in the IR only as an edge effect:  Components
are loaded from buffers on entry to the graph and are dumped into buffers
(new ones, or maybe caller-specified ones?) on exit from the graph; the
debug info case would encode this "dumping" in symbolic form, not executable.

This strategy for never buffering will fail in some cases, notably for partially
optimized code (first-tier, profiling code) which works on values of multiple
types.  (This doesn't appear in the MVT time frame, but will show up I think
with enhanced generics.)  I think we want to keep open the option that the
IR can work with either buffered values (machine addresses to managed
blocks) or bundles of registers containing plain values.  Do we ever need
both?  Maybe not, but it's hard to say.

I think it is reasonable to assume that we need something like what we
have with the processing of compressed oops:  There is IR for both encoded
(compressed uint32) and decoded (uncompressed, natural intptr_t) oops.
I think our IR tends to work with the decoded versions, but there are special
post-passes which note when a decoded version is not needed, because
it is simply the decoding of an encoded oop which is subsequently re-encoded.
(I.e., a graph of the form E(D(E(x))) is simplified to E(x), removing real instructions.)
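
As a rough illustration of that kind of post-pass, here is a toy peephole in
Java.  The classes Expr, Leaf, Encode, Decode and EncodeDecodePeephole are
hypothetical and are not C2's node classes; the sketch also assumes that the
input of a Decode is always an encoded value, so that E(D(y)) can be replaced
by y.

```java
// Hypothetical miniature expression IR for encode/decode conversions.
abstract class Expr { }

final class Leaf extends Expr {
    final String name;
    Leaf(String name) { this.name = name; }
    @Override public String toString() { return name; }
}

final class Encode extends Expr {
    final Expr in;
    Encode(Expr in) { this.in = in; }
    @Override public String toString() { return "E(" + in + ")"; }
}

final class Decode extends Expr {
    final Expr in;
    Decode(Expr in) { this.in = in; }
    @Override public String toString() { return "D(" + in + ")"; }
}

final class EncodeDecodePeephole {
    // Bottom-up rewrite: wherever an Encode sits directly on a Decode,
    // both conversions cancel and the Decode's (already encoded) input remains.
    static Expr simplify(Expr e) {
        if (e instanceof Encode) {
            Expr in = simplify(((Encode) e).in);
            if (in instanceof Decode) {
                return ((Decode) in).in;
            }
            return new Encode(in);
        }
        if (e instanceof Decode) {
            return new Decode(simplify(((Decode) e).in));
        }
        return e; // leaves are already in simplest form
    }

    public static void main(String[] args) {
        Expr graph = new Encode(new Decode(new Encode(new Leaf("x"))));
        System.out.println(graph + " => " + simplify(graph));
    }
}
```

A decode that feeds anything other than a re-encode is left in place, which
matches the idea that the decoded form is the default and the cancellation is
an after-the-fact cleanup.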

Perhaps parallel logic can simplify B(U(B(v))) to B(v), so that a value B(v) which is
unbuffered (U(B(v))) only in order to rebuffer it is never unbuffered in the first place.
But all other values can be routinely unbuffered, at least as soon as we have
a static type.  In the future we may see code which *mandates* a mix of buffered
and unbuffered values:

  int f(Q-Object x) { if (x instanceof IntPair) { return ((Q-IntPair)x).fst; } else return 0; }

As soon as you know that x is a Q-IntPair you can (and should) treat it as a
plain value.  But before that point it must stay buffered.

A final point:  I think it is reasonable to un-pin *all* buffer allocation sites.
Object allocation sites must execute in a semi-predictable order, to preserve
identity semantics that depend partially on some aspects of ordering.
(Specifically, if you allocate twice, you can't reorder that by merging the
allocations even if they contain the same components.)  I think this is
the main reason we have to pin object allocations.  Buffer allocations, however,
are a completely different case.  The only reason we have to
pin a buffer allocation is that it might trigger a GC, which might in turn
trigger a deoptimization of the blocked frame (stuff happens), which in
turn requires a complete JVM state for that frame, which in turn requires
control flow.  (C2 can't float allocations like Graal does.)  What a pain!

There is a way to ease the burden on buffers:  Define an IR node that
creates a new buffer out of "thin air", without a control flow edge.
Use this node at exit points and/or to aggressively break up phis
of buffered values.  There is a cost to this:  The node is not allowed
to block, and certainly not to GC.  This is implementable I think, by
using thread-local storage (on the C heap or stack) and maybe also
by stealing from the GC's TLAB (when it works).  Since we have
thread-local heaps for values already, we can just allocate these
JIT-created buffers on the thread-local heap.  As a corner case,
some pathological value types are marked as being "only in the heap".
We would have to add yet another corner case, of allowing the JIT
to allocate these guys on the thread local heap, just for JIT-created
buffers.
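
Here is a minimal sketch of what a non-blocking buffer allocator could look
like, written in Java purely as a model.  BufferArena and its methods are
hypothetical names; a real implementation would live in the VM and use native
thread-local storage, not a Java ByteBuffer.  The sketch assumes one arena per
thread, so no locking is needed, and it reports failure with -1 rather than
blocking or triggering a GC.

```java
import java.nio.ByteBuffer;

// Hypothetical model of a per-thread bump allocator for JIT-created buffers.
final class BufferArena {
    private final ByteBuffer storage; // preallocated up front; never grows
    private int top = 0;              // next free byte

    BufferArena(int capacityBytes) {
        this.storage = ByteBuffer.allocate(capacityBytes);
    }

    // Returns the start offset of a fresh buffer of 'size' bytes, or -1 if it
    // does not fit.  Only a bounds check and a pointer bump: no locks, no waits.
    int tryAllocate(int size) {
        if (size < 0 || size > storage.capacity() - top) {
            return -1; // the caller must take a slow path (an assumption here)
        }
        int start = top;
        top += size;
        return start;
    }

    // Stores one int component of a value into a previously allocated buffer.
    void putInt(int buffer, int fieldOffset, int value) {
        storage.putInt(buffer + fieldOffset, value);
    }

    // Loads one int component back out of a buffer.
    int getInt(int buffer, int fieldOffset) {
        return storage.getInt(buffer + fieldOffset);
    }

    // Releases every buffer at once, e.g. when the owning frame exits.
    void reset() {
        top = 0;
    }
}
```

What happens on the -1 path is left open here; the sketch only shows that the
common case can be a few instructions with no safepoint.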

So, in brief:  Let's look at giving the JIT a non-blocking hook for
allocating a thread-local buffer.  And teach the JIT to move these
buffering operations around to the exits only.  This will be a win
if the JIT-created buffers are easier for the JIT to use than both
the buffers given by the interpreter and pure heap-based
buffers.  The runtime needs to accept these JIT-allocated buffers
as return values and state values after deoptimization.

Anyway, that's the way things look to me, from a certain distance.
Comments?  Corrections?

— John