RFR 8156073 : 2-slot LiveStackFrame locals (long and double) are incorrect
John Rose
john.r.rose at oracle.com
Wed Aug 24 17:07:25 UTC 2016
On Aug 22, 2016, at 9:30 PM, Mandy Chung <mandy.chung at oracle.com> wrote:
>
> We need to follow up this issue to understand what the interpreter and compiler do for this unused slot and whether it’s always zero out.
These slot pairs are a curse, in the same league as endian-ness.
Suppose a 64-bit long x lives in L[0] and L[1]. Now suppose
that the interpreter (as well it might) has adjacent 32-bit words
for those locals. There are four reasonable conventions for
apportioning the bits of x into L[0:1]. Call HI(x) the arithmetically
high part of x, and LO(x) the other part. Also, call FST(x) the
lower-addressed 32-bit component of x, when stored in memory,
and SND(x) the other part. Depending on your machine's
endian-ness, HI=FST (big-endian, e.g. SPARC) or HI=SND
(little-endian, x86).
For portable code there are obviously four ways to pack L[0:1].
I've personally seen them all, sometimes as hard-to-find VM bugs.
We're just getting started, though. Now let the interpreter generously
allocate 64 bits to each local. The above four cases are still possible,
but now we have 4 32-bit storage units to play with. That makes
(if you do the math) 4x3=12 more theoretically possible ways to
store the bits of x into the 128 bits of L[0:1]. I've not seen all 12,
but there are several variations that HotSpot has used over time.
Confused yet? There's more: All current HotSpot implementations
grow the stack downward, which means that the address of L[0]
is *higher* than that of L[1]. This means that the pair of storage units
for L[0:1] can be viewed as a memory buffer, but the bits of L[1]
come at a lower address. (Once we had a tagged-stack interpreter
in which there were extra tag words between the words of L[0]
and L[1], for extra fun. We got tired of that.)
There's one more annoyance: The memory block located at L[0:1]
must be at least 64 bits wide, but it need not be 64-bit aligned,
if the size of a local slot is 32 bits. So on machines that cannot
perform unaligned 64-bit access, the interpreter needs to load
and store 64-bit values as 32-bit halves. But we can put that
aside for now; that's a separable cost borne by 32-bit RISCs.
How do we simplify this? For one thing, view all references
to HI and LO with extreme suspicion. That goes for misleadingly
simple terms like "the low half of x". On Intel everybody
knows that's also FST (the first memory word of x), and
nods in agreement, and then when you port to SPARC
(that was my job) the nods turn into glassy-eyed stares.
Next, don't trust L[0] and L[1] to work like array elements.
Although the bytecode interpreter refers directly to L[0]
and indirectly to L[1], when storing 'x', realize that you
don't know exactly how those guys are laid out in memory.
The interpreter will make some local decision to avoid
the obvious-in-retrospect bug of storing 64 bits to L[0]
on a 32-bit machine. The decision might be to form the
address of L[1] and treat *that* as the base address of
a memory block. The more subtle and principled thing
to do would be to form the address of the *end* of L[0]
and treat that as the *end* address of a memory block.
The two approaches are equivalent on a 32-bit machine,
but on a 64-bit machine one puts the payload only
in L[1] and one only in L[0].
Meanwhile, the JIT, with its free-wheeling approach
to storage allocation, will probably try its best to ignore
and forget stupid L[1], allocating a right-sized register
or stack slot for L[0].
Thus interpreter and JIT can have independent internal
conventions for how they assign storage units to L[0:1] and
how they use those units to store a 64-bit value. Those
independent schemes have to be reconciled along mode
change paths: C2I and I2C adapters, deoptimization, and
on-stack replacement (= reoptimization).
The vframe_hp code does this. A strong global convention
would be best, such as always using L[0] and always storing
all of x in L[0] if it fits, else SND(x) in L[0] and FST(x) in L[1].
I'm not sure (and I doubt) that we are actually that clean.
Any reasonable high-level API for dealing with this stuff
will do like the JIT does, and pretend that, whatever the
size of L[0] is physically, it contains the whole value assigned
to it, without any need to inspect L[1]. That's the best policy
for virtualizing stack frames, because it aligns with the
plain meaning of bytecodes like "lload_0", which don't mention
L[1]. The role of L[1] is to provide "slop space" for internal
storage in a tiny interpreter; it has no external role. The
convention used in HotSpot and the JVM verifier is to
assign a special type to L[1], "Top", which means "do not
look at me; I contain no bits". A virtualized API which
produces a view on such an L[1] needs to return some
default value (if pressed), and to indicate that the slot
has no payload.
HTH
— John
P.S. If all goes well with Valhalla, we will probably get
rid of slot pairs altogether in a future version of the JVM
bytecodes. They spoil generics over longs and doubles.
The 32-bit implementations of JVM interpreters will have
to do extra work, such as using 64-bit slot sizes for methods
that work with longs or doubles, but it's worth it.
More information about the core-libs-dev
mailing list