methods with scalarized arguments

Fri May 18 20:42:22 UTC 2018

On May 18, 2018, at 8:07 AM, Roland Westrelin <rwestrel at redhat.com> wrote:
> 
> Hi John,
> 
>> Are you imagining a single nmethod with two entry points?  Currently,
>> nmethods *do* have two entry points for distinct calling sequences.
>> This might add two more:  <VEP, UEP> x <Buffered, Scalarized>.

Thanks for the details.  There are some more tricks we could play, maybe.
To be clear, I'm not proposing a solution, just throwing out ideas.

> The way the calling convention is implemented in MVT, scalarized
> arguments can be in registers or on stack. There are cases where the
> scalarized calling convention needs more stack spaces for arguments than
> the buffered calling convention. Something like:
> 
> m(v1, v2, v3, v4, v5)
> 
> a buffered call would have all 5 arguments in registers but a scalarized
> call could require stack space for some of the arguments (say if all 5
> values have 4 integer fields).

Suppose An are argument registers.  We can neglect FP and vector
regs for now.  For some n, An is in a special stack location, not really
a register, but that doesn't change the logic of what I'm talking about.
Then the buffered calling sequence would probably be:

m(A0=v1, A1=v2, A2=v3, A3=v4, A4=v5)

The scalarized calling sequence could be:

m(A0=v1.f1, A1=v1.f2, A2=v1.f3, A3=v1.f4, 
 A4=v2.f1, A5=v2.f2, A6=v2.f3, A7=v2.f4, 
 A8=v3.f1, A9=v3.f2, A10=v3.f3, A11=v3.f4, 
 A12=v4.f1, A13=v4.f2, A14=v4.f3, A15=v4.f4, 
 A16=v5.f1, A17=v5.f2, A18=v5.f3, A19=v5.f4)

Clearly many of those An will be on the stack.
It is also clear that this needs more stack space
than the previous calling sequence.
Is this close to what you are describing?
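
To make the stack-space difference concrete, here is a rough model of the two assignments above. It assumes six integer argument registers (an illustrative number; the real count depends on the platform and on the compiled calling convention) and four integer fields per value. The names NUM_ARG_REGS, assign_slots, buffered_items and scalarized_items are invented for this sketch.

```python
NUM_ARG_REGS = 6  # assumption: illustrative count, platform dependent


def assign_slots(num_items, num_arg_regs=NUM_ARG_REGS):
    """Split argument slots A0..A(n-1) into (in registers, on stack)."""
    in_regs = min(num_items, num_arg_regs)
    on_stack = num_items - in_regs
    return in_regs, on_stack


def buffered_items(num_values):
    # Buffered: one reference per value.
    return num_values


def scalarized_items(fields_per_value):
    # Scalarized: one slot per field of every value.
    return sum(fields_per_value)


buffered = assign_slots(buffered_items(5))
scalarized = assign_slots(scalarized_items([4, 4, 4, 4, 4]))
print(buffered)    # (5, 0)
print(scalarized)  # (6, 14)
```

Under these assumptions the buffered call needs no stacked arguments, while the scalarized call needs 14 stack slots that the buffered caller never allocated.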

> So if m() has 2 entry points, the
> buffered entry point will need to extend the area on the stack for
> arguments so it can put scalarized arguments on the stack before it
> jumps to the scalarized entry point. That stack space is in the caller
> so that wouldn't play well with stack walking.

Yes; callers allocate stack area for all arguments.  The callee is
allowed to use this area for any purpose whatever.  The callee
must not touch any other area of the caller's stack frame, nor
may it attempt to deallocate the caller-allocated argument area.
The caller will eventually deallocate this area.

Note that callees are free to shuffle data around inside the area
allocated by the caller.  This means that if the caller somehow
"knows" to allocate lots of space for the callee, the callee can use
it as scratch.  This is the trick that the SPARC ABI uses to support
varargs.  All callers allocate enough stack memory to store a
varargs dump area, even if nobody is using varargs.  The cost
is tiny:  You just have some extra stack area, contiguous with
any stacked arguments, to hold the arguments in registers.
How much extra area?  Six words, since SPARC has six
argument registers.  In this way, a varargs method can dump
all six arguments to their dump area, and from that point on
all arguments are in a linear array on stack (caller allocated
stack!).  It's as if the caller passed arguments scalarized in
registers, but the callee converted the call to buffered on
stack.  Not as complex as what we are dealing with with
value types, but some interesting parallels IMO.
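
The dump-area trick can be modelled in a few lines. This is a simplified sketch, not the real SPARC frame layout: it ignores the register-window save area and other fixed slots, and the names make_call and varargs_prologue are made up for illustration.

```python
NUM_ARG_REGS = 6  # SPARC passes the first six argument words in registers


def make_call(args):
    """Caller side: the first six args go in registers; the outgoing area
    holds six (still unused) home slots followed by the stacked args."""
    regs = list(args[:NUM_ARG_REGS])
    area = [None] * NUM_ARG_REGS + list(args[NUM_ARG_REGS:])
    return regs, area


def varargs_prologue(regs, area):
    """Callee side: dump the register args into their home slots, which
    live in the caller-allocated area."""
    for i, value in enumerate(regs):
        area[i] = value
    return area  # now one linear array holding every argument


regs, area = make_call(list(range(9)))
print(varargs_prologue(regs, area))
```

The caller pays for six extra words on every call; the callee writes into them only if it needs the linear view.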

I think it would be possible (not necessarily desirable, just
brainstorming here) for compiled-code callers which pass buffered
value types to *also* allocate enough outgoing argument space in
their stack frame to allow the callee to de-buffer everything.
That would give us frameless adapters, wouldn't it?

There would have to be some bookkeeping to remember which
items are value types and which aren't, and calling sequences
couldn't be invalidated by suddenly loading new value types
that were (up until now) just unknown types.  But that's not
a practical problem in the JIT, I think.  Value types are loaded
and known, mostly, by the time the JIT sets up calls.  There are
corner cases where nothing is known; in those cases there
should be a slower handshake of some sort which prevents
reformatting of arguments.  Idea:  Just like the x86-64 Linux ABI passes
a vector register count in al (the low byte of rax) for varargs calls,
we could contrive to pass an indication of how prepared the caller is
for the callee to unpack
the arguments.   We would only want to do that for calls which
are potentially problematic, not all calls, unless the indication
could be smuggled into the code stream of the caller.  (SPARC
V8 ABI does the code stream trick also, for struct returns, but
it's ugly.)
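
One way to picture that handshake, purely as a thought experiment: the caller passes a hint saying how many outgoing stack slots it reserved, or no hint at all if it knows nothing about the value types involved. The function choose_entry and the hint encoding are hypothetical; nothing like this exists in the current calling convention.

```python
def choose_entry(needed_stack_slots, caller_hint):
    """Callee-side decision in the hypothetical handshake.

    caller_hint is the number of outgoing stack slots the caller reserved
    for de-buffering, or None if the caller made no preparation.
    """
    if caller_hint is None:
        # Unprepared caller: do not reformat its arguments.
        return "slow-path"
    if caller_hint >= needed_stack_slots:
        # Enough caller-allocated space: unpack without a new frame.
        return "unpack-in-place"
    return "slow-path"


print(choose_entry(14, 14))
print(choose_entry(14, None))
```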

> Tobias suggested 2 entry points and one calls the other: the buffered
> entry point allocates stack space, shuffle arguments, pushes some on the
> stack and calls the scalarized entry point. That would solve the stack
> space problem but quite likely introduces other challenges (do we emit
> the call from one entry to the other at compile time or call into the
> runtime and resolve it?

If everything is in one nmethod, then there's no need for resolution.
A call (or jump, if frameless) would transfer from the argument shuffling
code to the real entry point.  Its offset would be location independent
and assigned by the branch resolver in output.cpp.

> Is the buffered entry point apparent in C2 IR or
> is it a custom generated blob of assembly?

IR, I suppose.  There's already C2 code for converting between buffered
and scalarized views of values.

> How does this affect stack
> walking?

If frameless, not very much.  If frame-ful, then there would be repetitions
of methods in the walk, unless they were suppressed.  To suppress them,
we'd want to mark the PC ranges of the argument shuffling code specially,
so the stack walker could see when an nmethod was in that state.  The
stack walker already makes some small distinctions between states in a
code blob (frame pointer is not set up before a particular PC).  If the argument
shuffling code were put into a distinct nmethod section, distinct from the
main code section, then a simple range check could tell the stack walker
which state the frame was in.
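
A toy version of that range check, for the frame-ful case where the shuffling code calls the main entry point. The class NMethodModel, its fields and visible_frames are invented for this sketch and do not correspond to actual HotSpot data structures.

```python
from dataclasses import dataclass


@dataclass
class NMethodModel:
    name: str
    shuffle_begin: int  # start of the argument shuffling section
    shuffle_end: int    # end of that section (exclusive)

    def in_shuffle_code(self, pc):
        # A simple range check tells which state the frame is in.
        return self.shuffle_begin <= pc < self.shuffle_end


def visible_frames(frames):
    """frames: list of (nmethod, pc) pairs, innermost first.
    Activations whose pc lies in shuffling code are suppressed."""
    return [nm.name for nm, pc in frames if not nm.in_shuffle_code(pc)]


m = NMethodModel("m", shuffle_begin=0, shuffle_end=32)
caller = NMethodModel("caller", shuffle_begin=0, shuffle_end=0)
# m's main body, then m's shuffling code that called it, then the caller.
print(visible_frames([(m, 100), (m, 16), (caller, 40)]))
```

With the duplicate suppressed, the walk reports m only once even though it has two activations on the stack.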

> Do we want to filter one of the activation of method m() from
> the stack that are reported on exceptions etc.?)

Yes.

> Or we compile 2 separate methods, which sounds like a waste of
> resources, requires runtime code to keep track of 2 separate nmethods
> for 1 method, runtime logic for dispatching and compilation policy
> change to trigger compilation of either one of the nmethods.

That sounds less desirable.  Although we take this ugly path with
OSR method versions.

> If we go with the 2 entry point solution, then null values are never
> allowed in compiled code.

To me that is a feature not a bug!

I'm not closing the door to nullable value types, but I am saying that
we want each method to know clearly, from local information, which
value types are nullable, and to expect that this occurs (for now)
*only* in legacy code.  In new code, value types are *never* nullable
(in today's designs; the future can wait).  What I particularly want to
avoid is a thought process like "nullability isn't important, because
legacy code might throw us a null, and we might want to do something
with it, so all value types are mostly-not-null-but-maybe-sometimes".

Most VT methods will be new code, and null has no overlap with values
in such code.  If a legacy method passes a null to new code expecting
a value, there must be an NPE.  If we try to work around the null, instead
of throwing, we are hurting the optimization of 99.9% of all future value
type code, on the grounds that legacy code must be given permission
to infect arbitrary new code with null values.

So, let's stay with one view on nulls per method, not two.
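
As a sketch of what one view on nulls could look like with two entry points: the buffered entry point rejects null, unpacks the fields, and transfers to the scalarized entry point, which never sees a null. Point, NullPointerError, buffered_entry and scalarized_entry are stand-ins chosen for the example.

```python
class NullPointerError(Exception):
    """Stand-in for Java's NullPointerException."""


class Point:
    """Stand-in for a buffered value type with two fields."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


def scalarized_entry(x, y):
    # Main body: works on fields only, so null cannot occur here.
    return x + y


def buffered_entry(p):
    # Gate: a null from legacy code is rejected before the main body.
    if p is None:
        raise NullPointerError("null passed where a value is expected")
    return scalarized_entry(p.x, p.y)


print(buffered_entry(Point(1, 2)))  # 3
```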

> With the 2 nmethods solution, the buffered
> nmethod could support running with null values at full speed (i.e. null
> values would not have to be gated so they don't enter the method). And
> it doesn't matter if m() is legacy or not. If it's legacy and null are
> passed around then only the buffered nmethod would ever be compiled and
> executed. If it's legacy and null are not passed then only the
> scalarized nmethod would ever be compiled and executed.

That's a nice story, except for my objection above.  You sketch four
states:  (nullable VT, clean VT) x (legacy, new).  I want to reject
the (nullable VT, new) combination totally.  If in the future we do
nullable value types, the fourth state might be useful, but in that case
the *user* is writing nulls on purpose (instead of legacy code
playing badly by accident), and it seems likely that other techniques
will apply in that world.

— John