yet more notes on varargs and calling sequences

Sat May 5 05:10:46 UTC 2018

The previous note didn't go into specifics about any particular ABI,
and (more subtly) it didn't make clear a very basic fact about varargs:
It is really just the normal argument passing conventions, under a light
coat of paint.  Contrast that to Java, where varargs uses a completely
different set of types (arrays instead of their elements).  This note
will relate varargs more closely to normal C calling conventions,
in the context of a specific example, the System V ABI for x86_64.

The day C grew function prototypes (a good day!) it gained two things.

First, a prototype lets a caller and callee make a checkable agreement
about static types.  Previously, under K&R declaration rules, the
agreement was just as important, but it was not so checkable.

Under K&R rules, whatever use-site types are present at a call determine
the calling sequence.  If they don't exactly match the def-site types, there
could be trouble.  For example, if I pass a short and the callee wants an
int, I need somebody to ensure that the short is sign-extended to an int
before it gets to the callee.  There are so many integral types that, if
each type had a different argument format, errors would be frequent.
This is why K&R C has the famous default (or "usual") conversion rules,
which normalize a large set of types to a much smaller set, so that
a small variation in the use-site type does not destroy the function
linkage.

What counts as a small variation?  Well, anything under an int
extending up to an int.  On a 64-bit machine, passing a long instead
of an int, or an int instead of a long, can cause unexpected truncation
or extension, but (with most ABIs) doesn't wreck other arguments.
Passing a pointer is just the same as passing a native machine word,
so all pointers have a common format.  This means that you can
neglect small differences, such as char* vs. void*, when linking
under K&R rules.  Pointers also have the same format as intptr_t.
(If we had machines that distinguished address from integer registers, the
intptr_t and pointer might go different ways, but they don't today.)
Finally, I can pass a float value and it will get widened to double,
so float vs. double counts as a small variation.

When we got prototypes, they included varargs as a bonus.  It turns
out that the simplest thing for an ABI to do with varargs is identify
them with K&R arguments, and then treat all arguments as K&R
with few extra fringe benefits.

This leads to the second new thing that prototypes do.  Sometimes
they let the compiler make a more efficient calling sequence.
Famously, floats don't get "usually converted" any more to doubles,
but can be passed unconverted.  In principle other conversions
could be avoided too, but it turns out that most of the other usual
conversions are of zero or negligible cost, given the basic paradigm
of passing a bunch of machine words in sequence from caller to
callee.

As a rule of thumb, if something already fits in a general register,
a prototype won't help improve its transmission.  Structs are already
passed by value under K&R rules, so prototypes are only likely
to help if they give crucial new information about the struct, such
as the fact that it looks like a complex number.  But even in those
cases, K&R rules may already have been adapted to favor complex
numbers by placing them (say) in pairs of floating point registers,
and the expression 'va_arg(vp, my_complex_struct)' may already
be doing something clever about pulling from registers.

There are types which are much newer than K&R rules and structs,
notably vectors.  For those, ABI designers have to make a serious
choice:  Do we treat them as structs, so they interoperate with varargs,
or do we treat them as special values packed into special registers?
The latter choice is often taken.  This means that prototypes do
allow, for some special types (and float) ways to pass data that
K&R functions, and varargs, can't touch.

But apart from those special types, prototypes provide just static
checking, followed by good old K&R calling conventions.  And
those conventions are almost always defined as a sequence
of machine words (of size intptr_t).  The types that are relevant
to these calling conventions are surprisingly few:  intptr_t (the
basic machine word), void* (the same word in pointer guise),
double, and structs.  Structs, in turn, are often broken down
into machine words which are bitwise copied from the first
memory unit of the struct forward, as contiguous intptr_t values.
The same goes for doubles (one or two intptr_t values).
The ABI may or may not demand respect for alignment of
doubles or structs beyond the native intptr_t; if extra alignment
is required, then an extra intptr_t of padding may show up.

But everything boils down to a series of intptr_t values, mixed
with special types like float, double, and vector, which may
pack into intptr_t values or may stay whole in special registers.
It is often the case that there is a 1-1 correspondence between
arguments and intptr_t values, especially on LP64 machines
where double and intptr_t look the same as memory words.

At this point ABIs get clever and hoist a set of lucky arguments
into whatever registers are assigned for the purpose.  The
System V ABI assigns 6 general purpose registers and 8 SSE
registers for arguments.  Well formed struct arguments *might*
get broken up into sub-arguments, each of which might be assigned
to a SSE or 64-bit integer register.  Vector types (which mostly
require prototypes) can be packed into single SSE registers.
Other ABIs do similar things; Solaris/SPARC/V9 uses the first
8 double registers for floating-point arguments and all 6 "out"
registers for other stuff, plus memory.

One might ask if ABIs get clever by packing (say) two shorts into
a register that holds an int, or four doubles into a vector register.
The answer is, generally, no, unless there is evidence that the
multiple components are logically a single argument, like a struct
or a vector.  Packing two unrelated values into a single intptr_t
is asking for trouble, since it just requires the callee to unpack
them again.  Saving a word at the top of the stack doesn't pay
us back for the extra machine cycles to pack and unpack, unless
there are special instructions for accessing sub-registers.

(There are half-register floats on SPARC, and I think that is regarded
as a mistake.  So, no packing arguments into registers, usually.)

With or without prototypes, with or without varargs, the calling
sequence for most C function calls is going to have a similar
character across many ABIs.  It is built on a series of machine words,
plus a few other values that are possible candidates for special
register treatment.

This leads to an interesting observation: We don't need all the
information in a C function prototype in order to accurately describe
the calling sequence of a function.  On the other hand, for some
arguments (K&R functions, varargs) there is no prototype.  There seems
to be a missing kind of information here, which is necessary to drive
function calls correctly.  This information can be derived from
function prototypes, or from the actual arguments to a K&R function
(after default promotions!).  What is this information and how do we
describe it?  Let's say that every C function call can be fully
represented (after static type checking) by decorating each of its
arguments with a token which denotes its ABI-specific argument
information.  (To complete the call, a return type is also needed, as
well as function entry point.  We'll ignore those.)  Each ABI has a
different account of how this information is deduced for arguments,
and how it combined into a full set of stack and register assignments.

Let's start with System V, which is here:
  http://software.intel.com/sites/default/files/article/402129/mpx-linux64-abi.pdf

Some references (clang) place it at this site, but it seems unavailable there:
  http://x86-64.org/documentation/abi.pdf 

Sections relevant to this discussion are 3.2 (function calls) and 3.5.7 (va_list).

Now, suppose we need to make a correctly formed call under the SysV
ABI, and we have already verified static types relative to any
available function prototypes.  Now, if we throw away the prototype
information (if any), we will need to restore enough information to
place each argument in the correct registers or memory locations.

Let's express this information as decorations on each argument.  For
this ABI, such decorations must encode information equivalent to the
following:

  INTEGER : anything that converts to intptr_t or void*
  FLOAT/S : any 32-bit floating type (argument prototype must be present)
  FLOAT : any 64-bit floating type, float-only structs <= 64 bits
  A/N:VECTOR/N : any vector type (N = 2, 4, 8), aligned
  BUFFER : object passed by invisible void* indirection (C++ only)
  2*INTEGER : structs that fit in 2 words with no floats or vectors
  2*FLOAT : structs that fit in 2 words with only floats (<80 bits)
  2*I_F, 2*F_I : certain 2-word structs with a clean mix of non-float and float
  A/2:2*INTEGER : special type _int128 with alignment requirement
  N*STACK :  everything else; N >= 1 counts intptr_t words
  A/N1:N*STACK :  memory with an alignment requirement (N1)

The special token FLOAT/S refers to register usage which is four bytes.
All other register uses are multiples of 8 bytes.  80-bit floats go to memory.
Memory is used as a single aligned sequence of 8 byte units.

To give a hint of how this generalizes to other ABIs, I've renamed
some of the SysV ABI argument class names: POINTER is merged into
INTEGER, MEMORY is renamed as STACK, and SSE is split into FLOAT and
VECTOR.  Apparently, X87 arguments are always dumped to stack, so that
doesn't need a decoration here.

Anything marked INTEGER will go into a general register, if available,
else it's the same as STACK.  Likewise with FLOAT and VECTOR (which
are really the same register class in this case).  A BUFFER argument
is not just a pointer, but rather an invisible pointer to a separately
buffered value which will be readable and writable by the callee.  A
prefix like N* indicates that there are several machine words to
handle at once.  A prefix like A/N: indicates that, if the item is
stored in memory, it must be aligned to N*8 bytes.  The special I_F
and F_I types denote two-word mixes with INTEGER and FLOAT
decorations.  There are no mixes of more than two words in this ABI.
ABIs which allow more complex mixes may require more combinations of
decorations.  It seems likely that, for any ABI, decorations like the
above will express the necessary degrees of freedom, and they can be
comfortably encoded in a single Java long.

The details of assigning decorations to arguments are driven by
prototypes or (lacking prototypes) a "K&R" type derived from the
actual argument type.  That will be one of void*, intptr_t, double, or
a struct type.  The ABI specifies details for assigning decorations;
let's just say that the output is one of the above decorations for
each argument.

So, suppose we have the arguments and their decorations.  We need to
pick places to store each distinct argument chunk (of size 1-8 bytes)
into into an 8-byte machine word or a register.  Note that, in a
setting like the JVM binder, it could be that the arguments have
already been buffered in something like a long[] array, waiting to be
written to stack and registers.  In some cases, the arguments will be
long values spread out on the Java stack, passed to an intrinsic like
linkToNative.  Independently, as an engineering decision.

The algorithm for binding the decorated arguments is as follows:

  1a. initialize the set of available I-registers to rdi, rsi, rdx, rcx, r8, r9
  1b. initialize the set of available V-registers to xmm0 through xmm7
  2a. map all occurrences of BUFFER to INTEGER, replacing the object with its pointer
  2c. map all occurrences of FLOAT/S and FLOAT to VECTOR/S and VECTOR/1
  3a. split 2*I_F (resp. 2*F_I) arguments to INTEGER, VECTOR/1 sub-arguments (resp. or VECTOR/1, INTEGER)
  3b. split all remaining arguments with a multiplier prefix into sub-arguments with the same decoration
  3c. in all of the above splits, record the relation of splits to original arguments for step 5d
  3d. in all of the above splits, record any alignment prefix (A/N:) on the first sub-argument for step 5e
  4. for each sub-argument or argument (left to right) perform step 5
  5a. if the decoration is INTEGER, use the next I-register, if possible
  5b. otherwise, if the decoration is VECTOR/*, use the next V-register, if possible
  5c. otherwise, if the decoration is not STACK, change it to STACK
  5d. if the decoration is STACK, and the sub-argument is related to previous sub-arguments,
      revert all related non-STACK decorations to STACK, and return the registers to I- or V- sets
  5e. if the decoration is STACK and it has an alignment request, honor the request by adding empty 8-byte slots
  5f. if the decoration if STACK, assign an 8-byte slot of memory, incrementing the memory size
  6a. move SP downward as needed to allocate space for all STACK arguments
  6b. align SP downward as needed to honor the most demanding of alignment requests, if any
  7. perform all assignments to the assigned registers or memory words
  8. for K&R functions or if prototype has "…", assign the final number of VECTOR/N decorations to rax

This algorithm contains elements common to most ABIs, such as a greedy
choice from a register class, backed up by a dump to an array of
memory words.  It also has ABI-specific features, like step 8.  Step
2c collapses floats into vectors, which is specific to x86.  Step 3b
splits small structs into components of distinct kinds, a decision
specific to this ABI (which surely complicates the va_arg macro).

Steps 1-6 can, in many cases, be done offline, either in a static
compiler, a binder runtime, or a JIT, as long as the argument
decorations can be deduced or speculated offline.

Other ABIs would have simllar sets of decorations and register
selection rules.  Generically, there is usually one or more classes of
registers which are assigned to matching arguments greedily from left
to right.  Some structs may be passed by reference, others in memory,
and some special ones in a combination of places, modeled above by the
concept of "sub-arguments".  Sometimes there are ad hoc alignment
constraints that cause empty slots to be inserted into the memory
layout.

It is not an accident that ABIs work greedily from left to right.
Correct implementation of the "va_arg(vp,T)" macro requires that an
argument's location (under K&R rules, a large subset of full ABI
rules) must be unambiguously determined by the type of the T argument,
the types of any previous arguments (but not following ones), and any
hidden metadata passed by the ABI (such as in step 8 above).  This
prevents ABIs from getting "too clever" by bin-packing the whole set
of arguments tightly into the available register resources.

How does this relate to Panama?  Well, not surprisingly, our native
call engine uses similar concepts as those above.  The current code in
nativeInvoker_x86.cpp is our prototype.  From Java it receives an
array of longs, encoding all the intptr_t arguments and sub-arguments
mentioned above.  It also receives a "shuffle recipe" that encodes the
decorations mentioned above.  The recipe unambiguously assigns to each
intptr_t value a register (or part of a vector register) or a stack
location.

(The creation of shuffle recipes is handled by the class ShuffleRecipe
and related classes in jdk.internal.nicl.abi.)

The shuffle recipe is at the heart of the logic for storing both
arguments and return values.  The recipe mentions the following
classes of storage locations, each of which is able to consumes a
stream of intptr_t argument values x from the Java array of longs:

  BUFFER : x assigned to the next buffer location (deep in stack)
  STACK : x assigned to the next stack argument location (from SP on)
  VECTOR : x assigned to the next 8-byte lane of current V-register (xmm0 through xmm7)
  //FLOAT : x assigned to the next F-register (if we had them, but we don't yet)
  INTEGER : x assigned to the next R-register (rdi, rsi, rdx, rcx, r8, r9)

The recipe is tensely bit-encoded in one or more longs.  Each three
bit field encodes one of these operations:

  PULL : take the next x and place it into the next location
  SKIP : skip the current partial register, or else skip a word (for alignment)
  STOP : switch to the next class of locations (BUFFER, STACK, VECTOR, INTEGER)
  CREATE_BUFFER : (BUFFER class only) label the current buffer location
  PULL_LABEL : (non-BUFFER categories only) place the next buffer label into the next location

(There are also encodings to locate the end of the recipe, and to
transfer from one long to the next long in an array.  Those are not
very interesting.)

Each class of argument is processed all at once, so arguments which
are mixed (2*F_I above) are not directly processed by this engine.
Instead, their intptr_t parts must already have been split apart and
separately marshaled by Java in the long array.  More importantly, all
separately buffered arguments (such as structs) are processed before
any of the other arguments, so their buffer pointers can be determined
on the fly.  Then the arguments destined for the stack are stored, and
finally the registers are packed.  A pre-pass over the recipe can
compute total sizes and register counts as needed.

So the long array is not ordered in argument order, but rather in
class order.  This is a significant departure from the usual
"argument-major" ordering that ABIs use to document argument storage.
(Recall this is partly driven by the requirements of va_arg .)
Instead, the ordering is "location-major", where the argument storage
locations are processed in a canonical order.  The "STOP" and "SKIP"
operations skip unused gaps in the canonical order without assigning
argument payloads to them.

The recipe method also works well for all kinds of return values, if
all data flows are reversed.  Finally, by re-reversing both data
flows, it can also assist in correctly forming callbacks from native
code.

(The alert reader will note that the recipe language didn't mention
actual register names.  It doesn't have to, because it is always
interpreted relative to a fixed set of ABI-defined registers and stack
locations.  That applies to both arguments and return values.)

The recipe is simple to traverse, both at compile time and run-time.
Currently we use interpretive techniques to shuffle between Java long
arrays and register and stack dump areas, according to a given recipe.
But it is also possible to open-code the execution of any given recipe
and generate normal JIT IR to perform the movements.  At that point,
the long array can also be exploded into separate arguments, and we
will get code that is approximately equivalent to a call site produced
by a native compiler.

(The alter reader has also noticed that the recipe treatment is a
little too clean to implement the SysV ABI fully.  What happened
to the assignment to rax for varargs?  How is alignment handled
for the memory storage, for real?  These details are left to the
alert reader for further enjoyment.)

You can see why I hope that we have a framework that can flexibly
express calls on future ABIs as they become available.  Perhaps
future recipes need a few more classes (like FLOAT) and maybe
another operator, but they are pretty expressive even now.

Well, that's enough for now.  Who knew varargs were so involved?

— John

P.S. Hero-kudos to Mikael Vidstedt for developing the recipe code.