considering the LDL

John Rose john.r.rose at oracle.com
Sat Dec 19 05:56:08 UTC 2015


On Dec 10, 2015, at 8:43 AM, Tobi Ajila <atobia at ca.ibm.com> wrote:
> 
> Link to docs: http://j9java.github.io/panama-docs


My basic approach to LDL is to try to describe its design
as simply and abstractly as possible, with a view towards
figuring out the minimum number of interlocking concepts.

One thing we can't do is try to emulate DWARF or
some other full-service language description scheme.
This is a weakness in the current jextract output,
where we currently output C types in ASCII form.

So, should language-specific names or types be in the
LDL?  Probably not, but (as you found) perhaps we need
some sort of naming if only to provide attachment points
to other metadata elsewhere.  What exactly are the names
in an LD?  Must they map to some source language, or
are they abstract link anchors to connect to other schemas?
(If they are just anchors, can we drop the names and
use just small integers?)

Do we need to allow expression of "complex number"
(I hope not!) or is it enough to allow a struct of two
scalars (possibly labeled)?  Do we even need the standard
distinction between signed, unsigned, and float,
and if so, why?  Should layout types be drawn from
C's primitives, or the set of operand types in some ISA,
or something else?

And what is the highest-leverage place to include
concepts like endian, volatile, atomic, immutable, alignment,
explicit offsets, bit-slices, and memory blocking factors?

Do we need separate concepts for on-heap and off-heap
or Java and native storage (I hope not)?

Can we design APIs for layouts that make the minimum
possible commitment to the design of the bookkeeping
that happens around memory accesses?

I like the clean minimal feel of what you've done, and
hope to (perhaps) shrink and sharpen it even more,
by evaluating the above concepts and attempting to
make it even more minimal.  Or if not, to clearly
identify the "extras" we want and be able to explain
exactly why.

For example, if a typical ISA makes an important distinction
between floating loads and integer loads, that might
be enough reason to make that distinction in layout types
also.  But the counter-argument is that the LDL could
ignore the distinction (working just with bit blocks)
and let the user decide what sorts of loads to issue,
based on richer type information outside the LDL.

A similar argument can be made even for endian-ness,
although that seems even more extreme than ignoring
float-ness.  In some sense, it's up to the end user to
decide which kind of endian load to issue, or, equivalently,
whether to use a native-endian load and issue a byte-swap
operation in the CPU.

Here's a stab at starting from next to nothing:
I think the absolute bare-metal minimum for a LDL
starts by postulating "memory", "registers", and
a "load/store unit" which moves bits between them.

The memory is an effectively endless sequence
of bits, grouped into bytes (octets), and (optionally)
into larger chunks, as power-of-two multiples
of octets with the usual alignment rules.

The registers are an effectively unlimited set
of fixed-sized variables, in a variety of sizes.
Sizes will often be powers of two, but I think
we want virtual registers of any size, so we
can talk about block copying without explicit
direct memory-to-memory operations.

(Though I'd rather not, perhaps registers also come
in classes, of integer, float, vector—maybe even pointer.
But arguably worrying about register classes is
a job for the user, and the user's register allocator.)

The LSU takes commands to move data between
the two places.  It is more complicated than it
looks, since loads and stores can differ according
to size, alignment, endian-ness, atomicity,
reorderability, and compatibility with register
sizes or classes.

(Though I'd rather not, perhaps we allow a register
to be incompletely filled by a load, or incompletely
used by a store.  Then there is the question of
which register bits are ignored, or filled by zero
or extended sign bits.)

Every LSU command operates relative to a contiguous
range of memory bytes located by an "effective address".
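
To keep myself honest, here is that machine as a few
lines of C.  The names (VReg, LsuCmd, lsu_load) are mine
and purely illustrative; this is a sketch of the model,
not a proposal:

   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   typedef struct {          /* a virtual register of arbitrary size */
       size_t  size;         /* in bytes; need not be a power of two */
       uint8_t bits[64];     /* backing store, big enough for a sketch */
   } VReg;

   typedef struct {          /* attributes carried by an LSU command */
       size_t  size;         /* number of bytes to move */
       int     big_endian;   /* byte order of the memory image */
       int     atomic;       /* must the move be indivisible? */
   } LsuCmd;

   /* Load: copy cmd.size bytes at effective address ea into r. */
   static void lsu_load(VReg *r, const uint8_t *ea, LsuCmd cmd) {
       memcpy(r->bits, ea, cmd.size);  /* block copy, as if unaliased */
       r->size = cmd.size;
       /* byte-swapping and atomicity are elided in this sketch */
   }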

All that seems both necessary and reasonably minimal
to me.  I am *not* trying to be hyper-general just for the
sake of it; I don't care today about memories with 9-bit or
1-bit bytes.  The exercise of generality is a way of simplifying
the design space, and factoring out separable concerns.

Adding in the LDL, we have a set of descriptions which
help us operate the Mem/Reg/LSU machine according to
meaningful "data access disciplines" and "data layouts"
(as your draft points out).

OK, so any particular formula of the LDL will describe a
"layout description".  I will call this a "Layout" for short.

(Your prototype uses the term "Layout" for the binding
of a layout description to a particular effective address.
I prefer to call that binding a "Reference" or "Location".
I don't have a better term.  But I'm pretty sure that "layout"
must be a term for something abstract that you work
with before you load or store your first byte.  My house
*has* a layout which it may share with other houses,
and which existed before the house was built;
my house is not itself a layout.)

What does a Layout do?  When combined with an
EA (effective address) it allows us to issue LSU
commands to move stored data around in a safe
and sane manner.  To do this, it must specify
one or more correspondences between bits in
memory and bits in registers (before either the
memory or register is chosen).

Since we assumed that LSU commands operate
on contiguous memory ranges, it follows that
a layout will determine one or more such ranges.
The embodiments of these ranges in a Layout are
called "containers" in your draft.  That's a fine term.
The point is that every container is separately
addressable via LSU commands.

Odd-sized containers can be loaded and stored into
odd-sized (virtual) registers, as if via block copy to
unaliased memory.  Some favored sizes and alignments
get better LSU functionality, such as atomicity.

(As a nice-to-have we could also consider breaking
down data at the bit-slice level.  Since this is not
a natural memory operation, it must be layered on
as an operation on registers.  Because some
native data formats are defined at the bit level,
perhaps we want to induce a numbering of bits
on Memory, and emulate a limited set of LSU
commands on those bit slices.  I'm skeptical
at present whether we need this.)

Given a Layout, a code generator could create
a full set of tiny subroutines (or IR snippets) which
execute the full repertoire of LSU commands
for each element of the layout.  The subroutines
would be named according to commands and
labels of elements within the Layout.
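
For instance, for a layout with a 4-byte element at
offset 0 and a 2-byte element at offset 4, the output
might look like this C sketch (hypothetical names; native
byte order assumed, with endian handling elided):

   #include <stdint.h>
   #include <string.h>

   /* Hypothetical generator output; names follow element position. */
   static uint32_t load4_elem0(const uint8_t *ea) {
       uint32_t v;
       memcpy(&v, ea + 0, 4);   /* LSU command: load, size 4, offset 0 */
       return v;
   }

   static void store2_elem1(uint8_t *ea, uint16_t v) {
       memcpy(ea + 4, &v, 2);   /* LSU command: store, size 2, offset 4 */
   }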

So, let's look at the concepts I mentioned
at the top of this message.

== endian-ness, aka byte order

I like the fact that your document makes byte
order explicit, rather than deferring to platform byte
order.  The LDL should make that explicit,
just like it makes sizes explicit.  A given LDL formula must
not denote different layouts on machines with
different byte orders or word sizes.  We want to
use the LDL for storage and transmission on
heterogeneous systems.  (Preaching to the
Java choir here.)

Let's assume that the LSU can load in either
byte order (scalars, I mean, not arbitrary blocks).
We have to assume this because otherwise
we can't have a single universal LSU model.
This means that each LSU command (load or
store) has a byte order attribute.
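
Spelled out in C, the attribute might look like this
sketch (generated code would pick whichever form is native
and byte-swap for the other):

   #include <stdint.h>

   /* Explicit-endian 4-byte loads: same result on any host. */
   static uint32_t load4_le(const uint8_t *p) {
       return (uint32_t)p[0]       | (uint32_t)p[1] << 8
            | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
   }

   static uint32_t load4_be(const uint8_t *p) {
       return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16
            | (uint32_t)p[2] << 8  | (uint32_t)p[3];
   }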

What would happen if the LDL ignores byte order?
Then the LDL cannot express a contract between
network endpoints about how to reassemble a
multi-byte scalar.  That seems bad.  So the LDL
needs this attribute (as your draft points out).

Typical designs don't mix byte order, so it is
plausible to specify byte order only at the
beginning of an LDL formula, and then wherever
it changes (if it does).

What about bit order?  I hope we don't need to put
this in the LDL, but the above argument would seem
to apply to bits also.  Bit-streaming formats tend to
make arbitrary choices about bit order just like CPUs
are arbitrary about byte order.  If we add the
"nice-to-have" feature of bit slices, I think we will
have to add bit-order notation also.

== size

A layout has a size in bytes.  If it consists of a single
element (traditional scalar value or other block of bytes),
its size is the size of that element.  If it is composite,
the size is derived from the composite; see below.

== alignment

Many machines (but not x86) care about alignment
when they load scalar types.  Typically, there are
separate CPU instructions for aligned and unaligned
access.  The LDL must give code generators enough
information to choose between efficiency and generality.

Thus, an LDL formula needs to express its expectations
about alignment.  (We assumed above that memory
does clump together above the byte level; this is where
we use the assumption.)  What's an expectation?  It
will be a power of two 2^G, of course.

For generality, the expectation could also include a
modular offset H in [0..2^G-1], for an exotic layout with
a small hanging prefix.

Given an alignment (2^G,0) or (2^G,H), a code generator
will be able to generate both forms of access, and be
able to test (if necessary) the alignment of proposed EA's.
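
The test itself is one line of C; a sketch, with G and H
as above:

   #include <stdint.h>

   /* Does ea meet the expectation (2^G, H)?  That is, is ea
    * congruent to H modulo 2^G?  H == 0 gives the usual test. */
   static int ea_satisfies(uintptr_t ea, unsigned g, uintptr_t h) {
       uintptr_t mask = ((uintptr_t)1 << g) - 1;
       return (ea & mask) == h;
   }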

Given several layouts and a composition rule (union,
struct, array, etc.) alignments can inform offset generation
and an overall alignment for a combined layout.
This is what C structs do.  That leads us to…

== composition

Layouts are really boring until you start combining them.
What is the simplest possible combination rule?
Concatenation, obviously, then replicated concatenation,
then overlaying at a common address.  Thus we get
struct, union, array.

A layout combination operator would take two layouts
and create a composite one, assigning new offsets
for each element in its new position in the composite.
When combined with alignment restrictions, we start
to see a demand for padding.  The rules get suspiciously
close to complex and arbitrary.

I'd like to propose a simpler, less structured composition
operator, a "nest" (taking a cue from "nested layout"
in your draft).  A "nest" is a sequence of one or more
layouts, each associated with a non-negative offset.

The offset of the first is zero.  The first layout is probably
not composite; it acts as the parent of the whole nest.
The other layouts can be called children or nestmates.
Crucially, the byte spans of all children must be wholly
contained in the span of the parent.

If all the offsets in a nest are zero, we have a union.
If all the child offsets are increasing from zero, and
the children are non-overlapping with no padding,
we have a packed struct.

The nest becomes more interesting if you add more
constraints on formation.  For example, the alignment
of the parent must satisfy the alignment of all the children,
in the obvious way.  Imposing this constraint will force
C-like structs to "grow" padding, if their children have alignment
constraints.  Or it will force the children to lose their constraints.
There's no one right answer, so it has to be an option.
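
Here is the C-like option written out as a sketch (the
Layout type is mine, not the draft's):

   #include <stddef.h>

   typedef struct { size_t size, align; } Layout;  /* align: power of two */

   static size_t round_up(size_t x, size_t a) {
       return (x + a - 1) & ~(a - 1);
   }

   /* Concatenate children in order, growing padding so each child's
    * alignment is satisfied; the parent inherits the strictest one. */
   static Layout concat(const Layout *kids, size_t n, size_t *offsets) {
       Layout parent = { 0, 1 };
       for (size_t i = 0; i < n; i++) {
           offsets[i] = round_up(parent.size, kids[i].align);  /* pad */
           parent.size = offsets[i] + kids[i].size;
           if (kids[i].align > parent.align)
               parent.align = kids[i].align;
       }
       parent.size = round_up(parent.size, parent.align);  /* tail pad */
       return parent;
   }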

I like the fact that there are no auto-padding rules in your
draft.  But in order to allow code generators to trust alignments,
I think we need rules for composition that preserve alignment
information in the way sketched above.

Basically, if you are putting together child layouts that have
alignment, you need to get the offsets right.  If you make
a mistake, the LDL constraints will tell you it is wrong, but
they won't fix it up for you behind your back.

Besides alignment checks, it is important that no child
include bytes outside the span of the parent.   Beyond
that, there's no logical reason to prevent children from
overlapping, or parents from containing unused padding.
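
Put together, the whole well-formedness check is small.
In C it might be (a sketch, with one choice of alignment
rule, and the same Layout shape as the earlier sketch):

   #include <stddef.h>

   typedef struct { size_t size, align; } Layout;

   /* A nest is well-formed if every child lies wholly inside the
    * parent and every offset honors its child's alignment; overlap
    * between children is deliberately allowed. */
   static int nest_ok(Layout parent, const Layout *kids,
                      const size_t *offsets, size_t n) {
       for (size_t i = 0; i < n; i++) {
           if (offsets[i] + kids[i].size > parent.size)
               return 0;    /* child escapes the parent's span */
           if (kids[i].align > parent.align)
               return 0;    /* parent can't promise this alignment */
           if (offsets[i] % kids[i].align != 0)
               return 0;    /* offset breaks the child's alignment */
       }
       return 1;
   }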

== names

What gets named in an LDL formula?  If the formula is
simple (not composite) there's no need for a name.
In the case of a nest, a code generator must be able
to respond to requests to access the whole nest (the
parent), or any of the children.  That's where names
come in.

But small integers are just as good, if you don't mind
keeping count.  And machines don't mind.  What you
come to are little path expressions like (1,2,1,1).
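
Resolving such a path against a nest is a trivial walk.
In C, roughly (a sketch, with a Node shape of my own
invention):

   #include <stddef.h>

   typedef struct Node Node;
   struct Node {
       size_t      offset;   /* offset within the enclosing parent */
       size_t      nkids;
       const Node *kids;
   };

   /* Follow a path like (1,2,1,1): each step picks a child (1-based
    * here) and its offset accumulates into the element's total. */
   static size_t resolve(const Node *root, const int *path, size_t depth) {
       const Node *n = root;
       size_t off = 0;
       for (size_t i = 0; i < depth; i++) {
           n = &n->kids[path[i] - 1];
           off += n->offset;
       }
       return off;
   }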

But I do think the LDL needs a pretty-printed version,
to be used for documentation or LDL processing tools,
in which names can be inserted (as if they were
annotations) and later on translated to path expressions.

What I'm suggesting is that when jextract emits LDL
formulas into struct APIs, there is no need for it to
mention names, as long as there is a way to derive
the path expression for each API point.  A tool which
prints out a human-friendly version of jextract output
could re-insert the names, by drawing them from the
other classfile metadata.

(I'm always suspicious of designs which require the
repetition of information already derivable from some
other source.  What goes wrong, eventually, is the
repeated bits go out of sync, and then we get surprises.)

== types

Are registers just bundles of bits and bytes, or do they have kinds?
If they don't have kinds, then the LDL can remain silent about
kinds also, since memory blocks don't have kinds either.
I'd prefer it this way; leave it to the register allocator and
code generator to pick register kinds.

As noted before, we don't want to sign up to create an
LTS (little type system) in the LDL.  That's a job for
other notations like DWARF or C itself.

== atomicity

Crucially, every LSU supports atomic reads and writes for
some sizes, usually 1, 2, 4, and 8 bytes.  A code generator
will sometimes need to issue an atomic load or store (for
a permitted size).  Just as with alignment, a code generator
has to be able to prove that the size and EA are compatible
with atomicity.

I think compatibility always amounts to a restriction on size
and alignment, so the above LDL attributes are enough to
ensure correct atomic commands.
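
In C, the whole compatibility test might be (a sketch,
assuming the usual 1/2/4/8-byte atomic sizes with natural
alignment):

   #include <stddef.h>
   #include <stdint.h>

   /* An atomic LSU command is permitted when the size is supported
    * and the EA is naturally aligned for that size. */
   static int atomic_ok(uintptr_t ea, size_t size) {
       int size_ok = (size == 1 || size == 2 || size == 4 || size == 8);
       return size_ok && (ea & (size - 1)) == 0;
   }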

== immutability

You might wonder if some sort of "const" attribute would
make sense in an LDL.  I think not, for similar reasons
as atomicity:  It's a decision in the user's hands which
LSU commands to issue.

== arrays, repetition

An array is almost a struct, except you don't always know how
many elements it has.  As soon as the number of elements is
known, you can reduce it to a nest of array elements.
(More crudely, you can always turn it into a block with only
a size and alignment.)  Is there need for any more mechanism?

I would like to think of the "array" operator in the LDL as a
sort of macro for a variable nest.  The hard part here is deciding
how to deal with late-bound parts of the layout.  So far, every
layout and nested layout can be statically bound to a size
and alignment, but arrays spoil that.

Perhaps the answer is to give the LDL a replication operator,
and allow an LDL formula to evaluate to a "variable-sized layout",
with one or more size parameters to be supplied later.
A code generator would just arrange to take the extra
runtime parameters.

Do we do the C-like thing and require a variable-length
child layout to always occur at the end?  Well, actually,
the "nest" abstraction doesn't care about this.  But it does
require that the parent be provably large enough to include
the children.

If we add arrays to nests and stir vigorously, we get formulas
like these from pseudo-C:

   struct TwoArrays { int n1, n2; double a1[n1]; byte a2[n2]; };

There are two problems here:  The offset of a2 is a complex
expression, and the size of the whole is another expression.

In a parameterized LDL formula, TwoArrays can be like this:

  template<int $n1, int $n2>
  struct TwoArrays { int n1, n2; double a1[$n1]; byte a2[$n2]; };

A code generator would need to be passed $n1 and $n2.
(Extra cleverness would allow it to fetch them from n1 and n2,
optionally.)
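
Spelled out as a sketch (assuming 4-byte ints, 8-byte
8-aligned doubles, and the C-like padding rule), the
arithmetic the generator would emit is:

   #include <stddef.h>

   static size_t round_up(size_t x, size_t a) {
       return (x + a - 1) & ~(a - 1);
   }

   static size_t off_a1(size_t n1, size_t n2) {
       (void)n1; (void)n2;
       return round_up(2 * 4, 8);        /* past n1 and n2, 8-aligned */
   }

   static size_t off_a2(size_t n1, size_t n2) {
       return off_a1(n1, n2) + n1 * 8;   /* past a1; bytes need no pad */
   }

   static size_t twoarrays_size(size_t n1, size_t n2) {
       return round_up(off_a2(n1, n2) + n2, 8);   /* whole-struct pad */
   }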

More generally, all the numeric parts of an LDL formula could
be either constants or variables (where the set of variables
is declared at the beginning of the formula, like a lambda
or a template).  This seems overly general; I'm not sure what
the right reduced-strength design is.

In the end, it's clear we need to support a modest amount of
runtime binding for offsets (of array elements), but we can't
sign up to prove properties like "non-overlapping", or "tightly
packed" or even "properly aligned", if a runtime-sized array
has gotten in somewhere.  Probably at some point you have
to say to the static code generator "trust me, you won't
run off the end", or maybe "here are some numbers you can
use to check my math".  Then efficient math-checking
becomes the JIT's problem; that's how Java does it.

== optional data

What about data that may or may not be present under the
control of a separate bitmask?  That is a common trick for
aggressive packing.  It spoils C-like structs but we need not
fear it.  It can be modeled as a nest of arrays like TwoArrays,
where each array is of runtime length 0 or 1.
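
A sketch, assuming for simplicity that every optional
field is 4 bytes and that bit i of the mask marks field i
as present:

   #include <stddef.h>
   #include <stdint.h>

   /* The offset of field i is the packed size of all present fields
    * before it; callers must first check that bit i is set. */
   static size_t optional_field_offset(uint32_t mask, unsigned i) {
       size_t off = 0;
       for (unsigned j = 0; j < i; j++)
           if (mask & (1u << j))
               off += 4;     /* each present predecessor adds 4 bytes */
       return off;
   }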

== variable-length scalars

If we have arrays, can we get variable-length ints?
With those, the bits that control the length sit right there
in the adjacent byte(s), encoded in some arbitrary fashion.

It's not clear how to merge this into an LDL.

In fact, recall that the basic assumption above
is that memory is random-access.  All you need
to do is take a base address, add some offsets
(perhaps computed at runtime) and fetch your data.

Variable-sized or parsed values belong to a streaming
model of memory, where you read one thing first,
make a decision, and then read the next thing.

So, probably there is an LDL-like thing adapted to
streaming, but we don't need to design it immediately.

== non-contiguous data

Does it ever make sense to load a layout element from
separated bytes?  Probably not, although I can imagine
places where you'd want to express something like this.
You can simulate this under layouts by allowing memory
itself to be abstracted; the LSU would operate on a
non-standard memory that is composed of overlays,
in such a way that the non-contiguous bytes become
briefly contiguous, long enough for the LSU to load via
a layout.  I'd say this is not a feature.

== chunking, vectorization

It would be helpful, on today's machines, if a code generator
can create vectorized operations on memory data.  That is,
the LSU would have "vector mode" or "streaming" commands
that work on vectors or streams of Layouts.

Getting vectorized loops right basically requires testing size
and alignment at runtime, and picking the right loop version.
The LDL already supports size and alignment queries, so
perhaps nothing more is required.
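
For the flavor of it, here is loop versioning in
miniature (a sketch; a real generator would emit both
loop bodies and branch on the runtime tests):

   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   /* Fill a byte range: take the 8-byte-chunk path when the size
    * and alignment tests pass, else go byte-at-a-time. */
   static void fill(uint8_t *ea, size_t size, uint8_t v) {
       size_t i = 0;
       if (size >= 32 && ((uintptr_t)ea & 7) == 0) {
           uint64_t word = 0x0101010101010101ULL * v;
           for (; i + 8 <= size; i += 8)
               memcpy(ea + i, &word, 8);  /* vectorizable chunk loop */
       }
       for (; i < size; i++)
           ea[i] = v;                     /* remainder, or slow path */
   }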

One possibility:  The LDL might include hints to allow code
generators to speculate on particularly favorable sizes or
alignments.  So the "bigarray(N,L)" operator would be identical
to "array(N,L)" but with an extra hint that N is likely to be
large enough to vectorize profitably.  Yuck, but you get the
general idea.

Similar to vectorization is cache line sparing.  The creation
or access of in-memory data might sometimes want to attempt
to touch as few cache lines as possible.  Again, size and
alignment speculations (or preferences or hints) in the LDL
might assist code generators with this.

But in the end this feels like names and types:  Nice to have,
sometimes, but not strictly needed.  Maybe the correct requirement
to deduce here is that the LDL be "friendly" to occasional injections
of annotated data, whether names, types, or speculation advice.

Well, that was long.

Merry Christmas, Happy Holidays, enjoy your break

— John

