More detail than I had intended on Layout description language
David Chase
david.r.chase at oracle.com
Wed Dec 17 23:15:54 UTC 2014
Sorry to be slow replying to previous email, but I've gotten snowed at this
end. However, I think there are some quick answers to some questions:
On 2014-12-17, at 10:26 AM, Tobi Ajila <atobia at ca.ibm.com> wrote:
> > 2. We will add
> > Unsafe.{be,le,native},{Load,Store,Cas}{Short,Int,Long,Float,Double}
> > to enable efficient implementation of 1. These should compile to intrinsics.
> > The reason to do it this way is to ensure consistent translation of little
> > languages into bytecodes across platforms, whenever possible, and also to
> > minimize the offense given to the keepers of the Java/JDK faith by confining
> > the ugly to the official pile of ugly. No need for endian variants of byte
> > load/store.
> This seems reasonable, and portable bytecodes are probably the right design goal. For the new Unsafe methods, let's ensure that the behaviour is well specified.
>
> Are the Float/Double versions necessary? So far the LD has been about specifying bits, not the type information. How likely is it that Java's float/double map correctly onto the native side's representation of float/double? Wouldn't using "Float.floatToRawIntBits(float) / intBitsToFloat(int)" and the Double equivalents make more sense?
On float/double I am not sure. On the one hand you are right that between
possibly baroque bitswaps on the incoming integers and longs and the existing
floatToRawIntBits/intBitsToFloat methods, we could do this; on the other hand
there is going to be some hope of making things very efficient, and that might
be aided by including a primitive (else we put ourselves into the
pattern-matching intrinsic-substitution business). Probably I need to attempt
to concoct a use case, and before I go very far into that, I can say that what
I imagine involves a pair of processors sharing memory, with mixed endianness
between them. Do we care?
Do GPUs enter into the float issue at all, or do they use processor formats?
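For what it's worth, here is a minimal sketch of the bits-plus-conversion route
I mentioned above -- reading a big-endian float without a float-specific
primitive. The helper name and byte-array addressing are only illustrative,
not a proposed API:

    // Illustrative only: read a big-endian float by assembling the raw int
    // and converting, rather than via a dedicated float load primitive.
    static float beFloatAt(byte[] mem, int off) {
        int raw = (mem[off]     & 0xFF) << 24
                | (mem[off + 1] & 0xFF) << 16
                | (mem[off + 2] & 0xFF) << 8
                | (mem[off + 3] & 0xFF);
        return Float.intBitsToFloat(raw);
    }

On a big-endian host the byte assembly would presumably collapse to a plain
load, which is the sort of thing an intrinsic would get us for free.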
> > 4. I don't know the best way to express offsets. Uniformity suggests that
> > we express all offsets in terms of bits, and we would then do container-math
> > to extract the byte offset of the container (8, 16, 32, or 64-bit) and the
> > shift/mask of the field to extract.
> > That is, an endianness, offset, field size, and alignment/container size
> > (ALERT: what about platforms that allow unaligned containers?) are required
> > to identify the bits for a field. Endianness tells us how to load the
> > container, alignment/container size tells us both how large a thing to
> > load and how to convert the bit offset into a byte offset + shift, and the
> > size tells us what mask to use.
>
> Our first thought on containers was that it unnecessarily exposes implementation details. We now see it as an attempt to externalize the underlying memory model.
I think you need the containers to make sense of an endianness specification
on a platform whose native endianness differs. It's part of the "how to support
single descriptions of network protocols" goal. There may be a better way -- this is
the one that worked for me (and I got to it by reverse-engineering the C layout
rules, so it is no surprise it is a good match for C layouts).
Maybe this is not entirely necessary -- I'm trying to think through the problem of
specifying and decoding a big-endian network protocol on a little-endian box,
given that the spec I was handed is big-endian. Any field that lands within a
single memory byte is easy -- there's an offset, a load, and a shift+mask
to apply. Suppose it is larger -- say the field is 8 bits at offset 13. That puts
it in big-endian bytes 1 and 2, meaning it lands within a 32-bit quantity
(the included bits are 13-20; the endpoints fall in different bytes and different
16-bit halves, but in the same 32-bit word).
Therefore I can convert this to an appropriate byte offset and shift within a 32-bit word.
So I think a container-free model is also possible, if it is expressed using offset
and size, and if endianness is made explicit (so we can do network protocols
across platforms with a single definition) -- however also note that we probably
need to know the expected alignment of the structure containing the fields.
There are surely optimizations that could take advantage of that.
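To make the 8-bits-at-offset-13 arithmetic concrete, here is a rough sketch
(method and variable names are purely illustrative) of finding the smallest
naturally aligned container and deriving the byte offset, shift, and mask from
a big-endian bit offset and field width:

    // Illustrative sketch: smallest power-of-two container holding the field,
    // plus the byte offset, shift, and mask needed to extract it, assuming
    // big-endian bit numbering (bit 0 = most significant bit of the container).
    static void locate(int bitOffset, int bitSize) {
        int first = bitOffset, last = bitOffset + bitSize - 1;
        int container = 8;
        while (first / container != last / container) {
            container *= 2;                      // 8 -> 16 -> 32 -> 64
        }
        int byteOffset = (first / container) * (container / 8);
        int shift = container - (first % container) - bitSize;
        long mask = (bitSize == 64) ? -1L : (1L << bitSize) - 1;
        System.out.printf("container=%d byteOffset=%d shift=%d mask=0x%x%n",
                          container, byteOffset, shift, mask);
    }

For 8 bits at offset 13 this yields a 32-bit container at byte offset 0, a shift
of 11, and a mask of 0xFF after the (endian-converting) load.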
An alternate approach is John Rose's proposal to do it with concatenated octet
bitfields -- this is nicely unambiguous too, but it has the difficulty that it allows the
expression of many things that we would then need to implement (and could get wrong)
even though there is no compelling use case for them.
> > If you look at C bitfields, there is very
> > much a model of “load a box this big, then shift the data towards the low order
> > bits some distance, then mask/sign-smear the part we don’t care about”.
>
> The container model increases the number of memory reads when accessing bit-fields.
Not quite -- you are taking the container model too literally (see above for its actual purpose).
(My earlier email did describe it literally -- part of this requires you to temporarily dial your
mental model of a compiler back to an age when register allocation and similar things
were done so poorly that register windows made sense).
I can tell you that C compilers will cheerfully substitute the smallest load that will grab the
referenced field, and they've done this for decades.
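In Java terms, the quoted "load a box, shift, mask/sign-smear" model amounts to
something like the following sketch (assuming the container has already been
loaded into a long in native bit order; names are illustrative):

    // Extract an unsigned field: shift it down, mask off the rest.
    static long unsignedField(long box, int shift, int size) {
        return (box >>> shift) & ((size == 64) ? -1L : (1L << size) - 1);
    }
    // Extract a signed field: shift it to the top, then arithmetic-shift
    // back down so the sign bit is smeared through the high-order bits.
    static long signedField(long box, int shift, int size) {
        return (box << (64 - shift - size)) >> (64 - size);
    }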
> Reading a single bit field has the side effect of reading all other fields in the container. This would affect the behaviour of systems with memory mapped I/O that take certain actions when an address is read. e.g. incrementing a counter every time an address is read.
It's my understanding (I'd need to check the C spec, I am working from memory)
that there are no guarantees made in C about how large a read is done to access
the bitfields of a structure, and certainly none smaller than the size specifier on the
field (i.e., uint32_t x:1 might be accessed with a load using "1", 8, 16, or 32 bits,
and perhaps even 64 on some processors). That said, C programmers can be
surprisingly good at internalizing (worshipping?) the quirks of the compiler on their
own particular platform, and this will tend to be a problem. It's not exactly a good
spec for us to say "mimic the most popular C compiler on the particular platform"
so I don't much like this (that said, if I had to bet my own money, I'd place a decent-sized
bet on the C compiler always picking the smallest load size that gets all the bits).
> We should also consider the performance impact of masking + shifting every time we interact with a bit field.
> In the example below, do we need to mask + shift + OR + CaS every time we write to a field in that structure?
>
> struct A {
> int16_t val1 : 8;
> int16_t val2 : 8;
> }
> ">,16",
> "val1, 0, 16, 8",
> "val2, 8, 16, 8"
No, because (I am pretty sure) C compilers do not do that either if they can get away
with byte-at-a-time loads and stores. Just checked: clang really goes to town on the optimized
stores. *But I have not checked what happens if the bitfields are volatile* -- and now I have,
and it changes the behavior, how about that?
Well this is annoying. How far do we want to go down the rathole of mimicking C behavior
*and* performance?
I'll try to put together some carefully written tests that will allow us to learn more about the
behavior of C compilers, assuming we care about "volatile" fields.
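For reference, the full mask + shift + OR + CAS sequence the question asks
about -- which I would expect only for fields we tag volatile/atomic -- would
look roughly like this. A sketch only; the Unsafe-based addressing and the
32-bit container are assumptions for illustration:

    // Sketch of a volatile/atomic field store: read-modify-write the whole
    // 32-bit container, retrying with CAS until no other writer intervenes.
    static void casStoreField(sun.misc.Unsafe u, long addr,
                              int shift, int size, int value) {
        int mask = ((size == 32) ? -1 : (1 << size) - 1) << shift;
        int oldWord, newWord;
        do {
            oldWord = u.getIntVolatile(null, addr);
            newWord = (oldWord & ~mask) | ((value << shift) & mask);
        } while (!u.compareAndSwapInt(null, addr, oldWord, newWord));
    }

An ordinary (non-volatile) store can skip the CAS loop and, as with C, can
often be narrowed to a plain byte or short store.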
> Will we have restrictions on container sizes?
Yes.
> If someone specifies a 128 bit container in the LDL, how will it be dealt with on platforms that don't have those container sizes?
If we ignore memory model issues, we have no problem with that --
it merely tells us how to deal with the address arithmetic
(which is all that I was initially thinking about here) and we can
make it work. There is the minor issue of specifying containers
larger than the hardware's actual containers, in that we can no longer do the shortcut
of "load container, shift, mask, done" -- we'll need to load, load,
shift, shift, mask, mask, or. But if the container is just a way to express
address ranges, that is only a SMOP (a simple matter of programming).
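Spelled out, purely as a sketch (assuming little-endian bit numbering and that
the field genuinely straddles two 64-bit words already loaded in native order),
the two-word path is:

    // Combine a field that straddles two adjacent 64-bit containers.
    // 'lo' holds the low-addressed word, 'hi' the next one; the field starts
    // 'shift' bits into lo (0 < shift) and is 'size' bits wide (shift + size > 64).
    static long straddlingField(long lo, long hi, int shift, int size) {
        long mask = (size == 64) ? -1L : (1L << size) - 1;
        return ((lo >>> shift) | (hi << (64 - shift))) & mask;
    }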
> Do we want to go so far as to specify the number, ordering, and atomicity of memory operations used to read a layout field?
>
> > Perhaps, per field:
> > offset (in bits)
> > container size (in bits)
> > [field size (in bits) = container size]
> > [container alignment (in bits) = container size]
> > [endianness = structure default]
> >
> > Structure information would look like
> > endianness (<,> -- see Python link for other options)
> > container size (in bits)
> > [container alignment (in bits) = container size]
>
> Then there are some ambiguous cases. e.g.
> ">, 16",
> "x, 0, 8, 8",
> "y, 4, 8, 8", //where does this one start? it has offset of 4 but alignment of 8
> "z, 8, 8, 8"
>
> What is the expected behaviour of this? or is it even legal? To be more general, how do we handle fields that don't fit into their containers?
If we are mimicking the behavior of C compilers,
there will be no fields that do not fit in their containers
(to the best of my knowledge). We could, however,
elect to support such fields, but asking for volatile behavior on
them has to be an error, because in the face of
native code activity we cannot possibly make that guarantee.
In the case of an offset of 4 with an alignment of 8, that's just rejected.
However, John has suggested that we might want to support
something that was offset 4, container 8 (?), size 8, alignment 4, e.g.
"y, 4, 8, 8, 4", //where does this one start? it has offset of 4 but alignment of 8
I agree that this is screwy. Under these rules, what the heck does the container mean?
Do we revisit John's method instead, perhaps with the restriction that all the parts
of a field must be contiguous under the specified endianness, plus an optional
"volatile" indicator?
One potential gotcha is that the volatile specification might interact with the container size
in C code -- I do not know this yet, but obviously it needs to be checked.
But again -- if the containers are a problem, I think a model along these lines
would also work:
  explicit endianness;
  endian-dependent bit offset + size spec for fields;
  potentially explicit alignment for structures (note that there is an implicit desired
  alignment obtained from the needs of the field loads, but there might be cases
  where we wish to override this -- it would disable the larger-than-a-byte load
  optimizations for fields that straddle byte boundaries);
  ability to tag fields as "volatile" or "atomic" (Doug, if you have some ideas
  how this could be better or worse....);
  and various errors signaled for impossible-to-implement cases, like volatile fields
  that straddle a load-atomic boundary.
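Purely as illustration (not a proposed API), the pieces of that model written
down as a value class might look like:

    // Illustrative only: the elements of the container-free model above.
    final class FieldSpec {
        final boolean bigEndian;  // explicit endianness (or structure default)
        final long    bitOffset;  // endian-dependent bit offset of the field
        final int     bitSize;    // field width in bits
        final int     alignBits;  // alignment in bits, inherited from the
                                  // enclosing structure unless overridden
        final boolean atomic;     // "volatile"/"atomic" tag; an error if the
                                  // field straddles a load-atomic boundary
        FieldSpec(boolean bigEndian, long bitOffset, int bitSize,
                  int alignBits, boolean atomic) {
            this.bigEndian = bigEndian;
            this.bitOffset = bitOffset;
            this.bitSize   = bitSize;
            this.alignBits = alignBits;
            this.atomic    = atomic;
        }
    }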
Sorry not to have a reply to the sub/sep email yet -- that's harder to answer I think,
plus I have been busy.
David
> To summarize, we think:
> 1) The LD implementation should be able to derive container sizes from field sizes and offsets
> 2) Container attributes create more opportunity for inconsistent layout descriptors.
>
> If we need to specify a memory model, then we propose the following strawman rules:
> 1) Where a field is completely contained inside a container, and where the container size is no larger than a platform-dependent limit, reads and writes of the field will be atomic.
> 2) Reads and writes of a field may cause reads and writes of adjacent fields in the same container.
> 3) Modifying the value of a field preserves the value of adjacent fields in the same container. This rule does not apply to overlapping fields.
>
>
> Doug Lea's input may be valuable for refining this.
>
> Is this level of specification needed?
>