layout description - a proposal

Fri Mar 2 18:54:15 UTC 2018

Hi,
in the past few weeks I carried out some analysis to explore (i) what
other FFI implementations do in terms of layout description [3] and
(ii) what kinds of layout descriptions are common in message protocols
[4]. Having collected this data, I now feel more confident in putting
forward a proposal which should be simple, but still expressive
enough to capture more exotic use cases (such as those found in message
protocols).

Below, I'm going to describe the _requirements_ of the layout
description that is put forward by this proposal. As such, I shall make
no assumptions about how such a description might be surfaced in a
potential language. After all, the goal of this exercise is to capture
the semantics of a description - if, at some point, we feel that such a
description should be reified into a layout language, we can do so
accordingly, but I feel that should be the last step, not the first.
That said, I'd also like to thank all the folks from IBM who contributed
to the current LDL effort (see [2]) - I think many of the conclusions
reached in that document are still valid - with a few twists that I'm
going to show below.

1) Scalars: we should only support three _kinds_ of scalars: signed
integrals, unsigned integrals and floating-point values. It is important
for the description to distinguish between these three kinds, as the
scalar kind affects how a scalar value is treated (e.g. which CPU
register should be used for a load operation). Vector support should
also be considered in the future - as another possible scalar kind. The
size of a scalar should always be a multiple of 8 bits, except for
bitfields (see below).

2) Explicitness: the description should be as _explicit_ as possible. 
That is, details such as (i) endianness, (ii) size of a scalar should 
always be reified in the language description and not be subject to 
platform-dependent considerations. (We will see later how 
platform-dependent types such as C's 'int' could be implemented).
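
Points 1 and 2 can be illustrated with a minimal Java sketch. All names
here (ScalarKind, ScalarLayout, readInt32) are hypothetical and are not
part of any existing Panama API; the sketch only assumes that a scalar
description carries its kind, its size in bits and its byte order, and
shows how the kind changes the way the same four bytes are decoded.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical names; an illustration of the requirements, not the Panama API.
enum ScalarKind { SIGNED_INT, UNSIGNED_INT, FLOATING_POINT }

final class ScalarLayout {
    final ScalarKind kind;
    final int bitSize;
    final ByteOrder order; // endianness is always explicit, never inherited from the platform

    ScalarLayout(ScalarKind kind, int bitSize, ByteOrder order) {
        // Outside of bitfield overlays, scalar sizes are multiples of 8 bits.
        if (bitSize <= 0 || bitSize % 8 != 0) {
            throw new IllegalArgumentException("size must be a positive multiple of 8 bits: " + bitSize);
        }
        this.kind = kind;
        this.bitSize = bitSize;
        this.order = order;
    }

    // Decodes a 32-bit integral at the given byte offset.
    // The scalar kind decides whether the raw bits are sign-extended or not.
    long readInt32(byte[] bytes, int offset) {
        if (bitSize != 32 || kind == ScalarKind.FLOATING_POINT) {
            throw new IllegalStateException("not a 32-bit integral layout");
        }
        int raw = ByteBuffer.wrap(bytes).order(order).getInt(offset);
        return kind == ScalarKind.SIGNED_INT ? raw : Integer.toUnsignedLong(raw);
    }
}
```

With the bytes FF FF FF FE, a big-endian signed description decodes to
-2, while a big-endian unsigned description decodes to 4294967294; the
bytes are identical, only the description differs.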

3) Addresses: in addition to scalars, we also need to have an explicit 
description for layouts that are meant to represent memory 
addresses/pointers. Such description could (optionally) reify info about 
the layout of the memory region pointed to by the address.

4) Layouts can be combined into groups; two groups are supported: 
product-like groups (aka structs) and sum-like groups (aka unions). A 
group is made up of several element layouts.

5) Layouts can be repeated (e.g. _array_ layouts); some support should
be provided for cases where the repetition count is not known statically.
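
As a rough sketch of points 4 and 5, groups and repetitions can be
modeled as layouts whose size is derived from their elements. The names
below (Layout, Bits, Group, Sequence) are hypothetical. Two assumptions
are made that the proposal does not spell out: a struct has no implicit
padding (any padding would have to be an explicit element), and an
unknown repetition count is modeled as an empty OptionalLong, which
makes the size of every enclosing layout unknown as well.

```java
import java.util.Arrays;
import java.util.List;
import java.util.OptionalLong;

// Hypothetical model: a layout reports its size in bits, if statically known.
interface Layout {
    OptionalLong bitSize();
}

// Stand-in for a scalar: just a fixed number of bits.
final class Bits implements Layout {
    private final long bits;
    Bits(long bits) { this.bits = bits; }
    public OptionalLong bitSize() { return OptionalLong.of(bits); }
}

// Product-like (struct) or sum-like (union) group of element layouts.
final class Group implements Layout {
    enum Kind { STRUCT, UNION }
    private final Kind kind;
    private final List<Layout> elements;

    Group(Kind kind, Layout... elements) {
        this.kind = kind;
        this.elements = Arrays.asList(elements);
    }

    public OptionalLong bitSize() {
        long total = 0;
        for (Layout element : elements) {
            OptionalLong size = element.bitSize();
            if (!size.isPresent()) return OptionalLong.empty();
            // Struct: elements follow one another. Union: elements overlap.
            total = kind == Kind.STRUCT ? total + size.getAsLong() : Math.max(total, size.getAsLong());
        }
        return OptionalLong.of(total);
    }
}

// Repetition of one element layout; an empty count means "not known statically".
final class Sequence implements Layout {
    private final OptionalLong count;
    private final Layout element;

    Sequence(OptionalLong count, Layout element) {
        this.count = count;
        this.element = element;
    }

    public OptionalLong bitSize() {
        OptionalLong size = element.bitSize();
        if (!count.isPresent() || !size.isPresent()) return OptionalLong.empty();
        return OptionalLong.of(count.getAsLong() * size.getAsLong());
    }
}
```

For example, a struct of a 32-bit and a 64-bit element is 96 bits wide,
the corresponding union is 64 bits wide, and a struct ending in a
sequence of unknown length has no static size.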

6) Layouts can be _annotated_ - that is, it should be possible to
associate key=value annotations with any layout. Of these, a special
role should be given to 'name' annotations, which would allow a layout
to be referenced from other layouts (see below).

7) Named layouts can be _referenced_ from other layouts. This is a
crucial property; once a layout has a name, another layout can refer to
it by name via an _unresolved layout_. An unresolved layout is simply
an annotated hole - where the contents of the hole will be replaced
dynamically with a suitable layout (which layout is substituted into
the hole depends on the annotations available in the unresolved layout).
This takes care of a bunch of use cases:

    (a) express dependencies between multiple layouts w/o the need of 
inlining one layout description into another (which could lead to cycles)
    (b) have a way to refer to the layout of a struct field; if a native 
struct whose name is S has a field layout whose name is f, the field 
layout could be referenced using an unresolved layout which is annotated 
with a layout expression like "S.f".
    (c) allow for macro-like behavior; for instance, platform-dependent
types such as 'int' and 'long' could be modeled as references to a hidden
layout which contains a bunch of platform-dependent sub-layout definitions.
    (d) could be used to represent intra message dependencies in message 
protocols
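
A minimal sketch of how an unresolved layout might be resolved, covering
use cases (b) and (c) only. Everything here is hypothetical: the class
names, the use of a free-form descriptor string as a stand-in for a real
layout, and in particular the annotation key "layout" used to carry the
layout expression (the proposal does not name that key).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// A concrete layout with a 'name' annotation. The descriptor string is a
// free-form stand-in for the actual layout description.
final class NamedLayout {
    final String name;
    final String descriptor;
    final Map<String, NamedLayout> fields = new LinkedHashMap<>();

    NamedLayout(String name, String descriptor) {
        this.name = name;
        this.descriptor = descriptor;
    }

    NamedLayout withField(NamedLayout field) {
        fields.put(field.name, field);
        return this;
    }
}

// An unresolved layout: a hole that carries nothing but annotations.
final class UnresolvedLayout {
    final Map<String, String> annotations;

    UnresolvedLayout(String layoutExpression) {
        // "layout" is an assumed annotation key, chosen for this sketch.
        this.annotations = Collections.singletonMap("layout", layoutExpression);
    }
}

// Replaces holes with named layouts; "S.f" selects field f of layout S.
final class LayoutResolver {
    private final Map<String, NamedLayout> byName = new HashMap<>();

    void define(NamedLayout layout) {
        byName.put(layout.name, layout);
    }

    NamedLayout resolve(UnresolvedLayout hole) {
        String expression = hole.annotations.get("layout");
        String[] path = expression.split("\\.");
        NamedLayout current = byName.get(path[0]);
        for (int i = 1; current != null && i < path.length; i++) {
            current = current.fields.get(path[i]);
        }
        if (current == null) {
            throw new IllegalStateException("cannot resolve layout: " + expression);
        }
        return current;
    }
}
```

A platform-specific resolver could define "int" as a signed 32-bit
little-endian scalar, so that a hole annotated with "int" resolves
differently on each platform while the layouts that refer to it stay
unchanged. Since resolution happens on demand, a layout can also refer
to itself by name without being inlined into its own description.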

8) Integral scalars can be broken down into bit fields - that is, a
scalar can be associated with a _group overlay_, which defines a
substructure that is to be associated with the said scalar. Fields in
the overlay can be named and, as a result, can be the target for
replacements within other unresolved layouts. Within the substructure
of the overlay group, scalar fields can have sub-byte sizes (the only
place where this can happen).
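
The overlay idea in point 8 could look roughly like this in Java. The
class BitOverlay and its methods are hypothetical, and the sketch
assumes that fields are allocated starting from the least significant
bit of the containing scalar; the proposal itself does not fix a bit
order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical overlay that splits an integral scalar into named bit fields.
final class BitOverlay {
    private final int containerBits;
    private final Map<String, int[]> fields = new LinkedHashMap<>(); // name -> {offset, width}
    private int usedBits;

    BitOverlay(int containerBits) {
        // The container is an ordinary scalar: a multiple of 8 bits, at most 64 here.
        if (containerBits <= 0 || containerBits > 64 || containerBits % 8 != 0) {
            throw new IllegalArgumentException("unsupported container size: " + containerBits);
        }
        this.containerBits = containerBits;
    }

    // Appends a field of the given width; sub-byte widths are allowed here.
    BitOverlay field(String name, int width) {
        if (width <= 0 || usedBits + width > containerBits) {
            throw new IllegalArgumentException("field does not fit: " + name);
        }
        fields.put(name, new int[] { usedBits, width });
        usedBits += width;
        return this;
    }

    // Extracts the unsigned value of a named field from the container value.
    long extract(long container, String name) {
        int[] field = fields.get(name);
        if (field == null) {
            throw new IllegalArgumentException("no such field: " + name);
        }
        long mask = field[1] == 64 ? -1L : (1L << field[1]) - 1;
        return (container >>> field[0]) & mask;
    }
}
```

For a 16-bit scalar overlaid with 'version' (4 bits), 'flags' (3 bits)
and 'length' (9 bits), the value 0xABCD yields version 13, flags 4 and
length 343 under the bit order assumed above.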

I think this covers the basics. As you can see, this proposal sits
somewhere in between the Type Descriptor (TD) proposal [1] and the LDL
proposal [2]. It gives up the ability to denote language-dependent types
(such as int and float), which is present in TD; at the same time, it
commits to an 'always explicit' policy, which is not the default in TD.
As such, it can be argued that a description in my proposed language is
very precise and also machine-dependent (which is the same choice LDL
makes). But it also gives up some of the generality of the LDL proposal,
namely the ability to reason about non-byte-aligned layouts (e.g. in
LDL you can say things such as 'b13' to denote a sequence of 13 bits).
This feature would rarely be used in practice, and in my analysis of
message protocols I did not find any need to model arbitrary bit layouts
- it is very typical for message protocol packets to be byte-aligned,
to minimize encoding/decoding effort. Also, endianness is dealt with in
a much simpler way - that is, endianness is a property of a scalar, 
while LDL has a much more general framework ('flip' operator) to model 
endianness. As in LDL, it is possible for layouts to have annotations - 
but I strongly feel that important representation distinctions should be 
captured in the language rather than in another meta-language. In other 
words, the more we make a description just about a 'bunch of bits', the 
less said description has to say about the behavior that is attached to 
such bits - meaning that this info will have to be recovered somehow, by 
using extra metadata, or by other means. This is why I opted for 
reifying information such as the scalar kind and the pointee information 
in the description itself (while LDL delegates that job to a suitably 
named annotation).

We plan to start working on the new API soon, and to replace the 
existing internal layout API with a public, and (hopefully :-)) 
well-specified one. Once that's done, we'll also work to upgrade the 
layout grammar and to make the necessary adjustments to jextract in 
order to emit the right set of annotations/native descriptors. This work
might take place in an experimental branch first, so as to minimize
disruptions.

[1] - http://cr.openjdk.java.net/~mcimadamore/panama/layout-grammar.txt
[2] - http://cr.openjdk.java.net/~jrose/panama/minimal-ldl.html
[3] - http://mail.openjdk.java.net/pipermail/panama-dev/2018-January/000915.html
[4] - http://mail.openjdk.java.net/pipermail/panama-dev/2018-February/000940.html

Maurizio