State of the LDL

Fri May 1 15:21:32 UTC 2015

> As an approach, I'd like to suggest that we separate the semantic
> aspects of the layout language from the proposal for a specific
> encoding; it will be better (and easier) to come to agreement on the
> abstract model and its semantics before trying to propose an encoding.
> History has shown that the latter can often get in the way of paying
> enough attention to the former.
>
I think separating the semantic aspects from the encoding is a good idea.
We already have some kind of distinction: the goals (with the exception of
#4) describe semantic aspects of the LDL, and the grammar defines the syntax.
This will be clearer in the next version of the document.

> Separately, it would also be good to separate the concepts (Layout,
> Location, etc) from the implementation strategy (abstract classes.)
>
Agreed, these are two separate discussions. My intention there was to put
out a straw man. We chose to implement our prototype with abstract classes
but we fully expect some debate in this area. We would prefer to discuss
Layout concepts and implementation strategies in a separate thread, but
there is some unavoidable overlap.

> First, some comments on the goals:
>
> > 2. The LD must specify the endianness of the layout. The bit and byte
> > endian must be consistent. Endian is specified at container
> > granularity. A shorthand notation can be provided to specify endian for
> > all containers in a layout.
>
> This "same for bits and bytes" restrictions seems like it would prohibit
> encodings of sequences of bytes encoded in machine-endianness, such as
> variable-length strings encoded with a length field.
>

One of the goals has been to allow portable bytecodes for protocols with a
well-defined endianness (e.g. network packets) while limiting the amount of
ugliness that has to be added to sun.misc.Unsafe in terms of
endian-specific read/write intrinsics. Not supporting native endian
decreases the number of signatures ({big, little, native} * {primitive type
sizes}) that need to be added.
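
To make the "explicit endianness" point concrete, here is a minimal sketch
using the existing java.nio API rather than any proposed Unsafe intrinsic.
The class and method names are hypothetical and only for illustration; the
point is that the caller always states the byte order, so the same code
reads a network packet identically on any host.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration; not part of the LDL proposal or of Unsafe.
final class EndianReadSketch {
    // Reads a 16-bit value at the given offset using an explicitly stated byte order.
    static short readShort(byte[] data, int offset, ByteOrder order) {
        return ByteBuffer.wrap(data).order(order).getShort(offset);
    }

    public static void main(String[] args) {
        byte[] packet = {0x12, 0x34};
        // Same two bytes, interpreted under each explicit order.
        System.out.printf("big:    0x%04x%n", readShort(packet, 0, ByteOrder.BIG_ENDIAN));    // 0x1234
        System.out.printf("little: 0x%04x%n", readShort(packet, 0, ByteOrder.LITTLE_ENDIAN)); // 0x3412
    }
}
```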

We've also viewed the existing native data as the "source of
truth" {offsets, endian, etc} that Java needs to interop with. Given that
explicit endianness is required, we didn't initially see enough value in
native endian. Recently, we've started to come around on native endian for
cases where Java wants to serialize data off-heap and read it back in the
same process without there being some other consumer / user of that data.

That being said, I don't fully understand what this has to do with
variable-length strings or how our endianness specification prohibits
variable-length strings (more on this later). The LDL format allows us to
make a distinction between the length field and the sequence of bytes, e.g.

LD:
VLS, 10, < {
	short, 2, length, //we can specify endian for length and characters separately
	char, 1[8], characters,
}

We do, however, have a restriction that any accessible memory must be
explicitly stated in the LD. This would be the only restriction prohibiting
variable-length strings.
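
As an illustration only, here is roughly how the VLS layout above could be
read by hand with java.nio. This is a sketch under several assumptions that
the LD itself does not settle: that "<" means little-endian, that sizes are
in bytes, and that the characters are single-byte ASCII. The class name and
constants are hypothetical. Note that the read never goes past the 8 bytes
declared in the LD, which is the restriction mentioned above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical hand-written accessor for the VLS layout; not generated code.
final class VlsSketch {
    static final int LENGTH_OFFSET = 0;  // short, 2 bytes
    static final int CHARS_OFFSET = 2;   // char, 1 byte each, 8 elements
    static final int CHARS_CAPACITY = 8;
    static final int LAYOUT_SIZE = 10;

    // Reads the string stored at `base`, refusing lengths beyond the declared storage.
    static String read(ByteBuffer memory, int base) {
        // duplicate() resets the order to big-endian, so set it explicitly (assumed "<").
        ByteBuffer view = memory.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int length = view.getShort(base + LENGTH_OFFSET);
        if (length < 0 || length > CHARS_CAPACITY) {
            throw new IllegalStateException("length outside declared layout: " + length);
        }
        byte[] chars = new byte[length];
        for (int i = 0; i < length; i++) {
            chars[i] = view.get(base + CHARS_OFFSET + i);
        }
        return new String(chars, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        ByteBuffer memory = ByteBuffer.allocate(LAYOUT_SIZE).order(ByteOrder.LITTLE_ENDIAN);
        byte[] text = "hello".getBytes(StandardCharsets.US_ASCII);
        memory.putShort(LENGTH_OFFSET, (short) text.length);
        for (int i = 0; i < text.length; i++) {
            memory.put(CHARS_OFFSET + i, text[i]);
        }
        System.out.println(read(memory, 0)); // hello
    }
}
```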

> Also, this is the first use of "container", which should be defined
> before first use.
>
> > 5. A container is a sequence of one or more adjacent fields.
>
> It seems we've defined fields and containers in terms of each other.  At
> this point, an unfamiliar reader will not have a real understanding of
> either, or why there are two separate concepts.  This should be
> clarified.  It would help to make the motivation for this two-level
> hierarchy more explicit.
>
In our previous discussions it became clear that we need to define rules
regarding atomicity and tearing. The proposed memory model is inspired by
the C++ memory model (http://www.hboehm.info/c++mm/); our concepts of
"container" and "field" are analogous to the C++ "memory location" and
"bit-field". The LDL requires this distinction because it is important to
be able to provide behaviour equivalent to that of native languages. If a
field access in C has a certain behaviour, I should be able to get
equivalent behaviour in Java.
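
A small sketch may help show why the two levels matter. Assuming a 32-bit
container that holds several bit-fields (the helper names are hypothetical,
not part of the proposal), reading a field is one load of the whole
container followed by shift and mask, and writing a field is a
read-modify-write of the whole container. That is why atomicity and tearing
are naturally stated per container rather than per field.

```java
// Hypothetical helpers; a container here is modelled as a plain 32-bit int.
final class BitFieldSketch {
    // Mask with the low `width` bits set (width in 1..32).
    private static int lowMask(int width) {
        if (width < 1 || width > 32) {
            throw new IllegalArgumentException("width must be in 1..32");
        }
        return (width == 32) ? -1 : (1 << width) - 1;
    }

    // Extracts the field occupying `width` bits starting at bit `shift`.
    static int getField(int container, int shift, int width) {
        return (container >>> shift) & lowMask(width);
    }

    // Returns a new container value with the field replaced; other fields are preserved.
    static int setField(int container, int shift, int width, int value) {
        int mask = lowMask(width) << shift;
        return (container & ~mask) | ((value << shift) & mask);
    }

    public static void main(String[] args) {
        int container = 0;
        container = setField(container, 3, 5, 21); // 5-bit field at bits 3..7
        container = setField(container, 0, 3, 7);  // 3-bit field at bits 0..2
        System.out.println(getField(container, 3, 5)); // 21
        System.out.println(getField(container, 0, 3)); // 7
    }
}
```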

> > 6. Default alignment is the size of the largest container in the layout
> > rounded up to 2^n bits. In the case of arrays the container element
> > size is considered.
>
> It feels like arrays are tacked on as an ancillary concern.  I can't
> imagine that this is true?
>
Ideally #6 would be something like this: The default alignment of a Layout
is the largest alignment of all the containers and unions that compose the
Layout. But our proposal does not allow for container alignments since
containers are defined in terms of size and offset. So we need to define
the default alignment in terms of container sizes. This definition works
well but it may be confusing when discussing arrays. For this reason we
feel it necessary to call it out specifically.
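
For what it's worth, the rule in #6 can be sketched in a few lines. This is
only an illustration with hypothetical names; it assumes the caller passes
container sizes in bits and, for an array, passes the element size rather
than the total array size.

```java
// Hypothetical illustration of goal #6; not code from the prototype.
final class DefaultAlignmentSketch {
    // Rounds a positive bit count up to the nearest power of two.
    static int roundUpToPowerOfTwo(int bits) {
        if (bits <= 0 || bits > (1 << 30)) {
            throw new IllegalArgumentException("bits out of range: " + bits);
        }
        int highest = Integer.highestOneBit(bits);
        return (highest == bits) ? bits : highest << 1;
    }

    // Default alignment = largest container size, rounded up to 2^n bits.
    // For arrays, pass the element size, not the whole array size.
    static int defaultAlignmentBits(int... containerSizesInBits) {
        int largest = 0;
        for (int size : containerSizesInBits) {
            largest = Math.max(largest, size);
        }
        return roundUpToPowerOfTwo(largest);
    }

    public static void main(String[] args) {
        // 16-bit length container plus an array of 8-bit elements.
        System.out.println(defaultAlignmentBits(16, 8)); // 16
        // A single 24-bit container rounds up to 32.
        System.out.println(defaultAlignmentBits(24));    // 32
    }
}
```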

> > Type Information Specification:
> > The following describes how native data is associated with Java Types.
> > First we will begin by defining the Base Layout Classes.
> >
> > //Base Layout class, all Layouts subclass this
> > abstract class Layout {
> >    private Location loc;
> > }
>
> Before diving into implementation, it would be useful to motivate these
> two key concepts, Layout and Location.
>
For the purposes of this document we need to mention the existence of a
type called Layout. The specifics and motivation for this type can be
discussed in another thread.

> > 2) Pointer
>
> Pointer or Object Reference?
>
This class represents a native pointer. It lets us take a native field and
get what it points at. A good example is a linked list node.

struct Node {
	uint64_t data;
	struct Node* next;
};

The "dereference" method in Pointer lets me get the next node.
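
To give a feel for it without going into the real Pointer API (which belongs
in the other thread), here is a stand-in sketch. It simulates native memory
with a ByteBuffer and treats a "pointer" as a byte offset into it, with 0
playing the role of NULL. All names here are hypothetical and none of this
is the prototype's actual class.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical stand-in for struct Node; "addresses" are offsets into `heap`.
final class NodePointerSketch {
    static final int DATA_OFFSET = 0;  // uint64_t data
    static final int NEXT_OFFSET = 8;  // struct Node* next
    static final int NODE_SIZE = 16;
    static final long NULL = 0L;       // offset 0 is reserved, so no node lives there

    static void writeNode(ByteBuffer heap, long node, long data, long next) {
        heap.putLong((int) node + DATA_OFFSET, data);
        heap.putLong((int) node + NEXT_OFFSET, next);
    }

    static long data(ByteBuffer heap, long node) {
        return heap.getLong((int) node + DATA_OFFSET);
    }

    // Plays the role of "dereference": follows the next field to the next node.
    static long dereferenceNext(ByteBuffer heap, long node) {
        return heap.getLong((int) node + NEXT_OFFSET);
    }

    // Walks the list from `head` and sums the data fields.
    static long sum(ByteBuffer heap, long head) {
        long total = 0;
        for (long node = head; node != NULL; node = dereferenceNext(heap, node)) {
            total += data(heap, node);
        }
        return total;
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(64).order(ByteOrder.nativeOrder());
        writeNode(heap, 16, 10, 32);   // first node points at the second
        writeNode(heap, 32, 32, NULL); // second node ends the list
        System.out.println(sum(heap, 16)); // 42
    }
}
```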

> > 5) Primitive Arrays
>
> Valhalla will provide the ability to have generics over primitives; I
> think this means that you can merge (5) and (6) into "Array of T", and
> provide base types for each primitive layout.  This should simplify
> things a fair bit.
>
Agreed.

> > Grammar:
>
> To be honest, I am kind of mystified at the design choices for the
> grammar; it seems to be chosen to be both hard to mechanically parse
> *and* hard for humans to read!  I don't want to dwell on bikeshed issues
> like this, so I'll just say that this is definitely something that we're
> going to need to revisit before too much implementation happens.
>
> Perhaps we should take a step back:
>   - Define an abstract model for the layout language, separate from
>     syntax;
>   - Identify some design goals to describe the properties of a desirable
>     syntax.
>
Yes, the grammar is a straw man; we should have made that clearer. We are
not committed to it, but it lets us create examples to discuss. We will
make a distinction between the semantic elements and the syntax in the
revised document.

> >    {(containers | unions)}
>
> The descriptive text doesn't say anything about unions.
>
Unions are composed of containers but with the property that they overlap
with one another. You could think of them as C/C++ unions.
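
As a rough illustration of "overlapping containers", the sketch below models
a C-style union { int32_t i; float f; } over a ByteBuffer. Both members sit
at the same offset, so a write through one is visible through the other. The
names are hypothetical and this is not how the prototype represents unions.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical model of union { int32_t i; float f; }: both members share offset 0.
final class UnionSketch {
    static final int INT_OFFSET = 0;
    static final int FLOAT_OFFSET = 0; // same offset, so the two containers overlap
    static final int UNION_SIZE = 4;

    static void putFloat(ByteBuffer memory, float value) {
        memory.putFloat(FLOAT_OFFSET, value);
    }

    static int getInt(ByteBuffer memory) {
        return memory.getInt(INT_OFFSET);
    }

    public static void main(String[] args) {
        ByteBuffer memory = ByteBuffer.allocate(UNION_SIZE).order(ByteOrder.BIG_ENDIAN);
        putFloat(memory, 1.0f);
        // Reading the int member exposes the IEEE 754 bit pattern of 1.0f.
        System.out.printf("0x%08x%n", getInt(memory)); // 0x3f800000
    }
}
```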

> The other thing I don't see in the grammar is any way of encoding
> variable-length arrays with the length field embedded as a field.  This
> means that layouts cannot describe embedded strings or other repeating
> data, which is common (ASN.1, protocol buffers.)
>
Yes, this was purposefully left out. There are some concerns regarding
security, but it is an interesting feature and it is something we need to
address. We cannot encode all possible access patterns in a description of
a memory layout; there will be many features that people want, and we will
not be able to support all of them. We see that there is great interest in
this feature, so it is worth including in our future discussions.
We do, however, need to provide a mechanism that allows users to attach
user-defined behaviour to a generated Layout so that they can implement
their own access patterns.

We plan on updating the "State of the LDL" and should have the next version
out shortly.

Thanks
-Tobi