State of the LDL

Fri Apr 24 17:14:18 UTC 2015

State of the Layout Descriptor Language

This is a proposal to specify a layout descriptor language (LDL) for the
Java Virtual Machine. The contents of this proposal are based on the LDL
discussions from the panama-spec-experts mailing list. The purpose of this
proposal is to summarize the discussions in the mailing list and to provide
a general outline for the future direction of this work.

Background:
Project Panama proposes to create a new standard way to connect the JVM and
foreign (non-Java) APIs. The project will include (but is not limited to)
the following:
- native (JNR-like) function calling from the JVM to C/C++
- native layouts within or outside the JVM heap
- native data descriptor for the JVM (LDL)
- header file extraction tools (Groveler)
- native-oriented JIT optimizations
- tooling or wrapper interposition for safety

Introduction:
Our goal is to create a new way to interface the JVM with native APIs. To
achieve this we will need to come up with a standard way to describe native
data. Different languages have different ways of describing data (COBOL
comp-5 vs C int) and different platforms interpret the same native types
differently (C/C++ x86-32 long vs x86-64 long). There is a need for a
language that can describe native data in a general way regardless of
platform or language. For these reasons we have come up with the layout
descriptor language. The main goals of the language are the following:
1) describe the data layout
	- location/offset
	- size
	- endianness
	- alignment

2) describe how the data can be accessed
	- memory model defining the rules on atomicity/tearing

3) attach type information to the data
	- describe what types the JVM must use to interpret the data

Data Layout and memory model specification:
0. Goals: where layouts are actually invariant across platforms (e.g.,
network protocols) we want to have just one layout specification.

1. The LD specification must be well defined. This means that JDKs cannot
infer alignment or padding differently.

2. The LD must specify the endianness of the layout. Bit and byte
endianness must be consistent. Endianness is specified at container
granularity. A shorthand notation can be provided to specify the endianness
of all containers in a layout.
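
As an illustration of container-granularity endianness, the following is a
minimal sketch of how an accessor might load one 16-bit container. It uses
java.nio.ByteBuffer as a stand-in for the access mechanism, which this
proposal does not specify; EndianSketch and loadContainer16 are hypothetical
names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of the proposal.
final class EndianSketch {
    // Loads the 16-bit container at byteOffset using the byte order
    // declared for that container, and returns it as an unsigned value.
    static int loadContainer16(byte[] data, int byteOffset, ByteOrder order) {
        // getShort(int) is an absolute read; the mask removes sign extension.
        return ByteBuffer.wrap(data).order(order).getShort(byteOffset) & 0xFFFF;
    }
}
```

The same two bytes yield different container values depending on the
declared order, which is why the LD has to state it explicitly.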

3. A field is a contiguous sequence of bits confined to a container. An
update to a field may overwrite contents of other fields within the same
container. If the enclosing container is marked as "atomic", another thread
cannot observe the value of fields in the container before the update is
complete. A field cannot be larger than its enclosing container. The
sizes of fields within a container must add up to the size of the
container. A field does not require a name. Accessors are only generated
for named fields.

4. Unlike C bitfield numbering (which varies based on endianness of target
platform) we'll always number fields in little-endian order; that is, "byte
a:1, b:7" (at address x) would be extracted with the expressions "loadByte
(x) & 1" and "loadByte(x) >> 1", respectively. This is LE-centric, but has
the nice property that bit 0 of a byte, short, int, or long is extracted
with the same operation (x & 1), and so on.
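
The numbering rule can be sketched in Java as follows. This is only an
illustration, assuming the container has already been loaded into a long
and that offset + width does not exceed 64; FieldBits, extract, and insert
are hypothetical names.

```java
// Hypothetical helper, not part of the proposal.
final class FieldBits {
    // Extracts `width` bits starting at bit `offset`, where bit 0 is the
    // least significant bit of the container.
    static long extract(long container, int offset, int width) {
        long mask = (width == 64) ? -1L : (1L << width) - 1;
        return (container >>> offset) & mask;
    }

    // Returns a copy of `container` with the field replaced by `value`.
    // Bits outside the field are preserved; excess bits of `value` are dropped.
    static long insert(long container, int offset, int width, long value) {
        long mask = ((width == 64) ? -1L : (1L << width) - 1) << offset;
        return (container & ~mask) | ((value << offset) & mask);
    }
}
```

For "byte a:1, b:7", field a is extract(c, 0, 1) and field b is
extract(c, 1, 7). Because a Java byte is signed, an accessor would widen the
loaded container with (b & 0xFF) before shifting. Writing a field is a
read-modify-write of the whole container, which is why rule 3 ties atomicity
to the container rather than to the field.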

5. A container is a sequence of one or more adjacent fields. Changing a
field in a container will not change fields of other containers. Container
sizes must be a multiple of 8bits. There is no upper limit to the size of a
container. A container does not require a name. A container can not be
larger than the enclosing layout. The sizes of all the containers in a
layout must add up to the size of the enclosing layout. Accessors are only
generated for named containers.
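
The size rules for containers can be checked mechanically. The following is
a minimal sketch, assuming all sizes are given in bits; ContainerRules and
isValid are hypothetical names.

```java
// Hypothetical validator, not part of the proposal.
final class ContainerRules {
    // Returns true if every container is a positive multiple of 8 bits and
    // the containers together exactly fill the enclosing layout.
    static boolean isValid(int layoutSizeBits, int[] containerSizesBits) {
        long total = 0;
        for (int size : containerSizesBits) {
            if (size <= 0 || size % 8 != 0) {
                return false; // not a whole number of bytes
            }
            total += size;
        }
        return total == layoutSizeBits;
    }
}
```

A tool that produces or consumes LDs could run a check of this kind before
generating accessors.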

6. Default alignment is the size of the largest container in the layout,
rounded up to the next power of two (2^n bits). For array containers, the
element size is used rather than the total array size.
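
One possible reading of the default alignment rule, as a sketch. It assumes
sizes are positive, given in bits, and small enough that rounding up does
not overflow an int; AlignmentSketch and defaultAlignmentBits are
hypothetical names.

```java
// Hypothetical helper, not part of the proposal.
final class AlignmentSketch {
    // Each entry is a container size in bits. For an array container the
    // caller passes the element size, not the total array size.
    static int defaultAlignmentBits(int[] containerSizesBits) {
        int largest = 0;
        for (int size : containerSizesBits) {
            largest = Math.max(largest, size);
        }
        // Round up to the next power of two (already a power of two: keep).
        int highest = Integer.highestOneBit(largest);
        return (highest == largest) ? largest : highest << 1;
    }
}
```

Under this reading, a layout with a 32-bit and a 64-bit container has a
default alignment of 64 bits, and a layout whose largest container is 24
bits has a default alignment of 32 bits.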

Type Information Specification:
The following describes how native data is associated with Java types.
We begin by defining the base Layout classes.

//Base Layout class, all Layouts subclass this
abstract class Layout {
	private Location loc;
}

//pointer to native data
class Location {
	//unsafe addressing
	private final byte[] data;
	private final long offset;
}

The following are the types that can be used to represent native data in
the JVM:
1) Primitive Java types
- boolean
- byte
- char
- short
- int
- long
- float
- double

2) Pointer
abstract class Pointer<T> extends Layout {
	abstract T dereference();
	abstract void set(Pointer<T> ptr);
}

3) Raw
abstract class Raw extends Layout {
	abstract ByteBuffer getValue(); //java.nio.ByteBuffer
}

4) User Defined Layouts
//These are the generated Layouts
abstract class UserDefinedLayout extends Layout { }

5) Primitive Arrays
abstract class ByteArray1D extends Layout { }
abstract class ByteArray2D extends Layout { }
...
abstract class CharArray1D extends Layout { }
abstract class CharArray2D extends Layout { }

6) Layout Arrays
abstract class Array1D<T extends Layout> extends Layout {
	int elementSize;
	int numOfElements;
	abstract T getAt(int i);
	abstract void setAt(int i, T val);
}

abstract class Array2D<T extends Layout> extends Layout {
	int elementSize;
	int numOfRows;
	int numOfColumns;
	abstract T getAt(int row, int col);
	abstract void setAt(int row, int col, T val);
}

The following are the rules for associating type information with Layouts.
These rules are in addition to the rules defined in "Data Layout and memory
model specification".
1) Type information only represents the Java type that is returned, not how
the container is loaded or stored.

2) Each named container must have a type associated with it, where a type
can be any of the types listed above (1 to 6).

3) If a container is of type Pointer or Raw, or is a nested type, it cannot
have fields.

4) The type of a nested layout is itself. In other words, accessors must
return the groveler-generated layout for nested types.

Grammar:
for layouts:
layoutName','size','[endianness','][alignment','] //endianness: < is LE, > is BE
['{'
	{(containers | unions)}
'}']

for unions:
'U:'unionSize [unionName]
['{'
	{containers}
'},']

for containers:
[endianness,] ([typeInfo,] containerSize, [containerName,] | layoutName)
{'[' numOfElements ']'}
['{'
	{fields}
'},']

for fields:
fieldSize, [fieldName,]

for typeInfo:
- represents the Java type that accessors can return for a specific
container/field
- specified in "Type Information Specification"

for layoutName:
- use Java type signatures
	- e.g. Ljava/lang/String;

Examples:
1) Basic example

The following is a basic structure with two fields 'x' and 'y'.

struct A {
	uint16_t x;
	uint16_t y;
};

This structure produces the following little-endian layout.

LD:

LA;, 32, < { //<-- little endian
	int, 16, x,
	int, 16, y,
}
-------------------------------------------------------
2) Array Example

The following example shows how a structure of arrays can be described in a
layout.

struct SOA {
	uint8_t a[10];
	uint16_t b[10][10]; //2-d array
};

LD:

LSOA;, 1680, < {
	int, 8[10], a,
	int, 16[10][10], b,
}
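
To make the array layout concrete, the following sketch computes the bit
offset of individual elements of SOA. It assumes C row-major ordering for
the 2-d array and no padding between a and b (b starts at byte 10, which
already satisfies 2-byte alignment); SoaOffsets and its members are
hypothetical names.

```java
// Hypothetical helper, not part of the proposal.
final class SoaOffsets {
    static final int A_ELEM_BITS = 8;
    static final int A_LEN = 10;
    static final int B_ELEM_BITS = 16;
    static final int B_ROWS = 10;
    static final int B_COLS = 10;
    // b starts immediately after the last element of a.
    static final int B_BASE_BITS = A_ELEM_BITS * A_LEN;

    // Bit offset of a[i] from the start of the layout.
    static int offsetOfA(int i) {
        return i * A_ELEM_BITS;
    }

    // Bit offset of b[row][col], using row-major order.
    static int offsetOfB(int row, int col) {
        return B_BASE_BITS + (row * B_COLS + col) * B_ELEM_BITS;
    }

    // Total size of the layout in bits.
    static int totalBits() {
        return B_BASE_BITS + B_ROWS * B_COLS * B_ELEM_BITS;
    }
}
```

With these numbers the two arrays occupy 1680 bits (210 bytes) in total.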

-------------------------------------------------------
3) IP Header Example (bit fields)
http://en.wikipedia.org/wiki/IPv4#Header

The next example shows the layout of an IPv4 header.

LD:

LIPv4;, 160, > { //<-- big endian
	byte, 8, {
		4, ihl,
		4, version,
	},
	byte, 8, {
		2, ECN,
		6, DSCP,
	},
	short, 16, totLen,
	short, 16, iden,
	short, 16, {
		13, fragOff,
		3, flags,
	},
	byte, 8, TTL,
	byte, 8, Proto,
	short, 16, Checksum,
	int, 32, srcAddr,
	int, 32, destAddr,
	int, 32, options,
}

The following is the memory layout of the first 4 bytes:
0: version 7 - 4, ihl 3 - 0
1: DSCP 7 - 2, ECN 1 - 0
2: totLen 15 - 8
3: totLen 7 - 0
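
The accessors for these first four bytes might look like the following
sketch. It follows the field widths in the LD (ihl and ECN in the low bits
of their containers, version and DSCP in the high bits) and reads totLen as
a big-endian 16-bit container. Ipv4FirstWord and its methods are
hypothetical names, and a plain byte[] stands in for the Location.

```java
// Hypothetical accessors, not part of the proposal.
final class Ipv4FirstWord {
    // Byte 0: version in bits 7-4, ihl in bits 3-0.
    static int version(byte[] h) { return (h[0] & 0xFF) >>> 4; }
    static int ihl(byte[] h)     { return h[0] & 0x0F; }

    // Byte 1: DSCP in bits 7-2, ECN in bits 1-0.
    static int dscp(byte[] h)    { return (h[1] & 0xFF) >>> 2; }
    static int ecn(byte[] h)     { return h[1] & 0x03; }

    // Bytes 2-3: big-endian, so the high byte comes first.
    static int totLen(byte[] h)  { return ((h[2] & 0xFF) << 8) | (h[3] & 0xFF); }
}
```

For the bytes 0x45 0xB8 0x00 0x54 this gives version 4, ihl 5, DSCP 46,
ECN 0, and totLen 84.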

----------------------------------------------
4) Nested Struct
struct A {
	uint32_t x;
	uint32_t y;
};

struct B {
	struct A xy;
	uint32_t z;
};

LD:

LA;, 64, < {
	int, 32, x,
	int, 32, y,
}

LB;, 96, < {
	LA;, xy, //<-- A is nested in B
	int, 32, z,
}
---------------------------------------------
5) UDP Packet (nesting example 2)
http://en.wikipedia.org/wiki/User_Datagram_Protocol

LD:
LUDPPacket;, 224, > {
	LIPv4;, ipHeader, //<-- nested struct
	short, 16, srcPort,
	short, 16, destPort,
	short, 16, length,
	short, 16, checksum,
}
------------------------------------------------
6) Implicit padding example

On a 64-bit machine the compiler would add 32 bits of padding between the
two fields shown in the following structure.

struct A {
	uint32_t x;
	uint64_t y;
};

This structure would produce the following Layout:

LD:
LA;, 128, < {
	int, 32, x,
	32, //<-- unnamed containers are used to represent padding
	long, 64, y,
}
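
The padding in this example can be reproduced with a small calculation of
the kind a groveler might perform. The sketch assumes every member is
aligned to its own size, as is typical for the common 64-bit C ABIs, and
that sizes are positive; real compilers and ABIs may differ, which is why
the LD records the padding explicitly instead of leaving it to be inferred.
PaddingSketch and offsets are hypothetical names.

```java
// Hypothetical helper, not part of the proposal.
final class PaddingSketch {
    // Returns the bit offset of each member, given member sizes in bits,
    // aligning each member to a multiple of its own size.
    static int[] offsets(int[] sizesBits) {
        int[] result = new int[sizesBits.length];
        int cursor = 0;
        for (int i = 0; i < sizesBits.length; i++) {
            int align = sizesBits[i];
            // Round the cursor up to the next multiple of the alignment.
            cursor = (cursor + align - 1) / align * align;
            result[i] = cursor;
            cursor += sizesBits[i];
        }
        return result;
    }
}
```

For sizes {32, 64} the offsets are {0, 64}, so bits 32 to 63 are padding.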
-----------------------------------------------

Future Work:
- Finalize the "Data Layout and memory model specification". There has been
a lot of progress on this in the mailing list discussions, and we are close
to a resolution.
- Begin discussions on the "Type Information Specification". This proposal
offers a starting point.
- Open discussions on security.

Discussion points:
- abstract classes vs interfaces
There are benefits to both approaches. Abstract classes give us more
flexibility in designing the base Layout class. The Layout class can own
the Location object as well as potential security objects that we may add.
- off-heap only vs on-heap/off-heap
The ability to access flattened data on-heap is useful, but does this cross
over to Valhalla territory?