analysis of message protocol schemas

Thu Feb 1 16:15:37 UTC 2018

Hi,
this is a follow up to my analysis on the use of layout descriptions in 
FFI support. This time, the focus of the analysis is not FFI but message 
protocols. There are a wide variety of frameworks which support 
encoding/decoding of messages in an efficient way (both in terms of 
footprint of the binary representation and in terms of overhead 
associated to encoding/decoding). Such frameworks often include a 
_language schema_ that allows programmers to specify what a message 
looks like - in that sense, a schema is similar, in spirit, to a layout 
description and, as such, it makes sense to cover such schema languages 
in this analysis.

Note: as there are many protocol schemas out there, the list below is 
not meant to be exhaustive; it is meant to reflect the most popular 
choices used at the time this email was written - as such, it's likely 
that your favorite choice might be missing from my sampling :-)

1) Protobuf [1]

Protobuf has been for long the de facto standard when it came to message 
serialization; it has a schema language [2] that allows programmers to 
define the layout of the messages they wish to exchange over the wire, 
and it has a compiler that takes this schema definition and outputs a 
set of stub classes in any desired target language; this strategy 
(schema language + stub compiler) is also adopted by most of the 
alternative schemes discussed in this document.

The domain of scalar types that can be modeled is not too surprising; 
there are int, double, floats, signed and unsigned. In other words, 
choices such as FP-ness and signed-ness are reified in the schema 
language, which leads to some interesting consequences when it comes to 
encoding.

Structured types are defined in Protobuf as 'messages', and messages can 
also be nested inside other messages, to achieve some kind of 
encapsulation. Messages can refer to the name of other messages, as a 
way to embed structured data.

When it comes to encoding, the scheme is pretty simple: a Protobuf 
message is essentially a binary key/value, where the key identifies the 
type tag of the message field, and the value represents the payload 
associated with such field. Values are encoded in a straightforward 
fashion, with few exceptions that will be shown below; numbers are 
stored in little-endian format, strings are encoded as variable-length 
entities, where a length field is followed by a sequence of payload 
bytes, etc.

Encoding of numbers is interesting [3], and part of what makes Protobuf 
so compact when it comes to the on-the-wire representation. Integers are 
represented using the varint format, where a number is encoded as a 
sequence of bytes, whose least significant bit is set if another byte 
follows. Bytes of a varint are encoded from least significant group to 
most significant group. That means that small numbers don't take up much 
space (e.g. 1 byte), and the space requirements grow with the size of 
the number to be encoded.

Signed numbers are problematic, because of the 2's complement 
representation, which turns even small negative integer numbers into big 
'unsigned' numbers. To overcome this problem, Protobuf adds a bunch of 
signed types which use a different encoding, called zigzag, where a 
signed number is encoded as an unsigned value, according to the 
following scheme:

0 is encoded as 0
-1 is encoded as 1
1 is encoded as 2
-2 is encoded as 3
...

This allows to encode small negative ints as small values, for which the 
varint representation already shines.

In other words, Protobuf is able to encode messages in a truly compact 
way thanks to its variable-sized encodings; however (and this is a point 
that is valid for some of the other solutions we discuss in this 
analysis), this comes with a cost: encoding and decoding messages is 
relatively expensive.

2) Thrift [4]

Similarly to Protobuf, Thrift comes with a stub compiler which generates 
a set of stubs given a language schema. The types supported in the 
schema [5] are the usual suspects: bool, byte, signed int (of various 
sizes), strings. Notably, there's no support for unsigned integers 
(which Protobuf optimizes explicitly); compound types such as structs, 
lists and map are also there.

One stated goal of Thrift encoding (mentioned in the white-paper [6]) is 
for decoding/encoding to be size-independent, so that clients 
serialize/deserialize structs bottom up w/o needed to wait for the 
entire thing - e.g. in a stream-oriented fashion. This goal will have 
some consequences on how the encoding for compound types is defined.

Encoding [7] is straightforward as there no use of variable-sized 
encoding schemes (unlike Protobuf).  Instead, scalar are encoded as 1, 
2, 4, 8 bytes values, depending on the schema type. Structs are 
represented as a sequence of fields followed by a 'stop-field' sequence. 
Since each field header contains both info on its type and its position 
in the overall struct, it is possible to encode/decode structs out of order.

Lists are encoded in a more straightforward way, with a size field 
followed by the elements.

Compactness can be achieved by adopting a compact Thrift protocol [8] 
which uses compact variants for e.g. small structs, and also 
variable-sized encoding (zig-zag + varint, as in Protobuf). As noted for 
Protobuf, the more compact the representation, the more overhead is 
associated with the encoding and decoding stages.

3) MsgPack [9]

MsgPack belongs to the same family as Protobuf and Thrift; the most 
important difference here is that MsgPack doesn't have a toolchain as in 
the previous two solutions - that is, instead of using a language schema 
and a stub compiler, MsgPack simply relies on an API which is then 
implemented in different languages (among which, of course, Java [10]).

The MsgPack specifications [11] define the usual supported schema types: 
int, float, array, map; there's less choice of types here (e.g. no 
distinction between signed and unsigned), and I believe this is a 
deliberate choice meant to reflect the basic types available in JSON.

The encoding defines a mapping between the schema types and the so 
called format types, that is, the type to be used for encoding purposes. 
Interestingly, format types are organized in families and there are 
several ways to encode the same schema type, depending on its size - 
e.g. there are some binary tags such as fixint, fixarray, fixstr, which 
are used to encode compact/small data, and that's also how compaction is 
achieved.

4) Cap'n'proto [12]

The main goal of Cap'n'proto is to be able to achieve negligible 
overhead when encoding and decoding messages. As we have seen, the 
compaction achieved with certain variable-size schemes comes at a cost: 
the serializer/deserializer has to spend time (and memory) in order to 
e.g. encode/decode a number into/from a varint. The need to remove 
overhead of course results in very different decisions when it comes to 
encoding; we attempt to cover some of those decision below.

As we've seen with Protobuf and Thrift, Cap'n'Proto also comes with its 
own compiler, which turns a message definition in Cap'n'proto's own 
schema language into a set of stub classes in a target language. The set 
of types [13] expressible in Cap'n'proto is actually very rich. There 
are the usual scalar types: void, bool, int (signed and unsigned),  
floating point. List and strings are also among the set of builtin 
types. If builtin types are not enough, users can define their own 
custom structs. Interestingly, the schema language Cap'n'proto uses 
supports generics, so it is possible to define a generic struct, or to 
specify the element type of a builtin list.

In Cap'n'proto, a message is essentially encoded [14] as a sequence of 
segments, segments can have near pointers (intra segment) or far 
pointers (inter-segment) and the assumption is that a segment is loaded 
in contiguous region of memory. That guarantees that consistency of 
near-pointers is preserved. There's no constraint for different segments 
to be loaded in consecutive region of memory, as far-pointers allow to 
encode the ID of the target segment, so that it can be dynamically 
retrieved.

Encoding of scalar types is unsurprising - there's no use of 
variable-size encoding, and, as such, no attempt to reach the same level 
of compactness as, say, Protobuf - compactness is achieved with other 
means, as we shall see later. The fact that the layout of scalar types 
is fixed size is of course a consequence of the cap'n'proto design to 
keep encoding/decoding overhead as low as possible.

Structs and list are represented as pointers to some data, where the 
pointer contains some info about the pointee (e.g. in case of the struct 
where the fields can be found, in case of a list how many elements there 
are and where they can be found). This encoding is a bit more convoluted 
than in the other solutions we have seen so far, but it has some 
advantages, as it doesn't force a struct payload to be stored near where 
it is referenced.

Crucially, layout of fields in the 'data' part of a struct is determined 
by the Cap'n'proto compiler, following some complex scheme which 
respects alignment and other constraints; in a way, it is fair to assume 
that the layout of a Cap'n'proto struct is designed to be as close as 
possible to the in-memory representation of same struct, so as to 
eliminate as much overhead as possible when translating between the two.

Compression is achieved using a compression scheme that allows to get 
rid of most zeros from the packet. A byte mask indicates whether the 
following bytes are all-zero or not - so there's an extra byte tag for 8 
bytes of payload; since most content bytes are actually zero, this works 
pretty well in practice, and has a predictable encoding/decoding 
performance model.

5) Flatbuffers [15]

Flatbuffers can be thought of as the spiritual successor of Protobuf - 
similarly to Cap'n'proto, the main goal of flatbuffer is to reduce 
encoding/decoding overhead, and, as we shall see, this is achieved in a 
very similar fashion.

Again, there's a schema language and a schema compiler which generates 
stubs in a variety of target languages. The types [16] available in the 
schema language are the usual selection of scalar types: int (signed, 
unsigned), floating point, booleans; there are many sizes available for 
each type, so we have byte, short, int, long, in a way that is 
reminiscent of Java primitives. Other builtin types include strings, 
vectors (which use generic types, as with Cap'n'Proto). Custom types can 
be defined using tables and structs, which are essentially a sequence of 
name/value pairs. The only difference between tables and structs is that 
structs do not allow for optional fields, which, as we shall see, will 
greatly impact the encoding strategy.

Encoding of scalar is straightforward, follows little-endian scheme and 
uses a number of bytes which is explicit in the schema (e.g. 'byte' 
means 8 bits, 'long' means 64 bits). Vectors and strings have size 
prepended to them. The encoding scheme for tables is somewhat 
reminiscent of Cap'n'proto structs, as tables are stored as pointer to a 
so called v-tables which might be elsewhere. The v-table is necessary as 
not all fields are present, so a layer of indirection is needed. In 
Flatbuffers, since structs cannot have optional fields, this indirection 
is not required - which allows structs to be inlined in their parents.

In terms of packing, Flatbuffers encoding doesn't say much - that's 
deliberate, as the fact that a struct payload is separated from the 
struct's v-table, allows different implementation to e.g. reorder struct 
fields and/or use different, more compact representations of them, 
depending on the use cases.

This benchmark [18] gives an idea of the difference between Flatbuffers 
and Protobuf-like solutions; also, from this document [19] it is 
possible to infer that Flatbuffers and Cap'n'proto are equally efficient 
when it comes to serialization/deserialization performances.

Summing up, I believe a number of points emerged during this analysis; 
first, the closer the on-the-wire representation is w.r.t. the in-memory 
representation of the same entity, the lesser the overhead for 
encoding/decoding that entity. This suggests that there's indeed a 
demand for Panama to cover more than just 'native structs' and to have 
its layout descriptions be rich enough to be able to express 
'well-behaved' protocol formats (more on that below).

Another important point is that the builtin types supported in these 
schemas are very similar to each other; most of the schemas analyzed 
here have boolean, ints, floats and strings. Most of the schemas (1, 2, 
4 and 5) reified the distinction between signed and unsigned data, and 
all of them distinguished between integral values and floating-point 
ones, a distinction which is also very important in terms of FFI/ABI 
mapping.

Endianness is always strictly specified - which seems to suggest that 
whatever layout description Panama decides to support, there should be a 
way to speak about the endianness of a given layout substructure.

In terms of on-the-wire representations, there's a clear separations 
between two families of frameworks: Protobuf-like frameworks (1, 2, 3) 
attempt to reduce the size of the on-the-wire representation as much as 
possible by adopting variable-size encodings for numbers and also by 
providing compact representations for lists, arrays and structs. In such 
cases, it would be desirable for the layout description to reach out at 
the sub-byte structure, if for no other reason than to simply annotate 
the various bits in a given byte (e.g. a varint group is 7 bits of 
payload plus a 1 bit tag which indicates whether more stuff is needed). 
Perhaps such a use case could be covered by exploiting bitfields, which 
is also common primitive in FFIs.

On the other hand, frameworks such as Cap'n'proto (4) and Flatbuffers 
(5), have a much more straightforward mapping between schema and on-the 
wire format, as they do not use variable-size encoding for numbers; 
however such schemes do need ways to denote references/pointers to other 
parts of the layout, so it seems like that's another primitive that a 
layout description would require.

But there's also another big distinction between these two families of 
protocols; as we've seen, both Cap'n'proto and Flatbuffers have schema 
definitions which are much closer to what the in-memory representation 
of a packet would look like; not only this minimizes encoding/decoding 
overhead, but it also allow for random access into the contents of a 
packet. In my opinion, this should allow Panama to model packets 
generated by those schemas in a much more natural and attractive way.

Finally, both Cap'n'proto and Flatbuffers support some kind of support 
for generic types in schemas, which seem to hint at the fact that layout 
descriptions should be parametric in both size (e.g. variable-length 
array) and type (e.g. generic struct) - although that's likely to be an 
advanced use case.

[1] - https://developers.google.com/protocol-buffers/
[2] - https://developers.google.com/protocol-buffers/docs/proto3
[3] - https://developers.google.com/protocol-buffers/docs/encoding
[4] - https://thrift.apache.org/
[5] - https://thrift.apache.org/docs/types
[6] - https://thrift.apache.org/static/files/thrift-20070401.pdf
[7] - 
https://github.com/apache/thrift/blob/master/doc/specs/thrift-binary-protocol.md
[8] - 
https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md
[9] - https://msgpack.org/index.html
[10] - https://github.com/msgpack/msgpack-java
[11] - https://github.com/msgpack/msgpack/blob/master/spec.md
[12] - https://capnproto.org
[13] - https://capnproto.org/language.html
[14] - https://capnproto.org/encoding.html
[15] - https://google.github.io/Flatbuffers/
[16] - 
https://google.github.io/Flatbuffers/Flatbuffers_guide_writing_schema.html
[17] - https://google.github.io/Flatbuffers/Flatbuffers_internals.html
[18] - https://google.github.io/Flatbuffers/Flatbuffers_benchmarks.html
[19] - https://capnproto.org/news/2014-06-17-capnproto-Flatbuffers-sbe.html