analysis of message protocol schemas
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Thu Feb 1 16:15:37 UTC 2018
Hi,
this is a follow up to my analysis on the use of layout descriptions in
FFI support. This time, the focus of the analysis is not FFI but message
protocols. There are a wide variety of frameworks which support
encoding/decoding of messages in an efficient way (both in terms of
footprint of the binary representation and in terms of overhead
associated to encoding/decoding). Such frameworks often include a
_language schema_ that allows programmers to specify what a message
looks like - in that sense, a schema is similar, in spirit, to a layout
description and, as such, it makes sense to cover such schema languages
in this analysis.
Note: as there are many protocol schemas out there, the list below is
not meant to be exhaustive; it is meant to reflect the most popular
choices used at the time this email was written - as such, it's likely
that your favorite choice might be missing from my sampling :-)
1) Protobuf [1]
Protobuf has been for long the de facto standard when it came to message
serialization; it has a schema language [2] that allows programmers to
define the layout of the messages they wish to exchange over the wire,
and it has a compiler that takes this schema definition and outputs a
set of stub classes in any desired target language; this strategy
(schema language + stub compiler) is also adopted by most of the
alternative schemes discussed in this document.
The domain of scalar types that can be modeled is not too surprising;
there are int, double, floats, signed and unsigned. In other words,
choices such as FP-ness and signed-ness are reified in the schema
language, which leads to some interesting consequences when it comes to
encoding.
Structured types are defined in Protobuf as 'messages', and messages can
also be nested inside other messages, to achieve some kind of
encapsulation. Messages can refer to the name of other messages, as a
way to embed structured data.
When it comes to encoding, the scheme is pretty simple: a Protobuf
message is essentially a binary key/value, where the key identifies the
type tag of the message field, and the value represents the payload
associated with such field. Values are encoded in a straightforward
fashion, with few exceptions that will be shown below; numbers are
stored in little-endian format, strings are encoded as variable-length
entities, where a length field is followed by a sequence of payload
bytes, etc.
Encoding of numbers is interesting [3], and part of what makes Protobuf
so compact when it comes to the on-the-wire representation. Integers are
represented using the varint format, where a number is encoded as a
sequence of bytes, whose least significant bit is set if another byte
follows. Bytes of a varint are encoded from least significant group to
most significant group. That means that small numbers don't take up much
space (e.g. 1 byte), and the space requirements grow with the size of
the number to be encoded.
Signed numbers are problematic, because of the 2's complement
representation, which turns even small negative integer numbers into big
'unsigned' numbers. To overcome this problem, Protobuf adds a bunch of
signed types which use a different encoding, called zigzag, where a
signed number is encoded as an unsigned value, according to the
following scheme:
0 is encoded as 0
-1 is encoded as 1
1 is encoded as 2
-2 is encoded as 3
...
This allows to encode small negative ints as small values, for which the
varint representation already shines.
In other words, Protobuf is able to encode messages in a truly compact
way thanks to its variable-sized encodings; however (and this is a point
that is valid for some of the other solutions we discuss in this
analysis), this comes with a cost: encoding and decoding messages is
relatively expensive.
2) Thrift [4]
Similarly to Protobuf, Thrift comes with a stub compiler which generates
a set of stubs given a language schema. The types supported in the
schema [5] are the usual suspects: bool, byte, signed int (of various
sizes), strings. Notably, there's no support for unsigned integers
(which Protobuf optimizes explicitly); compound types such as structs,
lists and map are also there.
One stated goal of Thrift encoding (mentioned in the white-paper [6]) is
for decoding/encoding to be size-independent, so that clients
serialize/deserialize structs bottom up w/o needed to wait for the
entire thing - e.g. in a stream-oriented fashion. This goal will have
some consequences on how the encoding for compound types is defined.
Encoding [7] is straightforward as there no use of variable-sized
encoding schemes (unlike Protobuf). Instead, scalar are encoded as 1,
2, 4, 8 bytes values, depending on the schema type. Structs are
represented as a sequence of fields followed by a 'stop-field' sequence.
Since each field header contains both info on its type and its position
in the overall struct, it is possible to encode/decode structs out of order.
Lists are encoded in a more straightforward way, with a size field
followed by the elements.
Compactness can be achieved by adopting a compact Thrift protocol [8]
which uses compact variants for e.g. small structs, and also
variable-sized encoding (zig-zag + varint, as in Protobuf). As noted for
Protobuf, the more compact the representation, the more overhead is
associated with the encoding and decoding stages.
3) MsgPack [9]
MsgPack belongs to the same family as Protobuf and Thrift; the most
important difference here is that MsgPack doesn't have a toolchain as in
the previous two solutions - that is, instead of using a language schema
and a stub compiler, MsgPack simply relies on an API which is then
implemented in different languages (among which, of course, Java [10]).
The MsgPack specifications [11] define the usual supported schema types:
int, float, array, map; there's less choice of types here (e.g. no
distinction between signed and unsigned), and I believe this is a
deliberate choice meant to reflect the basic types available in JSON.
The encoding defines a mapping between the schema types and the so
called format types, that is, the type to be used for encoding purposes.
Interestingly, format types are organized in families and there are
several ways to encode the same schema type, depending on its size -
e.g. there are some binary tags such as fixint, fixarray, fixstr, which
are used to encode compact/small data, and that's also how compaction is
achieved.
4) Cap'n'proto [12]
The main goal of Cap'n'proto is to be able to achieve negligible
overhead when encoding and decoding messages. As we have seen, the
compaction achieved with certain variable-size schemes comes at a cost:
the serializer/deserializer has to spend time (and memory) in order to
e.g. encode/decode a number into/from a varint. The need to remove
overhead of course results in very different decisions when it comes to
encoding; we attempt to cover some of those decision below.
As we've seen with Protobuf and Thrift, Cap'n'Proto also comes with its
own compiler, which turns a message definition in Cap'n'proto's own
schema language into a set of stub classes in a target language. The set
of types [13] expressible in Cap'n'proto is actually very rich. There
are the usual scalar types: void, bool, int (signed and unsigned),
floating point. List and strings are also among the set of builtin
types. If builtin types are not enough, users can define their own
custom structs. Interestingly, the schema language Cap'n'proto uses
supports generics, so it is possible to define a generic struct, or to
specify the element type of a builtin list.
In Cap'n'proto, a message is essentially encoded [14] as a sequence of
segments, segments can have near pointers (intra segment) or far
pointers (inter-segment) and the assumption is that a segment is loaded
in contiguous region of memory. That guarantees that consistency of
near-pointers is preserved. There's no constraint for different segments
to be loaded in consecutive region of memory, as far-pointers allow to
encode the ID of the target segment, so that it can be dynamically
retrieved.
Encoding of scalar types is unsurprising - there's no use of
variable-size encoding, and, as such, no attempt to reach the same level
of compactness as, say, Protobuf - compactness is achieved with other
means, as we shall see later. The fact that the layout of scalar types
is fixed size is of course a consequence of the cap'n'proto design to
keep encoding/decoding overhead as low as possible.
Structs and list are represented as pointers to some data, where the
pointer contains some info about the pointee (e.g. in case of the struct
where the fields can be found, in case of a list how many elements there
are and where they can be found). This encoding is a bit more convoluted
than in the other solutions we have seen so far, but it has some
advantages, as it doesn't force a struct payload to be stored near where
it is referenced.
Crucially, layout of fields in the 'data' part of a struct is determined
by the Cap'n'proto compiler, following some complex scheme which
respects alignment and other constraints; in a way, it is fair to assume
that the layout of a Cap'n'proto struct is designed to be as close as
possible to the in-memory representation of same struct, so as to
eliminate as much overhead as possible when translating between the two.
Compression is achieved using a compression scheme that allows to get
rid of most zeros from the packet. A byte mask indicates whether the
following bytes are all-zero or not - so there's an extra byte tag for 8
bytes of payload; since most content bytes are actually zero, this works
pretty well in practice, and has a predictable encoding/decoding
performance model.
5) Flatbuffers [15]
Flatbuffers can be thought of as the spiritual successor of Protobuf -
similarly to Cap'n'proto, the main goal of flatbuffer is to reduce
encoding/decoding overhead, and, as we shall see, this is achieved in a
very similar fashion.
Again, there's a schema language and a schema compiler which generates
stubs in a variety of target languages. The types [16] available in the
schema language are the usual selection of scalar types: int (signed,
unsigned), floating point, booleans; there are many sizes available for
each type, so we have byte, short, int, long, in a way that is
reminiscent of Java primitives. Other builtin types include strings,
vectors (which use generic types, as with Cap'n'Proto). Custom types can
be defined using tables and structs, which are essentially a sequence of
name/value pairs. The only difference between tables and structs is that
structs do not allow for optional fields, which, as we shall see, will
greatly impact the encoding strategy.
Encoding of scalar is straightforward, follows little-endian scheme and
uses a number of bytes which is explicit in the schema (e.g. 'byte'
means 8 bits, 'long' means 64 bits). Vectors and strings have size
prepended to them. The encoding scheme for tables is somewhat
reminiscent of Cap'n'proto structs, as tables are stored as pointer to a
so called v-tables which might be elsewhere. The v-table is necessary as
not all fields are present, so a layer of indirection is needed. In
Flatbuffers, since structs cannot have optional fields, this indirection
is not required - which allows structs to be inlined in their parents.
In terms of packing, Flatbuffers encoding doesn't say much - that's
deliberate, as the fact that a struct payload is separated from the
struct's v-table, allows different implementation to e.g. reorder struct
fields and/or use different, more compact representations of them,
depending on the use cases.
This benchmark [18] gives an idea of the difference between Flatbuffers
and Protobuf-like solutions; also, from this document [19] it is
possible to infer that Flatbuffers and Cap'n'proto are equally efficient
when it comes to serialization/deserialization performances.
Summing up, I believe a number of points emerged during this analysis;
first, the closer the on-the-wire representation is w.r.t. the in-memory
representation of the same entity, the lesser the overhead for
encoding/decoding that entity. This suggests that there's indeed a
demand for Panama to cover more than just 'native structs' and to have
its layout descriptions be rich enough to be able to express
'well-behaved' protocol formats (more on that below).
Another important point is that the builtin types supported in these
schemas are very similar to each other; most of the schemas analyzed
here have boolean, ints, floats and strings. Most of the schemas (1, 2,
4 and 5) reified the distinction between signed and unsigned data, and
all of them distinguished between integral values and floating-point
ones, a distinction which is also very important in terms of FFI/ABI
mapping.
Endianness is always strictly specified - which seems to suggest that
whatever layout description Panama decides to support, there should be a
way to speak about the endianness of a given layout substructure.
In terms of on-the-wire representations, there's a clear separations
between two families of frameworks: Protobuf-like frameworks (1, 2, 3)
attempt to reduce the size of the on-the-wire representation as much as
possible by adopting variable-size encodings for numbers and also by
providing compact representations for lists, arrays and structs. In such
cases, it would be desirable for the layout description to reach out at
the sub-byte structure, if for no other reason than to simply annotate
the various bits in a given byte (e.g. a varint group is 7 bits of
payload plus a 1 bit tag which indicates whether more stuff is needed).
Perhaps such a use case could be covered by exploiting bitfields, which
is also common primitive in FFIs.
On the other hand, frameworks such as Cap'n'proto (4) and Flatbuffers
(5), have a much more straightforward mapping between schema and on-the
wire format, as they do not use variable-size encoding for numbers;
however such schemes do need ways to denote references/pointers to other
parts of the layout, so it seems like that's another primitive that a
layout description would require.
But there's also another big distinction between these two families of
protocols; as we've seen, both Cap'n'proto and Flatbuffers have schema
definitions which are much closer to what the in-memory representation
of a packet would look like; not only this minimizes encoding/decoding
overhead, but it also allow for random access into the contents of a
packet. In my opinion, this should allow Panama to model packets
generated by those schemas in a much more natural and attractive way.
Finally, both Cap'n'proto and Flatbuffers support some kind of support
for generic types in schemas, which seem to hint at the fact that layout
descriptions should be parametric in both size (e.g. variable-length
array) and type (e.g. generic struct) - although that's likely to be an
advanced use case.
[1] - https://developers.google.com/protocol-buffers/
[2] - https://developers.google.com/protocol-buffers/docs/proto3
[3] - https://developers.google.com/protocol-buffers/docs/encoding
[4] - https://thrift.apache.org/
[5] - https://thrift.apache.org/docs/types
[6] - https://thrift.apache.org/static/files/thrift-20070401.pdf
[7] -
https://github.com/apache/thrift/blob/master/doc/specs/thrift-binary-protocol.md
[8] -
https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md
[9] - https://msgpack.org/index.html
[10] - https://github.com/msgpack/msgpack-java
[11] - https://github.com/msgpack/msgpack/blob/master/spec.md
[12] - https://capnproto.org
[13] - https://capnproto.org/language.html
[14] - https://capnproto.org/encoding.html
[15] - https://google.github.io/Flatbuffers/
[16] -
https://google.github.io/Flatbuffers/Flatbuffers_guide_writing_schema.html
[17] - https://google.github.io/Flatbuffers/Flatbuffers_internals.html
[18] - https://google.github.io/Flatbuffers/Flatbuffers_benchmarks.html
[19] - https://capnproto.org/news/2014-06-17-capnproto-Flatbuffers-sbe.html
More information about the panama-dev
mailing list