Implementing Towards Better PEP/Serialization

David Ryan david at livemedia.com.au
Fri Dec 11 12:18:12 UTC 2020


Over the past few weeks, I've made some good progress on an end-to-end
serialization implementation. I'd be really interested in feedback from the
list.

After success with the PEP library and the JSON front end, I wanted to try
combining a few ideas from my past and a few new ideas. Combined, it's
probably a few too many ideas, but it has been an interesting set of
problems. I decided that my initial aim/focus is to build a serialization
library and format that uses PEP and functionally behaves in a similar way
to Java's internal serialization format. My base test case has been the
following:

public class Point {
    private final float latitude;
    private final float longitude;

    @Data
    public Point(float latitude, float longitude) {
        this.latitude = latitude;
        this.longitude = longitude;
    }

    public float latitude() { return latitude; }
    public float longitude() { return longitude; }
}

Point p1 = new Point(-37.2333f, 144.45f);
byte[] buffer = new byte[150];

TypeOutputStream out = new TypeOutputStream(buffer);
out.writeObject(p1);
out.close();

TypeInputStream in = new TypeInputStream(buffer);
Point p2 = in.readObject();

It is important that the Point is written to the stream without any prior
configuration other than the @Data annotation, and more important still
that the Point is read from the stream without any prior knowledge of its
contents. There's a lot going on under the hood to get this working. I've
split/decomplected the library into the following parts:

PEP - Conceptually, PEP is the Java language/data interface. Its task is to
provide a consistent interface from which to interact with the Java
objects. This is a narrowly focused reflection interface that makes all
data look the same.
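As a concrete illustration, here is a rough sketch of what such a narrow, uniform reflection interface might look like. Everything here (the names, the hand-built descriptor arrays) is my own assumption for illustration, not the actual PEP API:

```java
import java.lang.invoke.*;

// Hypothetical sketch, not the actual PEP API: a narrow reflection-style
// view that exposes a data class as an ordered list of named fields plus a
// constructor handle, so the rest of the pipeline never touches
// java.lang.reflect directly.
public class PepSketch {

    public static class Point {
        private final float latitude;
        private final float longitude;
        public Point(float latitude, float longitude) {
            this.latitude = latitude;
            this.longitude = longitude;
        }
        public float latitude() { return latitude; }
        public float longitude() { return longitude; }
    }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();

        // One descriptor per field: a name and an accessor handle.
        String[] names = { "latitude", "longitude" };
        MethodHandle[] accessors = {
            lookup.findVirtual(Point.class, "latitude",
                    MethodType.methodType(float.class)),
            lookup.findVirtual(Point.class, "longitude",
                    MethodType.methodType(float.class))
        };
        MethodHandle ctor = lookup.findConstructor(
                Point.class,
                MethodType.methodType(void.class, float.class, float.class));

        // A serializer only needs this uniform view, whatever the class is.
        Point p = (Point) ctor.invoke(-37.2333f, 144.45f);
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + "=" + accessors[i].invoke(p));
        }
    }
}
```

The point of the sketch is that once every class is reduced to names, accessor handles and a constructor handle, the schema and encoder layers can treat all data identically.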

Schema - I've created a format-independent schema data structure that uses
its own structures to define itself (very meta). It uses the PEP interface
to generate the schema of the objects, and the schema itself is encoded,
via PEP, using the same format as the data. This allows the structure of
the data to be encoded in the stream alongside the data, in a similar way
to Java serialization. In the future, the schema could also be referenced
externally or agreed dynamically between two hosts.
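For illustration, a minimal schema structure along these lines might look like the following (hypothetical names, not the actual schema library). The "very meta" property is that TypeDef and FieldDef are themselves plain data classes, so the same machinery that encodes Point can encode the schema:

```java
import java.util.List;

// Hypothetical sketch (not the litterat schema API): a minimal
// format-independent schema structure. Fields reference types by name
// rather than by Java Class, keeping the schema independent of the language.
public class SchemaSketch {

    public static final class FieldDef {
        final String name;
        final String type;
        public FieldDef(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    public static final class TypeDef {
        final String namespace;
        final String name;
        final List<FieldDef> fields;
        public TypeDef(String namespace, String name, List<FieldDef> fields) {
            this.namespace = namespace;
            this.name = name;
            this.fields = fields;
        }
    }

    public static void main(String[] args) {
        // The schema a PEP walk over Point might produce.
        TypeDef point = new TypeDef("geo", "wsg84point", List.of(
                new FieldDef("latitude", "float"),
                new FieldDef("longitude", "float")));
        System.out.println(point.namespace + "." + point.name
                + " " + point.fields.size());
    }
}
```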

Interpreted Language - From the Schema data structure, a simple language
tree (lambda) is generated for both reading and writing data (the lambdas
can also be encoded to file). The tree is then 'compiled' to an interpreted
tree (there's also a partially implemented MethodHandle option). This part
is particularly experimental, but aims to get as close as possible to the
concept of "handwritten" code for encoding/decoding.
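A toy version of the interpreted-tree idea might look like this (illustrative only; the real library's node types, and certainly the MethodHandle path, will differ). Each node knows how to read its part of the stream, and a struct node combines child reads into constructor arguments:

```java
import java.io.*;

// Hypothetical sketch of an interpreted read tree, not the litterat
// implementation. A ReadNode reads one value; struct() composes field
// nodes and feeds the results to a constructor function.
public class InterpreterSketch {

    interface ReadNode {
        Object read(DataInput in) throws IOException;
    }

    static final ReadNode READ_FLOAT = in -> in.readFloat();

    // Reads each field in declared order, then builds the object.
    static ReadNode struct(java.util.function.Function<Object[], Object> ctor,
                           ReadNode... fields) {
        return in -> {
            Object[] values = new Object[fields.length];
            for (int i = 0; i < fields.length; i++) {
                values[i] = fields[i].read(in);
            }
            return ctor.apply(values);
        };
    }

    public static void main(String[] args) throws IOException {
        // Encode two floats, as a Point writer would.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeFloat(-37.2333f);
        out.writeFloat(144.45f);

        // A reader tree generated from the Point schema.
        ReadNode pointReader = struct(
                v -> "Point(" + v[0] + ", " + v[1] + ")",
                READ_FLOAT, READ_FLOAT);
        Object p = pointReader.read(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(p);
    }
}
```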

Encoder/Decoder - The final file format just calls the interpreted language
to read/write the data. I've separated the stream interface from the
underlying transport to allow reading/writing to byte array, stream, byte
buffer or others in the future.
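The transport split could be as simple as an interface like the following (a hypothetical sketch, not the actual stream API): the stream layer writes through the interface, and byte arrays, streams or ByteBuffers become interchangeable implementations.

```java
// Hypothetical sketch of separating the stream interface from the
// underlying transport. TypeOutputStream/TypeInputStream would talk to
// this interface rather than to a concrete byte[] or InputStream.
public class TransportSketch {

    interface Transport {
        void write(byte b);
        byte read();
    }

    // Minimal byte-array transport; a ByteBuffer or socket version would
    // implement the same interface.
    static final class ByteArrayTransport implements Transport {
        private final byte[] buffer;
        private int writePos = 0, readPos = 0;
        ByteArrayTransport(byte[] buffer) { this.buffer = buffer; }
        public void write(byte b) { buffer[writePos++] = b; }
        public byte read() { return buffer[readPos++]; }
    }

    public static void main(String[] args) {
        Transport t = new ByteArrayTransport(new byte[4]);
        t.write((byte) 42);
        System.out.println(t.read());
    }
}
```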

One of the interesting challenges with the elements of serialization is the
conceptual leakage between different elements. It is very easy to let the
language influence the data schema. In addition, the data format often
influences the data schema too. I've made a point of trying not to
intermingle these concepts. For example, a data schema has a different
namespace structure and naming convention to Java, so annotations related
to this live in the schema library instead of the PEP library. For instance,
the Point could have the class annotation @SchemaType(namespace="geo",
name="wsg84point") to give it a different namespace and definition name.
Other schema-related annotations might express String lengths and patterns
that are usually implemented procedurally in Java.

I've also hit some interesting special cases with mapping between Java and
a Schema using PEP. For instance, the Optional object implementation in
Java is often just "optional" in a Schema declaration. I've been able to
create an Optional bridge to wrap/unwrap the object as part of the PEP
interface.
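The bridge idea can be sketched in a few lines (illustrative names, not the actual PEP interface): the schema side sees a plain nullable value, and the bridge wraps/unwraps java.util.Optional at the PEP boundary.

```java
import java.util.Optional;

// Hypothetical sketch of an Optional bridge. The schema declares the field
// as "optional"; Java-side Optional wrapping is handled at the boundary.
public class OptionalBridge {

    // Java object -> schema value: an empty Optional becomes null.
    static Object toData(Optional<?> value) {
        return value.orElse(null);
    }

    // Schema value -> Java object: null becomes Optional.empty().
    static <T> Optional<T> toObject(T value) {
        return Optional.ofNullable(value);
    }

    public static void main(String[] args) {
        System.out.println(toData(Optional.of("x")));
        System.out.println(toData(Optional.empty()));
        System.out.println(toObject("x").isPresent());
        System.out.println(toObject(null).isPresent());
    }
}
```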

As David Lloyd said, there are a million considerations and elements that
go into serialization. It's probably not worth digging into all the details
of each element. I've got to that point where this is either just another
solution looking for a problem or something that is worth pursuing further.
So where to from here...

- If anyone on the list has a project/requirement for serialization that I
could try applying this to, I'd be interested to have a go. Please contact
me.

- I'm going to build out some more test cases and ensure more atomic types
are supported. I focused narrowly on the Point test case to start.

- I will spend more time improving the schema and interpreted language
design. I might post some details here, and see if anyone can provide
input. Of course, if that is starting to stray from the Amber topic too far
let me know. :)

- Anyone else got any suggestions of where to take this?

As usual, thanks for the feedback and input.
David.

I've combined all the subprojects into a single repository while lots of
changes are going on (https://github.com/litterat/litterat) if anyone is
interested in having a look.




On Fri, Nov 20, 2020 at 2:13 AM Brian Goetz <brian.goetz at oracle.com> wrote:

>
> >
> > Using the method handles and field setters, the internal library has
> > bypassed two autoboxed Floats and the creation/destruction of the
> > Object[]. For a JSON parser, I wouldn't be too worried about this as
> > text parsing is already expensive and the overhead wouldn't add much,
> > however, in a binary serialization this overhead could add up. In
> > older Java versions I've observed this type of autoboxing in
> > serialization put huge pressure on the garbage collector.
> >
> > Before getting stuck on this issue. Should I care? Will the later Java
> > compiler versions eventually see that the float values don't need to
> > be autoboxed and the Object[] could be put on the stack?
>
> I wouldn't get stuck on it.  If you're going through reflection, there
> are plenty of other costs too; if you're going through method handles,
> there's a lot more that the JIT can do to elide the boxing costs (and
> more when Valhalla comes online.)
>
> > If this PEP library is to be nice and adhere to not setting private
> > final fields directly and use the public constructor, I'm left
> > wondering if there's any way to improve the performance of the first
> > solution without waiting for the optimizer to kick in?
> >
> > The only potential solution I've thought of so far is to get the front
> > end serializer to create a MethodHandle that looks like:
> >
> > constructor( input.readFloat(), input.readFloat() );
> >
> > The problem with this is that the values must be serialized in the
> > correct order. This would potentially be ok for some binary formats.
>
> MHs can handle things like this, though getting the sequencing right is
> tricky (there's no explicit sequencing combinator.)  But this can be done.
>
>
>

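To make the sequencing subtlety from that exchange concrete: with MethodHandles.collectArguments the outermost adaptation runs its filter first, so the reads come out in left-to-right order if the arguments are collected from last to first. A self-contained sketch (the Point class and stream setup here are illustrative, not litterat code):

```java
import java.io.*;
import java.lang.invoke.*;

// Sketch of constructor(input.readFloat(), input.readFloat()) as a single
// MethodHandle, with the reads sequenced in argument order.
public class MHSequencingDemo {

    public static class Point {
        final float latitude;
        final float longitude;
        public Point(float latitude, float longitude) {
            this.latitude = latitude;
            this.longitude = longitude;
        }
    }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodHandle ctor = lookup.findConstructor(
                Point.class,
                MethodType.methodType(void.class, float.class, float.class));
        MethodHandle readFloat = lookup.findVirtual(
                DataInputStream.class, "readFloat",
                MethodType.methodType(float.class));

        // Collect the last constructor argument first. At invocation time
        // the outermost adaptation (position 0) runs its filter first, so
        // the floats are read from the stream left-to-right.
        MethodHandle mh = MethodHandles.collectArguments(ctor, 1, readFloat);
        mh = MethodHandles.collectArguments(mh, 0, readFloat);
        // (DataInputStream, DataInputStream)Point -> (DataInputStream)Point
        mh = MethodHandles.permuteArguments(
                mh, MethodType.methodType(Point.class, DataInputStream.class),
                0, 0);

        // Write latitude then longitude, read back through the one handle.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeFloat(-37.2333f);
        out.writeFloat(144.45f);

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        Point p = (Point) mh.invoke(in);
        System.out.println(p.latitude == -37.2333f && p.longitude == 144.45f);
    }
}
```

If the two collectArguments calls are swapped, the reads happen in the wrong order and the fields come back transposed, which is exactly the implicit-sequencing trap.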

More information about the amber-dev mailing list