Implementing Towards Better PEP/Serialization

Sun Apr 18 11:55:38 UTC 2021

I've finished the first draft of a Java data binding library for
serialisation [1]. When I wrote the first message on the topic in
August last year, I was starting on a new serialisation format to scratch an
itch I've had. After a nine-month detour, I think I might be ready to
return to the original task. Before doing that, I thought I'd share where I
ended up with the data binding library.

In a previous post I said that my exploration of "what is data" led me to
realise that algebraic data types underlie all data schemas. I ended up
with Records (made up of Fields that are required, optional or Arrays),
Atoms and Unions. So my data library can now automatically serialise
something like:

// A sealed interface is a union.
public static sealed interface Shape permits Point, Circle {}
public static record Point(int x, int y) implements Shape {}
public static record Circle(int x, int y, int radius) implements Shape{}
public static record ShapeList(List<Shape> list) {}
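
As a rough sketch of the metadata the JDK itself exposes for these types,
the following uses only plain reflection. It is not the library's API: the
class name ShapeMetadata and the describe method are made up for
illustration, and it assumes Java 17+ so that records and sealed types are
both final features.

```java
import java.lang.reflect.RecordComponent;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: names here are hypothetical, not part of litterat-bind.
final class ShapeMetadata {
    sealed interface Shape permits Point, Circle {}
    record Point(int x, int y) implements Shape {}
    record Circle(int x, int y, int radius) implements Shape {}

    // Describe a type as a union (sealed), a record, or otherwise an atom,
    // using only what java.lang.Class reports.
    static String describe(Class<?> type) {
        if (type.isSealed()) {
            List<String> members = new ArrayList<>();
            // Note: the order of permitted subclasses is unspecified.
            for (Class<?> member : type.getPermittedSubclasses()) {
                members.add(member.getSimpleName());
            }
            return "union " + type.getSimpleName() + " = "
                + String.join(" | ", members);
        }
        if (type.isRecord()) {
            List<String> fields = new ArrayList<>();
            // Record components are returned in declaration order.
            for (RecordComponent c : type.getRecordComponents()) {
                fields.add(c.getType().getSimpleName() + " " + c.getName());
            }
            return "record " + type.getSimpleName()
                + "(" + String.join(", ", fields) + ")";
        }
        return "atom " + type.getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(describe(Shape.class));
        System.out.println(describe(Point.class));
        System.out.println(describe(Circle.class));
    }
}
```

A schema generator could walk the same information recursively, following
record component types and permitted subclasses.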

So Java 16+ provides the fundamentals for better serialisation. Yay. The
library can get all the required metadata through reflection and could
generate something like a JSON-Schema from the above. However, I wanted to
target Java 11+, and there are still some things Java doesn't provide. So a
lot of the library is aimed at providing a consistent interface to make
things look like Records, Unions, Atoms, Fields and Arrays. For example,
adding a @Record annotation to the following allows the library to access
the class as a record through the same interfaces:

@Record
class Point {
  private final int x;
  private final int y;

  // order of fields is based on the constructor, using byte code analysis
  // to match accessors
  public Point(int x, int y) { this.x = x; this.y = y; }
  public int x() { return x; }
  public int y() { return y; }
}

This also works on POJOs:

@Record
@FieldOrder({ "x", "y" })  // defaults to alpha order.
class Point {
  private int x;
  private int y;
  public void setX(int x) { this.x = x; }
  public int getX() { return x; }
  public void setY(int y) { this.y = y; }
  public int getY() { return y; }
}
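
To give a feel for how a setter-based POJO could be given a record-style
constructor, here is a minimal sketch using only java.lang.invoke. It is
not how the library does it internally; PojoConstructorSketch, construct,
pointConstructor and newPoint are hypothetical names, and the field order
is hard-coded rather than read from an annotation.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Illustrative only: names here are hypothetical, not part of litterat-bind.
final class PojoConstructorSketch {
    // The setter-based POJO from above, without the annotations.
    public static class Point {
        private int x;
        private int y;
        public void setX(int x) { this.x = x; }
        public int getX() { return x; }
        public void setY(int y) { this.y = y; }
        public int getY() { return y; }
    }

    // Generic step: create an empty instance, then apply each setter in order.
    static Object construct(MethodHandle ctor, MethodHandle[] setters,
                            Object[] values) throws Throwable {
        Object instance = ctor.invoke();
        for (int i = 0; i < setters.length; i++) {
            setters[i].invoke(instance, values[i]);
        }
        return instance;
    }

    // Build a handle of type (int, int)Point from the no-arg constructor
    // and the two setters.
    static MethodHandle pointConstructor() throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodHandle ctor = lookup.findConstructor(Point.class,
            MethodType.methodType(void.class));
        MethodType setterType = MethodType.methodType(void.class, int.class);
        MethodHandle[] setters = {
            lookup.findVirtual(Point.class, "setX", setterType),
            lookup.findVirtual(Point.class, "setY", setterType)
        };
        MethodHandle generic = lookup.findStatic(PojoConstructorSketch.class,
            "construct", MethodType.methodType(Object.class,
                MethodHandle.class, MethodHandle[].class, Object[].class));
        // Bind the constructor and setters, collect the trailing arguments
        // into an Object[], then adapt to the record-style signature.
        return MethodHandles.insertArguments(generic, 0, ctor, setters)
            .asCollector(Object[].class, 2)
            .asType(MethodType.methodType(Point.class, int.class, int.class));
    }

    // Convenience wrapper so callers need not handle Throwable.
    static Point newPoint(int x, int y) {
        try {
            return (Point) pointConstructor().invokeExact(x, y);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

The resulting handle has the same shape as a record's canonical
constructor, so a serialiser can treat both the same way.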

Through the library, the three versions of the Point record above end up
with the same MethodHandle constructor, (int x, int y):RecordPoint. In a
similar way, a @Union annotation can be added to interfaces or abstract
base classes to create a construct similar to sealed interfaces/classes:

// The listed classes make it a sealed union.
@Union({ Point.class, Circle.class })
public interface Shape {
  public int x();
  public int y();
}

Or as an abstract class:

@Union({ Point.class, Circle.class })
public abstract class Shape {
  private final int x;
  private final int y;
  public Shape(int x, int y) { this.x = x; this.y = y; }
  public int x() { return x; }
  public int y() { return y; }
}

As Java doesn't have a concept of a parameter union, the annotation can be
added to fields:

public record SomeThing(
    @Union({ String.class, Integer.class }) Object identifier ) {}

The MethodHandle constructor and accessor will ensure that only String or
Integer are passed in or returned.
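
A minimal sketch of that kind of guard, assuming plain java.lang.invoke
and leaving out the annotation processing entirely. UnionFieldSketch,
checkMember, guardedConstructor and create are hypothetical names, and the
permitted types are hard-coded here rather than read from @Union:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Illustrative only: names here are hypothetical, not part of litterat-bind.
final class UnionFieldSketch {
    // The record from above without the annotation (Java 16+).
    record SomeThing(Object identifier) {}

    // Passes the value through only if it is a String or an Integer.
    static Object checkMember(Object value) {
        if (value instanceof String || value instanceof Integer) {
            return value;
        }
        throw new IllegalArgumentException(
            "not a member of union(String | Integer): " + value);
    }

    // Constructor handle (Object)SomeThing with the membership check
    // applied to its single argument.
    static MethodHandle guardedConstructor() throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodHandle ctor = lookup.findConstructor(SomeThing.class,
            MethodType.methodType(void.class, Object.class));
        MethodHandle check = lookup.findStatic(UnionFieldSketch.class,
            "checkMember", MethodType.methodType(Object.class, Object.class));
        return MethodHandles.filterArguments(ctor, 0, check);
    }

    // Convenience wrapper so callers need not handle Throwable.
    static SomeThing create(Object identifier) {
        try {
            return (SomeThing) guardedConstructor().invokeExact(identifier);
        } catch (RuntimeException e) {
            throw e;
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

The accessor side could be guarded the same way with
MethodHandles.filterReturnValue.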

There's more work to do on atomic values, but the general concept of
Projected Embedded Pairs discussed last year, which defines paired
MethodHandles of toData and toObject, has worked well. An @Atom annotation
can be used to define how the atom's value is accessed and created. For
instance:

import java.util.HashMap;
import java.util.Map;

public class IntAtom {
  private final int id;
  private IntAtom(int id) { this.id = id; }
  private static final Map<Integer, IntAtom> atomList = new HashMap<>();

  @Atom
  public int id() { return id; }

  @Atom
  public static final IntAtom getAtom(int id) {
    IntAtom atom = atomList.get(id);
    if (atom == null) { atom = new IntAtom(id); atomList.put(id, atom); }
    return atom;
  }
}
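
For illustration, the toData/toObject pair for an atom like this can be
expressed as two MethodHandles. This is only a sketch under my own naming
(AtomPairSketch, TO_DATA, TO_OBJECT); the handles are looked up by name
here, whereas the library would find them through the @Atom annotations.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: names here are hypothetical, not part of litterat-bind.
final class AtomPairSketch {
    // Same shape as the IntAtom above, without the annotations.
    public static final class IntAtom {
        private static final Map<Integer, IntAtom> ATOMS = new HashMap<>();
        private final int id;
        private IntAtom(int id) { this.id = id; }
        public int id() { return id; }
        public static IntAtom getAtom(int id) {
            return ATOMS.computeIfAbsent(id, IntAtom::new);
        }
    }

    static final MethodHandle TO_DATA;   // (IntAtom)int
    static final MethodHandle TO_OBJECT; // (int)IntAtom

    static {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            TO_DATA = lookup.findVirtual(IntAtom.class, "id",
                MethodType.methodType(int.class));
            TO_OBJECT = lookup.findStatic(IntAtom.class, "getAtom",
                MethodType.methodType(IntAtom.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Convenience wrappers so callers need not handle Throwable.
    static int toData(IntAtom atom) {
        try {
            return (int) TO_DATA.invokeExact(atom);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    static IntAtom toObject(int id) {
        try {
            return (IntAtom) TO_OBJECT.invokeExact(id);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

Because getAtom interns instances, toObject(toData(atom)) returns the same
object that went in.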

For record fields, the complexities of primitives, nullable values and
Optional made it a little harder to create uniform constructor/accessor
pairs. For the constructor it made sense to use null as the present/not
present indicator. For the accessor I used an isPresent and getter pair.
This way the library can pass through the concept of optional for
nullable, Optional and OptionalInt/Long using the same interface. For
example:

public record SomeThing( @Field(required=true) String name,
    Optional<String> middleName, OptionalInt value ) {}

The constructor MethodHandle for the above ends up as ( String, String,
Integer ):SomeThing. The Optional values are "wrapped" in the constructor.
The accessor MethodHandles for all fields then use an isPresent and getter
combination. The isPresent can either check for null, return true for
primitives or pass the call through to the Optional isPresent
implementation. This way the serializer doesn't need special-case code for
different kinds of field values.
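
Roughly, the wrapping and the isPresent/getter pair can be composed with
filterArguments and filterReturnValue. The following is a sketch under my
own naming (OptionalFieldSketch, wrapInt and the helper methods are
hypothetical), without the @Field annotation, and only showing the
accessor pair for middleName:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.Optional;
import java.util.OptionalInt;

// Illustrative only: names here are hypothetical, not part of litterat-bind.
final class OptionalFieldSketch {
    record SomeThing(String name, Optional<String> middleName,
                     OptionalInt value) {}

    // null means "not present" on the constructor side.
    static OptionalInt wrapInt(Integer v) {
        return v == null ? OptionalInt.empty() : OptionalInt.of(v);
    }

    // (String, String, Integer)SomeThing: the Optional arguments are wrapped.
    static MethodHandle constructor() throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodHandle ctor = lookup.findConstructor(SomeThing.class,
            MethodType.methodType(void.class, String.class, Optional.class,
                OptionalInt.class));
        MethodHandle wrapOptional = lookup.findStatic(Optional.class,
                "ofNullable", MethodType.methodType(Optional.class, Object.class))
            .asType(MethodType.methodType(Optional.class, String.class));
        MethodHandle wrapOptionalInt = lookup.findStatic(
            OptionalFieldSketch.class, "wrapInt",
            MethodType.methodType(OptionalInt.class, Integer.class));
        return MethodHandles.filterArguments(ctor, 1, wrapOptional,
            wrapOptionalInt);
    }

    // (SomeThing)boolean: passes through to Optional.isPresent.
    static MethodHandle middleNamePresent() throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodHandle accessor = lookup.findVirtual(SomeThing.class,
            "middleName", MethodType.methodType(Optional.class));
        MethodHandle isPresent = lookup.findVirtual(Optional.class,
            "isPresent", MethodType.methodType(boolean.class));
        return MethodHandles.filterReturnValue(accessor, isPresent);
    }

    // (SomeThing)String: unwraps the Optional; only call when present.
    static MethodHandle middleNameGetter() throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodHandle accessor = lookup.findVirtual(SomeThing.class,
            "middleName", MethodType.methodType(Optional.class));
        MethodHandle get = lookup.findVirtual(Optional.class, "get",
                MethodType.methodType(Object.class))
            .asType(MethodType.methodType(String.class, Optional.class));
        return MethodHandles.filterReturnValue(accessor, get);
    }

    // Convenience wrappers so callers need not handle Throwable.
    static SomeThing create(String name, String middleName, Integer value) {
        try {
            return (SomeThing) constructor().invokeExact(name, middleName, value);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    static boolean hasMiddleName(SomeThing s) {
        try {
            return (boolean) middleNamePresent().invokeExact(s);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    static String middleName(SomeThing s) {
        try {
            return (String) middleNameGetter().invokeExact(s);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }
}
```

A serialiser written against these handles only ever sees "is it present?"
followed by "give me the value", regardless of how the field is declared.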

That's probably enough detail. Thanks to Brian and everyone else who gave
me feedback along the way. I'm pretty happy with how this has turned
out. I don't think it helps solve the Serializable problem of Java, but I
think that was always a different problem from the one I've been solving.
I'd appreciate any feedback, on the list or direct.

Regards,
David.

[1] https://github.com/litterat/litterat/tree/main/litterat-bind

On Sat, Mar 20, 2021 at 11:45 PM David Ryan <david at livemedia.com.au> wrote:

>
> I've spent the last while refining the serialization model and looking at
> how data maps to/from Java. The model consists of:
>
> Record - Product type with fields (required or optional) of specific types.
> Union - Sum type or tagged union.
> Atom - Any atomic value that has one or more representations.
> Array - Repeating element.
> Annotations - Additional data.
>
> I've been reading the JEP tea-leaves and can see a rather interesting
> pattern emerging:
>
> JEP395 - Records (maps directly to data product types)
> JEP397 - Sealed Classes (maps directly to tagged union/sum types)
> JEP401/402 - Primitive objects (maps directly to atom types)
> JEP8261099 - Frozen arrays (provides equivalent of final array values)
>
> Combining these concepts provides a really strong data serialization
> basis. It's like you've had a plan around this all along?
>
> While investigating the mapping of tagged unions from data to/from java
> I've come up with two interesting problems. Imagine a data spec with a
> schema like (please ignore syntax):
>
> some_record: record( some_field:union( string | int ) )
>
> The idea here is that some_field is a field that can either be a string or
> int. There are a couple of ways this could be mapped to Java. Option1 as
> different fields:
>
> public record SomeRecord(int someFieldInt, String someFieldString)  {...}
>
> or using sealed type.
>
> public sealed interface SomeFieldType
>     permits SomeFieldInt, SomeFieldString {...}
> public record SomeFieldInt( int someField ) {...}
> public record SomeFieldString( String someField ) {...}
> public record SomeRecord( SomeFieldType someField ) { ... }
>
> *Question1*: Any other ways this could be mapped?
> *Question2*: Are there any plans/thoughts/ideas to add a short-hand
> tagged union type to Java? So I can do:
>
> public record SomeRecord( int|String someField ) {...}
>
> I can sort of see that pattern matching might provide a path to supporting
> this through the language.
>
> The second problem is that Java base classes that are instantiable map to
> both records and tagged unions on the data side.
>
> *Question3*: I was wondering if that was one of the reasons why records
> are final?
>
> Take the following example:
>
> class Vehicle {
>    private final String make;
>    private final String model;
>    private final int year;
>    ...
> }
> class Car extends Vehicle {
>    private final int horsePower;
>    ...
> }
> class Truck extends Vehicle {
>    private final int numberOfAxles;
>    ...
> }
>
> If I were to remodel this using Records and a sealed interface, I'd need
> to repeat the common fields in each record and separate the tagged union
> Vehicle from the record type. Something like:
>
> public sealed interface Vehicle permits GenericVehicle, Car, Truck {...}
> public record GenericVehicle(String make, String model, int year) {...}
> public record Car(String make, String model, int year,
>     int horsePower) {...}
> public record Truck(String make, String model, int year,
>     int numberOfAxles) {...}
>
> This now maps nicely into a data grammar/schema whereas base classes would
> require an additional structural type on the data side:
>
> vehicle: union( genericVehicle, car, truck );
> genericVehicle: record( string make, string model, int year );
> car: record( string make, string model, int year, int horsePower );
> truck: record( string make, string model, int year, int numberOfAxles );
>
> The upshot of this is that I'm currently contemplating not allowing
> anything but abstract base classes to be serializable in the Java PEP/bind
> library I've been working on. Obviously out of scope for amber, but is an
> interesting finding when looking at how to map between Java/OO and data
> schemas.
>
> One other conceptual mismatch between java and data I've found is
> "required vs optional" fields for records.
>
> *Question4*: Given the move to primitive objects, I was also wondering if
> Optional will also become primitive at some point?
>
> Regards,
> David.