Implementing Towards Better PEP/Serialization

David Ryan david at livemedia.com.au
Thu Jan 28 06:38:28 UTC 2021


Over the last month or so I fell deeper into the rabbit hole of
serialization (if that were possible). When I was working on the language
PEP interface the question of finding "the ways that a class can be 'data'"
kept bugging me. This led me down the path of asking what data is and
in what ways data structures can be specified. What was initially
going to be a review of different data and schema solutions turned into an
exercise of attempting to understand the concepts that underlie all
serialization. The output is a serialization theory document[1] where I've
documented what I could find regarding the science of serialization. I've
been in my own echo chamber writing this for the last month, so if anyone
could give me feedback or can point me to relevant papers written on the
subject of serialization, I'd like to hear it.

The main finding is that algebraic data types (product types and
sum types) are the basis of schema languages. They all have some form of
product type (record, sequence, struct) and sum type (choice, union,
oneOf, |). For product types, each field has a cardinality of
required, optional, or dynamic size (array). After you add in various atomic
types, you cover 90% of serialization. The only other structural element I
could find is annotations (XML attributes are one example), but I haven't
found a good implementation of them.
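As a concrete illustration of the product/sum claim above, here is a minimal sketch (the class and type names are my own, not from any schema language or the Litterat code) of the two building blocks in plain Java: a record as a product type with required, optional, and array-cardinality fields, and a sealed interface as a sum type:

```java
import java.util.Optional;

public class AdtSketch {

    // Product type: every value carries all fields. Cardinality is shown
    // per field: required, optional, and dynamic size (array).
    record Person(String name, Optional<String> nickname, String[] emails) {}

    // Sum type: a value is exactly one of the permitted alternatives.
    sealed interface Shape permits Circle, Rect {}
    record Circle(double radius) implements Shape {}
    record Rect(double w, double h) implements Shape {}

    // Consuming a sum type is an exhaustive case analysis over its
    // alternatives; the compiler checks that no case is missing.
    static double area(Shape s) {
        return switch (s) {
            case Circle c -> Math.PI * c.radius() * c.radius();
            case Rect r -> r.w() * r.h();
        };
    }

    public static void main(String[] args) {
        Shape s = new Rect(3, 4);
        System.out.println(area(s)); // prints 12.0
    }
}
```

The same two shapes map directly onto the schema-language terms listed above: the record is the "record/sequence/struct" form, and the sealed hierarchy is the "choice/union/oneOf" form.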

This relates back to the original discussion regarding projected embedded
pairs from last year and finding how best to decomplect the various aspects
of serialization. The previous discussion was about finding function
pairs (to data/from data) for serialization. The current implementation I've
been working on forces all Java data classes through a homogeneous record
style interface using method handles. More recently I added another meta
interface that provides a homogeneous interface for arrays and collections.
The array interface uses the following method handle signatures:

   <ArrayType> constructor( int length );
   int size( <ArrayType> );
   <Iterator> iterator( <ArrayType> );
   void put( <Iterator>, <ArrayType>, <Value> );
   <Value> get( <Iterator>, <ArrayType> );
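To make the five signatures concrete, here is a hypothetical adapter showing how a java.util.List could satisfy them (the real implementation binds these as method handles; plain static methods are used here for readability, and the class name is mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ListIterator;

public class ListArrayAdapter {

    // <ArrayType> constructor( int length ) -- size is known up front.
    static List<Object> constructor(int length) {
        return new ArrayList<>(length);
    }

    // int size( <ArrayType> )
    static int size(List<Object> array) {
        return array.size();
    }

    // <Iterator> iterator( <ArrayType> ) -- a fresh cursor over the array.
    static ListIterator<Object> iterator(List<Object> array) {
        return array.listIterator();
    }

    // void put( <Iterator>, <ArrayType>, <Value> ) -- appends at the cursor.
    static void put(ListIterator<Object> it, List<Object> array, Object value) {
        it.add(value);
    }

    // <Value> get( <Iterator>, <ArrayType> ) -- reads the next element.
    static Object get(ListIterator<Object> it, List<Object> array) {
        return it.next();
    }

    public static void main(String[] args) {
        List<Object> a = constructor(2);
        ListIterator<Object> w = iterator(a);
        put(w, a, "x");
        put(w, a, "y");
        ListIterator<Object> r = iterator(a);
        System.out.println(get(r, a) + "," + get(r, a)); // prints x,y
    }
}
```

A plain Java array would satisfy the same signatures with an integer index standing in for the iterator, which is why the constructor takes a length.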

These signatures allow wrapping both Java array types and collections. The
only downside is that the size must be known prior to construction to
account for array implementations. The following examples of using the
array interface are taken from PepMapMapper[2], which serializes to/from a
Map:

  Object arrayData = object;
  int length = (int) arrayClass.size().invoke(arrayData);
  Object[] outputArray = new Object[length];
  Object iterator = arrayClass.iterator().invoke(arrayData);

  PepDataClass arrayDataClass = arrayClass.arrayDataClass();

  for (int x = 0; x < length; x++) {
     Object av = arrayClass.get().invoke(iterator, arrayData);
     outputArray[x] = toMap(arrayDataClass, av);
  }


Reading and constructing an array/collection:

  Object[] inputArray = (Object[]) data;

  int length = inputArray.length;
  Object arrayData = arrayClass.constructor().invoke(length);
  Object iterator = arrayClass.iterator().invoke(arrayData);

  PepDataClass arrayDataClass = arrayClass.arrayDataClass();

  for (int x = 0; x < length; x++) {
    arrayClass.put().invoke(iterator, arrayData,
        toObject(arrayDataClass, inputArray[x]));
  }

With this, product types (records, etc) and Array/Collections are
implemented (of the four identified types). There's more work to
investigate how sum types and annotations map to Java. But if you accept my
conclusions above regarding the reduced set of concepts in serialization,
then the homogeneous function pairs (to data/from data) for record, array,
annotation, and various atomic types will cover all scenarios. My
conclusion is that I've found all the ways data can be specified and now
finding the ways a class can be data is a lot easier to define and can be
incrementally added.

The next step is to investigate how new atomic types can be specified by
the same type system (allowing new atomic types to be added). A related
topic is to investigate how macros or lambdas might be relevant in
serialization specifications. I've got a feeling lambdas/macros could be
useful for data specifications, but I just haven't worked out how yet.

This is working towards a serialization model and format that encodes the
schema of types prior to first use in the stream. In my review of data
formats, it is interesting that Java's native serialization is the only
example of a popular format that does this. The plan is to have an
extensible binary schema format that allows both new atomic types and meta
types, and allows dynamic type agreement/resolution between client and server
(this work was done in a previous implementation).
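To illustrate the idea of carrying schemas in-band (this is a toy sketch of the general technique, not the Litterat wire format; the tags, ids, and schema text are all placeholders), a stream can interleave DEFINE entries, which introduce a type id and its schema once, with VALUE entries that reference the id:

```java
import java.io.*;
import java.util.*;

public class SchemaFirstStream {
    static final byte DEFINE = 1, VALUE = 2;

    static byte[] write() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        // The schema travels in the stream, once, before the first value
        // of the type it describes.
        out.writeByte(DEFINE);
        out.writeInt(1);                       // type id
        out.writeUTF("record Point(int,int)"); // schema text (placeholder)
        out.writeByte(VALUE);
        out.writeInt(1);                       // type id of this value
        out.writeInt(3); out.writeInt(4);      // the Point fields
        return bytes.toByteArray();
    }

    static String read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        Map<Integer, String> schemas = new HashMap<>();
        StringBuilder result = new StringBuilder();
        while (in.available() > 0) {
            byte tag = in.readByte();
            if (tag == DEFINE) {
                schemas.put(in.readInt(), in.readUTF());
            } else {
                // By the time a VALUE arrives, the reader already holds
                // the schema for its type id.
                String schema = schemas.get(in.readInt());
                result.append(schema).append(" = ")
                      .append(in.readInt()).append(",").append(in.readInt());
            }
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(read(write()));
    }
}
```

The reader never needs out-of-band schema agreement: everything required to decode a value has already appeared earlier in the same stream, which is the property being aimed for above.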

Finally, I think the relevance to Amber and this mailing list is feeling
well and truly tenuous. I'm documenting my progress here[3], but won't post
any longer unless I think there's a higher degree of relevance. Thanks for
the input and feedback so far! Also, I'm currently looking for work if
anyone needs a serialization specialist. :)

Thanks,
David.

[1] https://github.com/litterat/litterat/blob/main/litterat-theory.md
[2]
https://github.com/litterat/litterat/blob/main/litterat-json/src/main/java/io/litterat/json/JsonMapper.java
[3] https://github.com/litterat/litterat/blob/main/100DaysOfCode.md


More information about the amber-dev mailing list