Implementing Towards Better PEP/Serialization

Sun Aug 16 10:23:42 UTC 2020

I'm back a bit sooner than I expected. I've had a busy week digging into
the concepts and developing a proof of concept. It would be great to get
some feedback on the direction I'm heading and how it aligns or doesn't
align with yours. Before I discuss the POC, I wanted to go back over the
requirements and how I arrived at the current proof of concept. It should
provide the rationale and give you or anyone else a chance to poke holes in
it.

To restate the summarized requirements, drawn from our discussion and the
original proposal:

 - Data not objects: “applications use serialization to persist data, or to
exchange data with other applications. Not objects; data.” And "We should
seek to support serialization of data, not objects."

 - Make serialization explicit and bring serialization into the object
model:  "If a class has an externally accessible
construction/deconstruction/access protocol, it should be easy to just say
'use that for serialization too' (records do this already.)"

 - Use the "front door" api, or let the developer specify "back door" api
and secure it:  "A class should be able to expose a serialization protocol
without exposing that as part of the public API."

 - Move away from "readObject/writeObject" mechanisms. This encodes the
form of the data in code instead of a form that can be used for automated
serialization and schema definitions.

 - Backward compatible. My requirement, not yours. It should be possible to
put together an implementation in Java 11 at minimum.

Step 1: Define Data.

Based on those requirements and much more detailed discussion in your
proposal, the pattern matching constructor/destruction pattern was
proposed. A reason you're drawn to the ctor/dtor mechanism is that it forms
an embedding-projection pair. Based on this, and reading between the lines,
would it be safe to say that what we're after is in fact an
embedding-projection pair between an Object and Data?

   e: Data -> Object/Class
   p: Object/Class -> Data

If you accept that as the core requirement, then before we go much further
we better decide what "Data" means. For the purposes of Java language
design, "Data" is not the byte stream encoding. As the proposal states, and
I agree, "the stream format is probably the least interesting part of the
serialization mechanism". Based on this, "Data" is defined as something
between the encoding and the object. However, if we use your ctor/dtor
mechanism as the example, then "Data" is the parameter list tuple. We can
then make a small leap and say that "Data" in Java is an Object[] of values
and the associated metadata (types, names and order). Once again, reading
between the lines of what "better serialization" means, I think it is an
embedded projection pair of:

  e: Object[] -> Object
     (Metadata)  (Class)

  p: Object -> Object[]
     (Class)   (metadata)

Note: It's the metadata associated with the Object[] that is really what
we're after. A serialization protocol could bypass the Object[] in-memory
representation altogether. The Object[] is just the simplest way to
represent the tuple in java for the rest of the discussion. The Java
serialization implementation has the metadata we're talking about
implemented in the ObjectStreamClass and the ObjectStreamField. So,
potentially, an aim is to create a better ObjectStreamClass that can be
used by serialization libraries without that magic it currently contains?
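
As a rough illustration of the pair above, here is a minimal sketch of an
ep-pair between Point and Object[] that also carries the metadata (names,
types, order). EpPair and PointEpPair are hypothetical names invented for this
example; they are not part of the JDK or of any existing library.

```java
import java.util.List;

// Hypothetical interface: one possible shape for an Object <-> Object[] ep-pair.
interface EpPair<T> {
    List<String> names();         // metadata: component names, in order
    List<Class<?>> types();       // metadata: component types, in order
    Object[] project(T object);   // p: Object -> Object[]
    T embed(Object[] values);     // e: Object[] -> Object
}

final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    int x() { return x; }
    int y() { return y; }
}

// Hand-written pair for Point; a library would derive this from the class.
final class PointEpPair implements EpPair<Point> {
    public List<String> names() { return List.of("x", "y"); }

    public List<Class<?>> types() { return List.of(int.class, int.class); }

    public Object[] project(Point p) {
        return new Object[] { p.x(), p.y() };
    }

    public Point embed(Object[] v) {
        return new Point((Integer) v[0], (Integer) v[1]);
    }
}
```

Projecting a Point and embedding the resulting array should give back an
equivalent Point, which is the round-trip property the pair is meant to have.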

From this embedded projection pair between Object and Object[] a
serialization library can come along and add the encoding:

  Serialization:   Object   -> Object[]   -> encoding
                   (class)     (metadata)    (schema)

  Deserialization: encoding -> Object[]   -> Object
                   (schema)    (metadata)    (class)
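
To make the layering concrete, here is a small sketch in which a toy text
encoding sits on top of the Object[] projection. ToyCodec and its
comma-separated format are assumptions made up for this example and only
handle int components; a real library would supply its own encoding and
schema.

```java
// Hypothetical sketch: a toy encoding layered on an Object[] projection.
final class Point {
    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

final class ToyCodec {
    // Metadata for Point: component types, in projection order.
    static final Class<?>[] POINT_TYPES = { int.class, int.class };

    // Object -> Object[]
    static Object[] project(Point p) {
        return new Object[] { p.x, p.y };
    }

    // Object[] -> Object
    static Point embed(Object[] v) {
        return new Point((Integer) v[0], (Integer) v[1]);
    }

    // Object[] -> encoding (comma-separated text)
    static String encode(Object[] values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(values[i]);
        }
        return sb.toString();
    }

    // encoding -> Object[], driven by the metadata rather than by the class
    static Object[] decode(String text, Class<?>[] types) {
        String[] parts = text.split(",");
        if (parts.length != types.length) {
            throw new IllegalArgumentException("expected " + types.length + " values");
        }
        Object[] values = new Object[types.length];
        for (int i = 0; i < types.length; i++) {
            if (types[i] != int.class) {
                throw new IllegalArgumentException("unsupported type: " + types[i]);
            }
            values[i] = Integer.valueOf(parts[i]);
        }
        return values;
    }
}
```

Note that encode and decode never touch Point itself; they only see the array
and its metadata, which is the separation the diagram describes.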

On this basis, I've changed the title to "Implementing Towards Better PEP
(Projection-Embedded Pairs)", as that's the key concept that can help
Serialization. There's another side discussion to be had regarding what, if
any, restrictions could be placed on the elements of the Object[]. I don't
think it matters, but a class could project data elements in the array that
can't be serialized.

Step 2: Possible embedded projection pairs.

Now that I've been shown the embedded projection pair hammer, everything
looks like a nail. :) So, using the embedded projection pairs between
Object[] and Object, what mechanisms can be found to implement it using
front door APIs:

Class specified:
  Constructor:Destructor - An n-arg constructor with n-arg destructor. The
proposal suggests using this pattern, but it is not available yet.
  Constructor/Accessors - Available, but potentially difficult to match
parameters from constructor with accessors.
  Setters/Getters - Simple, but requires a no-args constructor and mutable
objects, while a lot of developers are moving towards immutability.
  Encapsulated projection - The class has an alternative form and provides
constructor and accessor for the alternate form. The alternate form
recursively uses another mechanism listed here. Requires something to
inform if data is encapsulated or the encapsulation is the data.

Externally specified:
  Encapsulated embedding - An external class extracts and embeds a target
class, with the target class not having defined a direct embedded
projection pair.
  Intermediate ep-pair - A third class that provides both projection and
embedding functions between two other classes.

There are variations on the above with factory classes, facades, etc., but
they can generally be fit into those categories.
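
As one example of the externally specified category, the sketch here shows an
external class that projects and embeds java.time.LocalDate, a class that
defines no ep-pair of its own. LocalDateBridge is a hypothetical name chosen
for illustration; only the LocalDate calls are real JDK API.

```java
import java.time.LocalDate;

// Hypothetical external bridge: the target class (LocalDate) is untouched.
final class LocalDateBridge {
    // Projection: LocalDate -> Object[] { year, month, day }
    static Object[] project(LocalDate date) {
        return new Object[] {
            date.getYear(), date.getMonthValue(), date.getDayOfMonth()
        };
    }

    // Embedding: Object[] { year, month, day } -> LocalDate
    static LocalDate embed(Object[] v) {
        return LocalDate.of((Integer) v[0], (Integer) v[1], (Integer) v[2]);
    }
}
```

The bridge only uses the front door API of LocalDate (its accessors and its
static factory), so nothing about the target class needs to change.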

Step 3: Solve the Constructor/Accessors parameter matching

>> As I mentioned previously, I've hit a bit of a road-block with my design
>> as I'll need a fallback solution for users on Java 11 (current target
>> version). An annotation on the ctor or its parameters is the likely
>> solution:

> Yeah,  that’s ugly but doable.  Your users won’t like you.

This comment rolled around in my head for a little while, so I looked
closer at the problem. In many cases the classes we're talking about that
have immutable fields have the following form:

   public class Point {
      private final int x;
      private final int y;

      public Point( int x, int y ) {
         this.x = x;
         this.y = y;
      }

      public int x() { return x; }
      public int y() { return y; }
   }

It is pretty clear from our perspective that the constructor parameters
match up with x,y fields and x,y accessors. However, constructor parameter
names are not retained in the class file unless it is compiled with
-parameters, so reflection alone doesn't help. If we can prove that
constructor parameters are invariant before being written to the field, we
can safely match the constructor to the fields/accessors. So, doing some
deep reflection, we can implement a really simple check for invariance by
finding the following patterns and extracting the parameter and field id.

       4: aload_0
       5: iload_1
       6: putfield      #14                 // Field x:I

This can then be matched up with the accessor:

       0: aload_0
       1: getfield      #14                 // Field x:I
       4: ireturn

Based on the above, the class can have its "Data" made available with
no additional annotations. I'm sure there's plenty of research and
implementations for testing for parameter invariance that could be applied
here.
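
To show the shape of that pattern match, here is a deliberately simplified
sketch. It works on an already-decoded list of instruction strings (such as
"iload_1" or "putfield x") rather than on a real class file, it only
recognises the short-form load/store opcodes for slots 0-3, and it ignores
that long and double parameters occupy two slots. InvarianceSketch and its
methods are hypothetical names, not the pep-java implementation.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: instructions are plain strings, not real bytecode.
final class InvarianceSketch {

    // Maps local-variable slot -> field name for each
    // "aload_0, <t>load_N, putfield F" triple whose slot N has not been
    // stored to earlier in the method (the simple invariance check).
    static Map<Integer, String> slotToField(List<String> code) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Set<Integer> reassigned = new HashSet<>();
        for (int i = 0; i < code.size(); i++) {
            String insn = code.get(i);
            if (insn.matches("[ilfda]store_[0-3]")) {
                reassigned.add(insn.charAt(insn.length() - 1) - '0');
            }
            if (i + 2 >= code.size()) continue;
            String load = code.get(i + 1);
            String put = code.get(i + 2);
            if (insn.equals("aload_0")
                    && load.matches("[ilfda]load_[0-3]")
                    && put.startsWith("putfield ")) {
                int slot = load.charAt(load.length() - 1) - '0';
                if (!reassigned.contains(slot)) {
                    result.put(slot, put.substring("putfield ".length()).trim());
                }
            }
        }
        return result;
    }

    // Returns the field read by a trivial accessor
    // ("aload_0, getfield F, <t>return"), or null if the body is anything else.
    static String accessorField(List<String> code) {
        boolean trivial = code.size() == 3
                && code.get(0).equals("aload_0")
                && code.get(1).startsWith("getfield ")
                && code.get(2).matches("[ilfda]return");
        return trivial ? code.get(1).substring("getfield ".length()).trim() : null;
    }
}
```

For an instance constructor, slot 0 holds `this`, so for Point slot 1 is the
first parameter and slot 2 the second. Joining the slot-to-field map with the
field reported for each accessor gives the parameter-to-accessor matching.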

With a single constructor we don't *need* anything else to say the data can
be serialised. This bypasses requirement 2, "make serialization explicit",
but an annotation could be added.

Step 4. The PEP Proof of Concept (because PEP sounds better than EPP)

https://github.com/litterat/pep-java

The proof of concept is designed (still being implemented) to provide five
of the six mechanisms (obviously dtor is missing) for Object to Object[]
embedded projection pairs. It also implements the simple check for
invariance based on the above.  The general usage being:

  // Create an instance object to be projected.
  Point p1 = new Point(1,2);

  // Create a context and a descriptor for the target class.
  PepContext context = new PepContext();
  PepClassDescriptor pointDescriptor = context.getDescriptor(Point.class);

  // Extract the values to an array
  Object[] values = pointDescriptor.project(p1);

  // Create the object from the values
  Point p2 = pointDescriptor.embed(values);

The PepClassDescriptor is logically equivalent to the ObjectStreamClass of
Java serialization. The PepContext is equivalent to the ObjectStreamClass
static cache. I've kept it separated so there can be different data
projections for a class based on the type of communications encoding being
used.

The PepClassDescriptor currently includes the project and embed functions,
but they are logically different things and probably should be separate
implementations. As mentioned previously, I'll likely just use the
metadata as part of the serialization implementation and re-implement
without the intermediary Object[] later.
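
One way the two functions might be split is sketched here, using
java.lang.invoke method handles (available on Java 11). Projector, Embedder
and HandleSketch are hypothetical names for this example and are not the
pep-java API; the real PepClassDescriptor may be structured quite differently.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int x() { return x; }
    public int y() { return y; }
}

// Projection only: one accessor handle per component, in order.
final class Projector {
    private final MethodHandle[] accessors;

    Projector(MethodHandle... accessors) {
        this.accessors = accessors;
    }

    Object[] project(Object target) {
        try {
            Object[] values = new Object[accessors.length];
            for (int i = 0; i < accessors.length; i++) {
                values[i] = accessors[i].invoke(target);
            }
            return values;
        } catch (Throwable t) {
            throw new IllegalStateException("projection failed", t);
        }
    }
}

// Embedding only: a single constructor handle.
final class Embedder {
    private final MethodHandle constructor;

    Embedder(MethodHandle constructor) {
        this.constructor = constructor;
    }

    Object embed(Object[] values) {
        try {
            return constructor.invokeWithArguments(values);
        } catch (Throwable t) {
            throw new IllegalStateException("embedding failed", t);
        }
    }
}

// Wires up the handles for Point by hand; a library would discover them.
final class HandleSketch {
    static Projector pointProjector() {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodType intGetter = MethodType.methodType(int.class);
            return new Projector(
                lookup.findVirtual(Point.class, "x", intGetter),
                lookup.findVirtual(Point.class, "y", intGetter));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    static Embedder pointEmbedder() {
        try {
            MethodType ctor = MethodType.methodType(void.class, int.class, int.class);
            return new Embedder(MethodHandles.lookup().findConstructor(Point.class, ctor));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Keeping the two halves apart means a serializer that only writes needs just
the accessor handles, and one that only reads needs just the constructor.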

The documentation provided in the project README is currently the design
document, so if anyone has time/interest to read that I'd be interested in
feedback.

Step 5. Implementation

There's still plenty of work to be done on the proof of concept, but the
general idea feels like it will work well being separate to the
serialization library itself. I still think there will be a need for the
serialization library to add additional annotations which are encoding
specific, but, if the language can provide the "Data" metadata, half the
job is done.

The implementation syntax is likely to change a bit as I get to understand
how the library interacts with the actual serialization library. I can also
see some or all the concepts being part of the platform eventually. A week
ago I was skeptical that a useful separation could be achieved. I can
envisage a reflection api something like:

   DataDescriptor data = object.getClass().getDataDescriptor();
   DataFields[] fields = data.fields();

Anyway, back to the implementation. Thanks for the discussion and direction
so far, it has clearly helped.

Regards,
David.