Implementing Towards Better PEP/Serialization

Sun Aug 16 12:58:52 UTC 2020

Hi David,
here is my take on this
  https://gist.github.com/forax/85303febe5c6bebda4ec7dd2fb51a9e2 

The TL;DR: if a class has a deconstructor (and a reconstructor), that is enough to serialize/deserialize it automatically, without all the quirks of the Serialization API.

I prefer your original idea that a record class mirrors the schema so a record instance is conceptually equivalent to a list of pairs of key/values.

This is how it works: the idea is that converting a class to a record is enough to serialize/deserialize it.
To do so, a user should provide a deconstructor (instance -> record) and one or more reconstructors (record -> instance);
you can have more than one reconstructor if several versions (several records) were used previously.
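
To make the versioning point concrete, here is a minimal sketch with one deconstructor and two reconstructors. The names LabeledPoint, PointV1 and PointV2 are hypothetical and only serve the illustration, they are not in the gist.

```java
public class VersionedPoint {
  // Hypothetical schema records: PointV1 is an older format, PointV2 the current one.
  record PointV1(int x, int y) { }
  record PointV2(int x, int y, String label) { }

  static final class LabeledPoint {
    int x;
    int y;
    String label;

    LabeledPoint(int x, int y, String label) {
      this.x = x;
      this.y = y;
      this.label = label;
    }

    // Deconstructor: always emits the current schema.
    PointV2 deconstructor() {
      return new PointV2(x, y, label);
    }

    // One reconstructor per schema version that may still be found in old data.
    static LabeledPoint reconstructor(PointV2 data) {
      return new LabeledPoint(data.x(), data.y(), data.label());
    }

    static LabeledPoint reconstructor(PointV1 data) {
      // The old schema had no label, so an empty default is used here (an assumption).
      return new LabeledPoint(data.x(), data.y(), "");
    }
  }

  public static void main(String[] args) {
    var fromOld = LabeledPoint.reconstructor(new PointV1(1, 2));
    var fromNew = LabeledPoint.reconstructor(new LabeledPoint(3, 4, "a").deconstructor());
    System.out.println(fromOld.x + "," + fromOld.y + "," + fromOld.label);
    System.out.println(fromNew.x + "," + fromNew.y + "," + fromNew.label);
  }
}
```

The overload on the record type is what selects the version, so old data keeps a typed entry point into the class.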

> - Data not objects: “applications use serialization to persist data, or to
> exchange data with other applications. Not objects; data.” And "We should
> seek to support serialization of data, not objects."

yep, any instance is converted to a record instance (data) before being serialized and vice versa.

> 
> - Make serialisation explicit and bring serialization into the object
> model:  "If a class has an externally accessible
> construction/deconstruction/access protocol, it should be easy to just say
> 'use that for serialization too' (records do this already.)"

I've used an annotation @Marshall to explicitly indicate that a class uses "data serialization" instead of the classical serialization.
The annotation also specifies the record class used to convert an instance to data and the record classes used to convert data back to an instance of the class.

Specifying the record classes also allows the Marshaller to find the deconstructor/reconstructor directly (the record class gives the signature of the method)
without using reflection to list all possible methods. I've chosen to make a reconstructor a static method instead of a constructor because
it's a little cleaner (by example you )
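
As a rough sketch of that direct lookup (not the code of the gist), the record class is enough to build the MethodType and ask a Lookup for the exact method. The class LookupSketch and the helpers findDeconstructor, findReconstructor and roundTrip are hypothetical; the method names "deconstructor" and "reconstructor" follow the example further down.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class LookupSketch {
  record Point(int x, int y) { }

  static class MutablePoint {
    int x;
    int y;

    MutablePoint(int x, int y) {
      this.x = x;
      this.y = y;
    }

    private Point deconstructor() {
      return new Point(x, y);
    }

    private static MutablePoint reconstructor(Point point) {
      return new MutablePoint(point.x(), point.y());
    }
  }

  // The record class is the return type of the deconstructor: () -> recordType.
  static MethodHandle findDeconstructor(MethodHandles.Lookup lookup, Class<?> type, Class<?> recordType)
      throws ReflectiveOperationException {
    return lookup.findVirtual(type, "deconstructor", MethodType.methodType(recordType));
  }

  // The record class is the parameter type of the reconstructor: (recordType) -> type.
  static MethodHandle findReconstructor(MethodHandles.Lookup lookup, Class<?> type, Class<?> recordType)
      throws ReflectiveOperationException {
    return lookup.findStatic(type, "reconstructor", MethodType.methodType(type, recordType));
  }

  // Converts a MutablePoint to a Point and back using only the method handles.
  static int[] roundTrip(int x, int y) {
    try {
      // This lookup has private access to MutablePoint because both live in the same top-level class.
      var lookup = MethodHandles.lookup();
      var dtor = findDeconstructor(lookup, MutablePoint.class, Point.class);
      var rtor = findReconstructor(lookup, MutablePoint.class, Point.class);
      Point data = (Point) dtor.invoke(new MutablePoint(x, y));
      MutablePoint copy = (MutablePoint) rtor.invoke(data);
      return new int[] { copy.x, copy.y };
    } catch (Throwable t) {
      throw new IllegalStateException(t);
    }
  }

  public static void main(String[] args) {
    int[] result = roundTrip(1, 2);
    System.out.println(result[0] + "," + result[1]);
  }
}
```

No method listing is needed: a wrong record class simply makes the lookup fail with a NoSuchMethodException.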

> 
> - Use the "front door" api, or let the developer specify "back door" api
> and secure it:  "A class should be able to expose a serialization protocol
> without exposing that as part of the public API."

The front door is the Marshaller, the backdoors are the deconstructor/reconstructor methods.

> 
> - Move away from "readObject/writeObject" mechanisms. This encodes the
> form of the data in code instead of a form that can be used for automated
> serialization and schema definitions.

Here readObject/writeObject are implemented on top of the conversion between an instance and its data (and vice versa), so it can be easily extended to JSON, etc.
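
To illustrate the JSON remark, here is a minimal sketch (not part of the gist) that walks the record components of the data to emit JSON. RecordJson and toJson are hypothetical names, and the encoder is deliberately naive: it assumes flat records with numbers, booleans or strings and does no string escaping.

```java
import java.lang.reflect.RecordComponent;
import java.util.StringJoiner;

public class RecordJson {
  record Point(int x, int y) { }

  // Encodes a flat record as a JSON object, one member per record component.
  static String toJson(Record data) {
    var joiner = new StringJoiner(", ", "{", "}");
    try {
      for (RecordComponent component : data.getClass().getRecordComponents()) {
        var accessor = component.getAccessor();
        accessor.setAccessible(true);
        Object value = accessor.invoke(data);
        // Strings are quoted (without escaping, this is only a sketch), other values use toString.
        String text = (value instanceof CharSequence) ? "\"" + value + "\"" : String.valueOf(value);
        joiner.add("\"" + component.getName() + "\": " + text);
      }
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException(e);
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    System.out.println(toJson(new Point(1, 2)));  // prints {"x": 1, "y": 2}
  }
}
```

Because the record is the schema, the encoder never has to look at the original class.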

> 
> - Backward compatible. My requirement, not yours. It should be possible to
> put together an implementation in Java 11 at minimum.

oops, my implementation uses records, so it is not Java 11 compatible. It's Java 14/Java 15.

Here is an example. Let's suppose I have a class MutablePoint
  class MutablePoint {
    int x;
    int y;

    public MutablePoint(int x, int y) {
      this.x = x;
      this.y = y;
    }
  }

If I want to serialize it, I will first add a record containing the data I want to serialize
  record Point(int x, int y) implements Serializable { }

then I will add an annotation @Marshall and specify to use the record class for the marshalling/unmarshalling
  @Marshall(deconstruct = Point.class, reconstructs = Point.class)
  class MutablePoint {
    ...

and I will add a deconstructor and a reconstructor to indicate how to convert from a MutablePoint to a Point (and vice versa)

  @Marshall(deconstruct = Point.class, reconstructs = Point.class)
  static class MutablePoint {
    int x;
    int y;

    public MutablePoint(int x, int y) {
      this.x = x;
      this.y = y;
    }

    private Point deconstructor() {
      return new Point(x, y);   // convert to data
    }
    private static MutablePoint reconstructor(Point point) {
      return new MutablePoint(point.x(), point.y());   // extract from data through the record accessors
    }
  }

That's all, I can now serialize/deserialize the class MutablePoint
by first creating a Marshaller (the Lookup object is used to find the reconstructor/deconstructor)
    var marshaller = Marshaller.of(lookup());

to serialize
    marshaller.writeObject(...ObjectOutputStream..., mutablePoint);

to deserialize
    var mutablePoint2 = (MutablePoint) marshaller.readObject(... ObjectInputStream ...);
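
For readers without the gist at hand, the round trip can be approximated with plain java.io by calling the deconstructor and reconstructor by hand, which is roughly the work delegated to the Marshaller. This is only a sketch under that assumption; RoundTripSketch, write and read are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class RoundTripSketch {
  record Point(int x, int y) implements Serializable { }

  static class MutablePoint {
    int x;
    int y;

    MutablePoint(int x, int y) {
      this.x = x;
      this.y = y;
    }

    private Point deconstructor() {
      return new Point(x, y);
    }

    private static MutablePoint reconstructor(Point point) {
      return new MutablePoint(point.x(), point.y());
    }
  }

  // Only the record (the data) goes on the wire, never the MutablePoint itself.
  static byte[] write(MutablePoint mutablePoint) {
    var bytes = new ByteArrayOutputStream();
    try (var output = new ObjectOutputStream(bytes)) {
      output.writeObject(mutablePoint.deconstructor());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Reads the record back and rebuilds the instance through the reconstructor.
  static MutablePoint read(byte[] bytes) {
    try (var input = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
      return MutablePoint.reconstructor((Point) input.readObject());
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    var copy = read(write(new MutablePoint(1, 2)));
    System.out.println(copy.x + "," + copy.y);
  }
}
```

Note that MutablePoint itself does not implement Serializable, only the record does.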

regards,
Rémi

----- Original Message -----
> From: "David Ryan" <david at livemedia.com.au>
> To: "Brian Goetz" <brian.goetz at oracle.com>
> Cc: "amber-dev" <amber-dev at openjdk.java.net>
> Sent: Sunday, 16 August 2020 12:23:42
> Subject: Re: Implementing Towards Better PEP/Serialization

> I'm back a bit sooner than I expected. I've had a busy week digging into
> the concepts and developing a proof of concept. It would be great to get
> some feedback on the direction I'm heading and how it aligns or doesn't
> align with yours. Before I discuss the POC, I wanted to go back over the
> requirements and how I arrived at the current proof of concept. It should
> provide the rationale and give you or anyone else a chance to poke holes in
> it.
> 
> Re-stating the summarized requirements based on elements of our discussion
> and the original proposal.
> 
> - Data not objects: “applications use serialization to persist data, or to
> exchange data with other applications. Not objects; data.” And "We should
> seek to support serialization of data, not objects."
> 
> - Make serialisation explicit and bring serialization into the object
> model:  "If a class has an externally accessible
> construction/deconstruction/access protocol, it should be easy to just say
> 'use that for serialization too' (records do this already.)"
> 
> - Use the "front door" api, or let the developer specify "back door" api
> and secure it:  "A class should be able to expose a serialization protocol
> without exposing that as part of the public API."
> 
> - Move away from "readObject/writeObject" mechanisms. This encodes the
> form of the data in code instead of a form that can be used for automated
> serialization and schema definitions.
> 
> - Backward compatible. My requirement, not yours. It should be possible to
> put together an implementation in Java 11 at minimum.
> 
> Step 1: Define Data.
> 
> Based on those requirements and much more detailed discussion in your
> proposal, the pattern matching constructor/destruction pattern was
> proposed. A reason you're drawn to the ctor/dtor mechanism is that it forms
> an embedding-projection pair. Would it be safe to say that based on this
> and reading between the lines, that what we're after is in fact an
> embedding-projection pair between an Object and Data?
> 
>   e: Data -> Object/Class
>   p: Object/Class -> Data
> 
> If you accept that as the core requirement, then before we go much further
> we better decide what "Data" means. For the purposes of Java language
> design, "Data" is not the byte stream encoding. As the proposal states, and
> I agree, "the stream format is probably the least interesting part of the
> serialization mechanism". Based on this, "Data" is defined as something
> between the encoding and the object. However, if we use your ctor/dtor
> mechanism as the example, then "Data" is the parameter list tuple. We can
> then make a small leap and say that "Data" in Java is an Object[] of values
> and the associated metadata (types, names and order). Once again, reading
> between the lines of what "better serialization" means, I think it is an
> embedded projection pair of:
> 
>  e: Object[] -> Object
>     (Metadata)  (Class)
> 
>  p: Object -> Object[]
>     (Class)   (metadata)
> 
> Note: It's the metadata associated with the Object[] that is really what
> we're after. A serialization protocol could bypass the Object[] in-memory
> representation altogether. The Object[] is just the simplest way to
> represent the tuple in java for the rest of the discussion. The Java
> serialization implementation has the metadata we're talking about
> implemented in the ObjectStreamClass and the ObjectStreamField. So,
> potentially, an aim is to create a better ObjectStreamClass that can be
> used by serialization libraries without the magic it currently contains?
> 
> From this embedded projection pair between Object and Object[] a
> serialization library can come along and add the encoding:
> 
>  Serialization: Object  -> Object[]   -> encoding
>                 (class)    (metadata)    (schema)
> 
>  Deserialization: encoding -> Object[]   -> Object.
>                   (schema)    (metadata)    (class)
> 
> On this basis, I've changed the title to "Implementing Towards Better  PEP
> (Projection-Embedded Pairs)", as that's the key concept that can help
> Serialization. There's another side discussion to be had regarding what if
> any restriction could be placed on the elements of the Object[]. I don't
> think it matters, but a class could project data elements in the array that
> can't be serialized.
> 
> Step 2: Possible embedded projection pairs.
> 
> Now that I've been shown the embedded projection pair hammer, everything
> looks like a nail. :) So, using the embedded projection pairs between
> Object[] and Object, what mechanisms can be found to implement it using
> front door APIs:
> 
> Class specified:
>  Constructor:Destructor - An n-arg constructor with n-arg destructor. The
> proposal suggests using this pattern, but it is not available yet.
>  Constructor/Accessors - Available, but potentially difficult to match
> parameters from constructor with accessors.
>  Setters/Getters - Simple, but requires no-args constructor and
> immutability of objects is where a lot of developers are moving.
>  Encapsulated projection - The class has an alternative form and provides
> constructor and accessor for the alternate form. The alternate form
> recursively uses another mechanism listed here. Requires something to
> inform if data is encapsulated or the encapsulation is the data.
> 
> Externally specified:
>  Encapsulated embedding - An external class extracts and embeds a target
> class, with the target class not having defined a direct embedded
> projection pair.
>  Intermediate ep-pair - A third class that provides both projection and
> embedding functions between two other classes.
> 
> There are variations on the above with factory classes and facades etc., but
> they generally can be fit into those categories.
> 
> Step 3: Solve the Constructor/Accessors parameter matching
> 
>>> As I mentioned previously, I've hit a bit of a road-block with my design
> as I'll need a fallback solution for users on Java 11 (current target
> version). An annotation on the ctor or its parameters is the likely
> solution:
> 
>> Yeah,  that’s ugly but doable.  Your users won’t like you.
> 
> This comment rolled around in my head for a little while, so I looked
> closer at the problem. In many cases the classes we're talking about that
> have immutable fields have the following form:
> 
>   public class Point {
>      private final int x;
>      private final int y;
> 
>      public Point( int x, int y ) {
>         this.x = x;
>         this.y = y;
>      }
> 
>      public int x() { return x; }
>      public int y() { return y; }
>   }
> 
> It is pretty clear from our perspective that the constructor parameters
> match up with x,y fields and x,y accessors. However, without the names
> available in the class, reflection doesn't help. If we can prove that
> constructor parameters are invariant before being written to the field, we
> can safely match the constructor to the fields/accessors. So doing some
> deep reflection, we can implement a really simple checking for invariance
> by finding the following patterns and extracting the parameter and field id.
> 
>       4: aload_0
>       5: iload_1
>       6: putfield      #14                 // Field x:I
> 
> This can then be matched up with the accessor:
> 
>       0: aload_0
>       1: getfield      #14                 // Field x:I
>       4: ireturn
> 
> Based on the above, the class can have its "Data" made available with
> no additional annotations. I'm sure there's plenty of research and
> implementations for testing for parameter invariance that could be applied
> here.
> 
> With a single constructor we don't *need* anything else to say the data can
> be serialised. This bypasses requirement 2, "make serialization explicit",
> but an annotation could be added.
> 
> Step 4. The PEP Proof of Concept (because PEP sounds better than EPP)
> 
> https://github.com/litterat/pep-java
> 
> The proof of concept is designed (still being implemented) to provide five
> of the six mechanisms (obviously dtor is missing) for Object to Object[]
> embedded projection pairs. It also implements the simple check for
> invariance based on above.  The general usage being:
> 
>  // Create an instance object to be projected.
>  Point p1 = new Point(1,2);
> 
>  // Create a context and a descriptor for the target class.
>  PepContext context = new PepContext();
>  PepClassDescriptor pointDescriptor = context.getDescriptor(Point.class);
> 
>  // Extract the values to an array
>  Object[] values = pointDescriptor.project(p1);
> 
>  // Create the object from the values
>  Point p2 = pointDescriptor.embed(values);
> 
> 
> The PepClassDescriptor is logically equivalent to the ObjectStreamClass of
> Java serialization. The PepContext is equivalent to the ObjectStreamClass
> static cache. I've kept it separated so there can be different data
> projections for a class based on the type of communications encoding being
> used.
> 
> The PepClassDescriptor currently includes the project and embed functions,
> but they are logically different things and probably should be separate
> implementations.  As mentioned previously, I'll likely just use the meta
> data as part of the serialization implementation and re-implement without
> the intermediary Object[] later.
> 
> The documentation provided in the project README is currently the design
> document, so if anyone has time/interest to read that I'd be interested in
> feedback.
> 
> Step 5. Implementation
> 
> There's still plenty of work to be done on the proof of concept, but the
> general idea feels like it will work well being separate to the
> serialization library itself. I still think there will be a need for the
> serialization library to add additional annotations which are encoding
> specific, but, if the language can provide the "Data" metadata, half the
> job is done.
> 
> The implementation syntax is likely to change a bit as I get to understand
> how the library interacts with the actual serialization library. I can also
> see some or all the concepts being part of the platform eventually. A week
> ago I was skeptical that a useful separation could be achieved. I can
> envisage a reflection api something like:
> 
>   DataDescriptor data = object.getClass().getDataDescriptor();
>   DataFields[] fields = data.fields();
> 
> Anyway, back to the implementation. Thanks for the discussion and direction
> so far, it has clearly helped.
> 
> Regards,
> David.