Implementing Towards Better PEP/Serialization
David Ryan
david at livemedia.com.au
Mon Aug 17 11:31:51 UTC 2020
Hi Brian & Remi,
Thanks for the feedback; both replies have been really useful and led me to a
bit of an aha moment! So apologies for another long message. From your
emails, two things jumped out at me, and the reason will become clearer
later. From Remi...
> I prefer your original idea that a record class mirrors the schema so a
record instance is conceptually equivalent to a list of pairs of key/values.
And from Brian...
> To clarify: this is mostly a philosophical statement; data is “dead”, and
objects are “live”. Reconstructing a live object through some sort of
star-trek transporter device is cool, but more than what 99.999% of
applications need, and costs a lot more. Data can be put in a fedex box.
Also, for the rest of the message I'm going to try to be more specific
about conversions instead of using project/embed. It saves me from being
confused about which direction projects and which embeds. :)
1. The mechanics
I was thinking about the POCs that I put together and the different
mechanisms they use. I've mentioned a few of them previously, but I'll
review.
1.1 Mapping - Converts from Object to Data.
In the POC, I converted the Object to Object[], however, this could just as
easily have been a Map. Implementation of going from Object to Object[]...
Object[] toArray( Object object ) {
    DataMetadata metadata = getMetadata( object.getClass() );
    Object data = convertToData.invoke(object);
    Object[] output = new Object[ metadata.fields().length ];
    for ( int x = 0; x < metadata.fields().length; x++ ) {
        DataField field = metadata.fields()[x];
        output[x] = field.accessor.invoke(data);
    }
    return output;
}
And as toMap function...
Map<String,Object> toMap( Object object ) {
    DataMetadata metadata = getMetadata( object.getClass() );
    Object data = convertToData.invoke(object);
    Map<String,Object> output = new HashMap<>();
    for ( int x = 0; x < metadata.fields().length; x++ ) {
        DataField field = metadata.fields()[x];
        output.put( field.name(), field.accessor.invoke(data) );
    }
    return output;
}
The same design can also be used to write to Serialization encoding, a DOM,
or any other data representation. The point is, this isn't really about
"Towards better serialization", it is about going from "live" objects to
"dead" data structures.
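To make the toMap shape above concrete, here is a small self-contained sketch of the same idea using plain reflection. The Point class and the accessor-name-driven toMap helper are illustrative only, not the POC's actual code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal, runnable version of the toMap idea: pull "dead" data out of a
// "live" object by invoking named accessors. Point is an example class.
public class MapperSketch {
    public static class Point {
        private final int x, y;
        public Point(int x, int y) { this.x = x; this.y = y; }
        public int x() { return x; }
        public int y() { return y; }
    }

    // Extract the data by invoking each named no-arg accessor in order.
    public static Map<String, Object> toMap(Object obj, String... fields) throws Exception {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String f : fields) {
            out.put(f, obj.getClass().getMethod(f).invoke(obj));
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toMap(new Point(3, 4), "x", "y")); // {x=3, y=4}
    }
}
```

The same loop could just as easily feed a serialization encoder or a DOM instead of a Map.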
1.2 Metadata - Information about the data.
The mapper requires information about the data. Specifically,
- How to extract the Data from the object with a possible conversion
function (see 1.3)
- The fields metadata in the Data (types, names, accessors) and a way to
access it.
- A way to construct the Data object and set values before passing to
convert to object function.
- A cache of the configuration (this is just for performance if it takes
time to get the metadata or has specific extra configurations)
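These requirements could be captured in a small pair of interfaces. This is only a sketch; DataField and DataMetadata are hypothetical names mirroring the POC code above:

```java
import java.lang.invoke.MethodHandle;

// Sketch of the metadata shape the mapper needs. These interfaces are
// hypothetical; the POC's DataMetadata/DataField play the same role.
public class MetadataSketch {
    public interface DataField {
        String name();           // field name in the data form
        Class<?> type();         // declared type of the field
        MethodHandle accessor(); // reads the field from a data instance
    }

    public interface DataMetadata {
        Class<?> dataClass();           // the "dead" data representation
        DataField[] fields();           // ordered field descriptions
        MethodHandle convertToData();   // live object -> data
        MethodHandle convertToObject(); // data -> live object
    }

    // A trivial record implementation for illustration.
    public record SimpleField(String name, Class<?> type, MethodHandle accessor)
            implements DataField {}

    public static void main(String[] args) {
        DataField f = new SimpleField("x", int.class, null);
        System.out.println(f.name() + ":" + f.type()); // x:int
    }
}
```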
1.3 Convert to Data - Extract the Data from the Object.
In the last message I noted the following embedding-projection pairs for
converting between the object and the data class (once again,
projection/embedding renamed to try to be more explicit). I've separated
the embedding-projection pair list into conversion and
constructor/destructor, which will also become clear later.
Encapsulated produces data - The class has an alternative form and provides
constructor and accessor for the alternate form. The alternate form being a
data object.
Encapsulated produces "live" object - A data class that extracts and
produces from a target class, with the target class not having defined a
direct way to produce data.
Intermediate ep-pair - A third class that provides both projection and
embedding functions between two other classes (one Data and one live).
Identity - Not in the last message, but this is where the target class is
also the data class. With this added, a conversion function can always be
called.
The interesting property is that all of these reduce down to a single
signature. In the mapping, a single MethodHandle can be called easily.
convertToData: DataClass toData( TargetClass )
convertToObject: TargetClass toObject( DataClass )
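To illustrate that single-signature property, here is a minimal sketch showing how an encapsulated conversion and the identity case both collapse to the same MethodHandle shape. MutablePoint and PointData are made-up example classes:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Sketch: every conversion style reduces to one MethodHandle signature,
// so the mapper can always invoke "a conversion" without special cases.
public class ConversionSketch {
    public record PointData(int x, int y) {}

    static class MutablePoint {
        int x, y;
        MutablePoint(int x, int y) { this.x = x; this.y = y; }
        PointData toData() { return new PointData(x, y); }
    }

    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();

        // Encapsulated conversion: the class supplies its own toData method.
        MethodHandle toData = lookup.findVirtual(
                MutablePoint.class, "toData",
                MethodType.methodType(PointData.class));

        // Identity conversion: the target class already is the data class.
        MethodHandle identity = MethodHandles.identity(PointData.class);

        PointData d1 = (PointData) toData.invoke(new MutablePoint(1, 2));
        PointData d2 = (PointData) identity.invoke(new PointData(1, 2));
        System.out.println(d1.equals(d2)); // true
    }
}
```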
1.4 Accessors and Constructors
In the last message I noted the following embedded projection pairs of
extracting fields and creating objects:
Constructor:Destructor - An n-arg constructor with n-arg destructor. The
proposal suggests using this pattern, but it is not available yet.
Constructor/Accessors - Available, but potentially difficult to match
parameters from constructor with accessors.
Setters/Getters - Simple, but requires a no-args constructor and mutable
fields, while a lot of developers are moving towards immutable objects.
I left off the following:
Records - Similar to the immutable Constructor/Accessors, with the important
point that the platform provides the field metadata.
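That platform-provided metadata is already reachable today. A small runnable illustration using getRecordComponents():

```java
import java.lang.reflect.RecordComponent;

// The platform already exposes field metadata (names, types, accessors)
// for records via Class.getRecordComponents(). Point is an example record.
public class RecordMetadataSketch {
    public record Point(int x, int y) {}

    public static void main(String[] args) throws Exception {
        Point p = new Point(3, 4);
        for (RecordComponent rc : Point.class.getRecordComponents()) {
            System.out.println(rc.getName() + " : " + rc.getType()
                    + " = " + rc.getAccessor().invoke(p));
        }
        // x : int = 3
        // y : int = 4
    }
}
```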
2. What is Data
I was looking at Remi's example code and it got me thinking again about
what we mean by data. He provided the sample...
record Point(int x, int y) implements Serializable { }

@Marshall(deconstruct = Point.class, reconstructs = Point.class)
static class MutablePoint {
    int x;
    int y;

    public MutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    private Point deconstructor() {
        return new Point(x, y);
    }

    private static MutablePoint reconstructor(Point point) {
        return new MutablePoint(point.x, point.y);
    }
}
While this was just an example, my initial reaction was that MutablePoint
is Data, so adding a destructor to create a record is superfluous in many
cases (although I realise that was not the point of the example). However,
the Record has all the properties we are after (accessors, type information
and field names). So this is where my aha! moment arrived. The Record is
the embodiment of "dead" data in Java.
Going back to the mapper function, if we use records, the platform provides
the metadata and accessors.
Object[] toArray( Object object ) {
    RecordComponent[] components = object.getClass().getRecordComponents();
    Object data = convertToData.invoke(object); // Something extra here.
    Object[] output = new Object[ components.length ];
    for ( int x = 0; x < components.length; x++ ) {
        RecordComponent component = components[x];
        output[x] = component.getAccessor().invoke(data);
    }
    return output;
}
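The reverse direction also falls out of the record metadata: the component types identify the canonical constructor, so the Object[] can be turned back into a record. A sketch only, with caching and error handling omitted:

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.RecordComponent;
import java.util.Arrays;

// Rebuild a record from the "dead" Object[] by looking up the canonical
// constructor from the record components. Point is an example record.
public class RecordRebuildSketch {
    public record Point(int x, int y) {}

    public static <T> T fromArray(Class<T> recordClass, Object[] values) throws Exception {
        // The canonical constructor's parameter types are exactly the
        // record component types, in declaration order.
        Class<?>[] paramTypes = Arrays.stream(recordClass.getRecordComponents())
                .map(RecordComponent::getType)
                .toArray(Class<?>[]::new);
        Constructor<T> canonical = recordClass.getDeclaredConstructor(paramTypes);
        return canonical.newInstance(values);
    }

    public static void main(String[] args) throws Exception {
        Point p = fromArray(Point.class, new Object[] { 3, 4 });
        System.out.println(p); // Point[x=3, y=4]
    }
}
```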
However, we have these other Data classes that are currently not Records.
POJO (setters/getters), hand coded Immutables (constructor/accessors) and
proposed (constructor/destructor). But maybe we could make these look like
Records?
3. Proposal
Instead of talking about @Serializer/@Deserializer or @Marshall, could this
be reframed as @Data? For example:
@Data( version=1 )
class MutablePoint {
    private int x;
    private int y;

    ... getters/setters ...
}
This informs the compiler that this can be treated like a record for
reflection. It could potentially create a synthetic constructor. Reflection
getRecordComponents() would return x and y as RecordComponents with the
accessor() returning the getX()/getY() methods. Similarly, an immutable
implementation is already a Record in all but name:
@Data
class ImmutablePoint {
    private final int x;
    private final int y;

    ImmutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    ... accessors ...
}
By telling the compiler that this is in fact @Data, the compiler can
provide the getRecordComponents() metadata.
So now there are Data classes and non-Data classes. As mentioned, there's
potential for non-Data classes to produce Data classes. A class that
produces/consumes a Record might look like:
record Point(int x, int y) { }

class MutablePoint {
    int x;
    int y;

    public MutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Data( externalizes=Point.class )
    public MutablePoint( Point point ) {
        this.x = point.x();
        this.y = point.y();
    }

    private Point toPoint() {
        return new Point(x, y);
    }
}
I was tempted to use @Record for the example, but that has potentially
other associations. Also, I don't want to get caught up in whether an
annotation is the right way to represent this; the key to the proposal is
the following points:
- What we're working towards is not really about serialization, it is
about ep-pairs between Data ("dead" objects) and "live" objects.
- A Record is currently the closest thing in Java to Data ("dead" objects).
- The metadata required for serialization is the same as what
getRecordComponents provides, so why re-invent the capability.
- Unfortunately Data is not restricted to Record classes; however, if we
can duck type a class to a Record, we've got the information we need for
serialization and plenty of other purposes.
- Live objects that can't be duck typed to a Record can optionally provide
ep-pairs to export Data classes.
- Many serialization libraries already ignore Serializable, but a Data
class would be useful across different encodings and serialization
libraries.
As a final thought, a Record could potentially be a special case of a Data
class, with Class.getRecordComponents actually being
Class.getDataComponents. A developer could extend/implement Data to tell
the compiler to provide getDataComponents via reflection and a synthetic
constructor if not already present. That leaves Records as they are but
lets other classes look like Data too.
Regards,
David.
On Mon, Aug 17, 2020 at 12:28 AM Brian Goetz <brian.goetz at oracle.com> wrote:
> > Re-stating the summarized requirements based on elements of our
> discussion and the original proposal.
> >
> > - Data not objects: “applications use serialization to persist data, or
> to exchange data with other applications. Not objects; data.” And "We
> should seek to support serialization of data, not objects.”
>
> To clarify: this is mostly a philosophical statement; data is “dead”, and
> objects are “live”. Reconstructing a live object through some sort of
> star-trek transporter device is cool, but more than what 99.999% of
> applications need, and costs a lot more. Data can be put in a fedex box.
>
> > Step 1: Define Data.
> >
> > Based on those requirements and much more detailed discussion in your
> proposal, the pattern matching constructor/destruction pattern was
> proposed. A reason you're drawn to the ctor/dtor mechanism is that it forms
> an embedding-projection pair. Would it be safe to say that based on this
> and reading between the lines, that what we're after is in fact an
> embedding-projection pair between an Object and Data?
>
> > e: Data -> Object/Class
> > p: Object/Class -> Data
>
> I always get confused about which direction is embedding and which is
> projection, so I always have to work it out from first principles :) You
> project from the “big” domain to the small one (The 2D shadow of a 3D
> object is a projection), and embed from the small domain into the big one.
> The big domain may have values that don’t work in the small domain, but not
> vice versa. On that basis, the object domain is really the small one, so
> you’ve got e and p backwards.
>
> Motivating example: Rational and (int num, int denom). Every valid
> rational can be embedded in (int, int), and you can take the resulting
> value and map it back to Rational. But you can’t take an arbitrary (int,
> int) and map it to Rational; if denom=0, that doesn’t fit. So (int, int)
> is the big domain here, and we embed the object into the data. But this is
> just terminology; it doesn’t affect your point.
>
> (It sounds weird because we’re used to thinking about objects as being
> richer, but in terms of information content, they’re not.)
>
> > If you accept that as the core requirement, then before we go much
> further we better decide what "Data" means. For the purposes of Java
> language design, "Data" is not the byte stream encoding. As the proposal
> states, and I agree, "the stream format is probably the least interesting
> part of the serialization mechanism". Based on this, "Data" is defined as
> something between the encoding and the object. However, if we use your
> ctor/dtor mechanism as the example, then "Data" is the parameter list
> tuple. We can then make a small leap and say that "Data" in Java is an
> Object[] of values and the associated metadata (types, names and order).
> Once again, reading between the lines of what "better serialization" means,
> I think it is an embedded projection pair of:
>
> Using Object[] is a reasonable choice for a data representation. More
> formally, you’re likely to build some sort of tree; Object[] lets us build
> this tree in a more dynamically typed manner. A more statically typed
> schema might be:
>
> P = int | long | double | float | string // primitives
> D = P | record(D*)
>
> That is, you start with primitive representations of Java’s numeric
> primitives and strings, and then introduce a combinator that lets you
> define tuples of things you know to be data. So
>
> 3
> (3, 3)
> (3, (3, 3))
> …
>
> can all be described by this format.
>
> > Note: It's the metadata associated with the Object[] that is really what
> we're after. A serialization protocol could bypass the Object[] in-memory
> representation altogether. The Object[] is just the simplest way to
> represent the tuple in java for the rest of the discussion. The Java
> serialization implementation has the metadata we're talking about
> implemented in the ObjectStreamClass and the ObjectStreamField. So,
> potentially, an aim is to create a better ObjectStreamClass that can be
> used by serialization libraries without that magic it currently contains?
>
> One obvious option is to stick the metadata into the Object[] itself, such
> as a leading element that says “the next three elements are a Foo”.
> Another is to attach the metadata to nominal record types as in the more
> structured approach.
>
> > On this basis, I've changed the title to "Implementing Towards Better
> PEP (Projection-Embedded Pairs)", as that's the key concept that can help
> Serialization. There's another side discussion to be had regarding what if
> any restriction could be placed on the elements of the Object[]. I don't
> think it matters, but a class could project data elements in the array that
> can't be serialized.
>
> Well, p-e pairs are more general than serialization. For example, you can
> think of covariant overrides as applying a p-e pair to the return value
> (and you could do the reverse contravariantly for parameters.) If you
> squint, you’ll see this in action in the behavior of
> MethodHandles::asType. If we had p-e pairs as a primitive, then we can
> build serialization on top of that. But so many other things too.
>
> > Step 2: Possible embedded projection pairs.
> >
> > Now that I've been shown the embedded projection pair hammer, everything
> looks like a nail. :)
>
> Yep :)
>
> > So, using the embedded projection pairs between Object[] and Object,
> what mechanisms can be found to implement it using front door APIs:
> >
> > Class specified:
> > Constructor:Destructor - An n-arg constructor with n-arg destructor.
> The proposal suggests using this pattern, but it is not available yet.
> > Constructor/Accessors - Available, but potentially difficult to match
> parameters from constructor with accessors.
> > Setters/Getters - Simple, but requires no-args constructor and
> immutability of objects is where a lot of developers are moving.
> > Encapsulated projection - The class has an alternative form and
> provides constructor and accessor for the alternate form. The alternate
> form recursively uses another mechanism listed here. Requires something to
> inform if data is encapsulated or the encapsulation is the data.
> >
> > Externally specified:
> > Encapsulated embedding - An external class extracts and embeds a
> target class, with the target class not having defined a direct embedded
> projection pair.
> > Intermediate ep-pair - A third class that provides both projection and
> embedding functions between two other classes.
> >
> > There's variations on the above with factory classes and facades etc,
> but they generally can be fit into those categories.
>
> This seems to cover most of the landscape. And you only need one. The
> challenge is that one size probably does not fit all. Your last category
> is a good observation; there are many classes which provide enough access
> to their state to be serializable, but are not, in fact, serializable.
> Being able to “bring your own schema” is a useful move. Again, PE pairs
> offer a nice framework for representing this; if I can define a projection
> to a domain that is serializable, and an embedding back, I’m good. If my
> almost-serializable class is C, then this is:
>
> e: C -> X
> p: X -> C
>
> where X is some form known to be serializable (like a record.) It is a
> nice bonus that C need not know about e, p, or X.
>
> Legacy serialization attempts to project objects into the legacy stream
> format, but unfortunately the embedding is defective; if we have a bad
> stream, we don’t detect this, we just make potentially bad objects. Going
> back through the constructor allows us to avoid this defect.
>
> > This comment rolled around in my head for a little while, so I looked
> closer at the problem. In many cases the classes we're talking about that
> have immutable fields have the following form:
> >
> > public class Point {
> > private final int x;
> > private final int y;
> >
> > public Point( int x, int y ) {
> > this.x = x;
> > this.y = y;
> > }
> >
> > public int x() { return x; }
> > public int y() { return y; }
> > }
> >
> > It is pretty clear from our perspective that the constructor parameters
> match up with x,y fields and x,y accessors. However, without the names
> available in the class, reflection doesn't help. If we can prove that
> constructor parameters are invariant before being written to the field, we
> can safely match the constructor to the fields/accessors. So doing some
> deep reflection, we can implement a really simple checking for invariance
> by finding the following patterns and extracting the parameter and field id.
>
> … and without some sort of signal from the author, guessing that these
> names describe the same thing is a bit of a leap of faith. This is
> something a third-party serialization library could get away with, that the
> JDK could not.
>
>
> Cheers,
> -Brian
>
>
>