Implementing Towards Better PEP/Serialization

Sun Aug 16 14:28:37 UTC 2020

> Re-stating the summarized requirements based on elements of our discussion and the original proposal.
> 
>  - Data not objects: “applications use serialization to persist data, or to exchange data with other applications. Not objects; data.” And "We should seek to support serialization of data, not objects.”

To clarify: this is mostly a philosophical statement; data is “dead”, and objects are “live”.  Reconstructing a live object through some sort of Star Trek transporter device is cool, but it is more than what 99.999% of applications need, and it costs a lot more.  Data can be put in a FedEx box.

> Step 1: Define Data.
> 
> Based on those requirements and much more detailed discussion in your proposal, the pattern matching constructor/destruction pattern was proposed. A reason you're drawn to the ctor/dtor mechanism is that it forms an embedding-projection pair. Would it be safe to say that based on this and reading between the lines, that what we're after is in fact an embedding-projection pair between an Object and Data?

>    e: Data -> Object/Class
>    p: Object/Class -> Data

I always get confused about which direction is embedding and which is projection, so I always have to work it out from first principles :)  You project from the “big” domain to the small one (The 2D shadow of a 3D object is a projection), and embed from the small domain into the big one.  The big domain may have values that don’t work in the small domain, but not vice versa.  On that basis, the object domain is really the small one, so you’ve got e and p backwards.  

Motivating example: Rational and (int num, int denom).  Every valid rational can be embedded in (int, int), and you can take the resulting value and map it back to Rational.  But you can’t take an arbitrary (int, int) and map it to Rational; if denom=0, that doesn’t fit.  So (int, int) is the big domain here, and we embed the object into the data.  But this is just terminology; it doesn’t affect your point.  
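
To make the Rational example concrete, here is a minimal sketch.  The class shape and the names embed() and project() are mine, purely for illustration; nothing here is a proposed API.  The point is that embed() is total, while project() has to be able to say no.

```java
import java.util.Optional;

// Illustrative only: Rational is the "small" domain, (int, int) the "big" one.
final class Rational {
    private final int num;
    private final int denom;

    Rational(int num, int denom) {
        // The constructor is where the invariant lives.
        if (denom == 0) throw new IllegalArgumentException("denom must be non-zero");
        this.num = num;
        this.denom = denom;
    }

    int num() { return num; }
    int denom() { return denom; }

    // Embedding: every valid Rational maps to some (int, int).
    int[] embed() { return new int[] { num, denom }; }

    // Projection: partial; (int, int) values that do not fit are rejected.
    static Optional<Rational> project(int[] data) {
        if (data == null || data.length != 2 || data[1] == 0) return Optional.empty();
        return Optional.of(new Rational(data[0], data[1]));
    }
}
```

Round-tripping a Rational through embed() and then project() gives back an equivalent value; going the other way from an arbitrary (int, int) may give nothing.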

(It sounds weird because we’re used to thinking about objects as being richer, but in terms of information content, they’re not.)  

> If you accept that as the core requirement, then before we go much further we better decide what "Data" means. For the purposes of Java language design, "Data" is not the byte stream encoding. As the proposal states, and I agree, "the stream format is probably the least interesting part of the  serialization mechanism". Based on this, "Data" is defined as something between the encoding and the object. However, if we use your ctor/dtor mechanism as the example, then "Data" is the parameter list tuple. We can then make a small leap and say that "Data" in Java is an Object[] of values and the associated metadata (types, names and order). Once again, reading between the lines of what "better serialization" means, I think it is and embedded projection pair of:

Using Object[] is a reasonable choice for a data representation.  More formally, you’re likely to build some sort of tree; Object[] lets us build this tree in a more dynamically typed manner.  A more statically typed schema might be:

    P = int | long | double | float | string      // primitives
    D = P | record(D*)

That is, you start with primitive representations of Java’s numeric primitives and strings, and then introduce a combinator that lets you define tuples of things you know to be data.  So

    3
    (3, 3)
    (3, (3, 3))
    …

can all be described by this format.  
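
One way this schema might look in Java, as a rough sketch assuming sealed interfaces and records (Java 17 or later).  The names Data, IntVal, Rec, show() and so on are invented for this example.

```java
import java.util.List;
import java.util.stream.Collectors;

// D = P | record(D*), with one record per primitive kind.
sealed interface Data {
    String show();

    record IntVal(int value) implements Data {
        public String show() { return Integer.toString(value); }
    }
    record LongVal(long value) implements Data {
        public String show() { return Long.toString(value); }
    }
    record DoubleVal(double value) implements Data {
        public String show() { return Double.toString(value); }
    }
    record FloatVal(float value) implements Data {
        public String show() { return Float.toString(value); }
    }
    record StringVal(String value) implements Data {
        public String show() { return "\"" + value + "\""; }
    }
    // The combinator: a tuple of things already known to be data.
    record Rec(List<Data> fields) implements Data {
        public Rec { fields = List.copyOf(fields); } // defensive, immutable copy
        public String show() {
            return fields.stream().map(Data::show).collect(Collectors.joining(", ", "(", ")"));
        }
    }
}
```

With this, the value (3, (3, 3)) is new Data.Rec(List.of(three, new Data.Rec(List.of(three, three)))), where three is new Data.IntVal(3), and show() renders it back in the tuple notation used above.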

> Note: It's the metadata associated with the Object[] that is really what we're after. A serialization protocol could bypass the Object[] in-memory representation altogether. The Object[] is just the simplest way to represent the tuple in java for the rest of the discussion. The Java serialization implementation has the metadata we're talking about implemented in the ObjectStreamClass and the ObjectStreamField. So, potentially, an aim is to create a better ObjectStreamClass that can be used by serialization libraries without that magic it currently contains?

One obvious option is to stick the metadata into the Object[] itself, such as a leading element that says “the next three elements are a Foo”.  Another is to attach the metadata to nominal record types as in the more structured approach.  
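
The first option could be sketched roughly like this.  Header, tag() and fieldsOf() are hypothetical names, and a real design would carry more metadata (component types and names) than a type name and a count.

```java
import java.util.Arrays;

final class TaggedArrays {
    // Leading element: "the next `arity` elements are a `typeName`".
    record Header(String typeName, int arity) {}

    // Prefix the field values with a header describing them.
    static Object[] tag(String typeName, Object... fields) {
        Object[] out = new Object[fields.length + 1];
        out[0] = new Header(typeName, fields.length);
        System.arraycopy(fields, 0, out, 1, fields.length);
        return out;
    }

    // Check the header against what the reader expects, then strip it.
    static Object[] fieldsOf(Object[] tagged, String expectedType) {
        if (tagged.length == 0 || !(tagged[0] instanceof Header)) {
            throw new IllegalArgumentException("missing header");
        }
        Header h = (Header) tagged[0];
        if (!h.typeName().equals(expectedType) || h.arity() != tagged.length - 1) {
            throw new IllegalArgumentException("unexpected shape: " + h);
        }
        return Arrays.copyOfRange(tagged, 1, tagged.length);
    }
}
```

The second option corresponds to nominal record types carrying the same information statically, so no header element is needed.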

> On this basis, I've changed the title to "Implementing Towards Better  PEP (Projection-Embedded Pairs)", as that's the key concept that can help Serialization. There's another side discussion to be had regarding what if any restriction could be placed on the elements of the Object[]. I don't think it matters, but a class could project data elements in the array that can't be serialized.

Well, p-e pairs are more general than serialization.  For example, you can think of covariant overrides as applying a p-e pair to the return value (and you could do the reverse contravariantly for parameters.)  If you squint, you’ll see this in action in the behavior of MethodHandles::asType.  If we had p-e pairs as a primitive, then we could build serialization on top of that.  But so could many other things.

> Step 2: Possible embedded projection pairs.
> 
> Now that I've been shown the embedded projection pair hammer, everything looks like a nail. :)

Yep :)

> So, using the embedded projection pairs between Object[] and Object, what mechanisms can be found to implement it using front door APIs:
> 
> Class specified:
>   Constructor:Destructor - An n-arg constructor with n-arg destructor. The proposal suggests using this pattern, but it is not available yet.
>   Constructor/Accessors - Available, but potentially difficult to match parameters from constructor with accessors.
>   Setters/Getters - Simple, but requires no-args constructor and immutability of objects is where a lot of developers are moving.
>   Encapsulated projection - The class has an alternative form and provides constructor and accessor for the alternate form. The alternate form recursively uses another mechanism listed here. Requires something to inform if data is encapsulated or the encapsulation is the data.
> 
> Externally specified:
>   Encapsulated embedding - An external class extracts and embeds a target class, with the target class not having defined a direct embedded projection pair.
>   Intermediate ep-pair - A third class that provides both projection and embedding functions between two other classes.
> 
> There's variations on the above with factory classes and facades etc, but they generally can be fit into those categories.

This seems to cover most of the landscape.  And you only need one.  The challenge is that one size probably does not fit all.  Your last category is a good observation; there are many classes which provide enough access to their state to be serializable, but are not, in fact, serializable.  Being able to “bring your own schema” is a useful move.  Again, p-e pairs offer a nice framework for representing this; if I can define an embedding into a domain that is serializable, and a projection back, I’m good.  If my almost-serializable class is C, then this is:

    e: C -> X
    p: X -> C

where X is some form known to be serializable (like a record.)  It is a nice bonus that C need not know about e, p, or X.  

Legacy serialization attempts to embed objects into the legacy stream format, but unfortunately the projection back is defective; if we have a bad stream, we don’t detect this, we just make potentially bad objects.  Going back through the constructor allows us to avoid this defect.
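
A small sketch of the “bring your own schema” case, with invented names throughout: Interval stands in for C, IntervalData for X, and IntervalPair holds e and p.  Because p goes through Interval's constructor, data that violates the invariant is rejected instead of silently becoming a bad object.

```java
// Stands in for C: exposes its state, knows nothing about serialization.
final class Interval {
    private final int lo;
    private final int hi;

    Interval(int lo, int hi) {
        if (lo > hi) throw new IllegalArgumentException("lo > hi");
        this.lo = lo;
        this.hi = hi;
    }

    int lo() { return lo; }
    int hi() { return hi; }
}

// Stands in for X: a plain carrier assumed to be serializable by some framework.
record IntervalData(int lo, int hi) {}

// The externally specified pair; neither Interval nor IntervalData refers to it.
final class IntervalPair {
    // e: C -> X, total
    static IntervalData embed(Interval c) {
        return new IntervalData(c.lo(), c.hi());
    }

    // p: X -> C, goes back through the constructor, so invalid data fails here
    static Interval project(IntervalData x) {
        return new Interval(x.lo(), x.hi());
    }
}
```

Here project(new IntervalData(5, 1)) throws IllegalArgumentException, which is the behavior legacy deserialization lacks.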

> This comment rolled around in my head for a little while, so I looked closer at the problem. In many cases the classes we're talking about that have immutable fields have the following form:
> 
>    public class Point {
>       private final int x;
>       private final int y;
> 
>       public Point( int x, int y ) {
>          this.x = x;
>          this.y = y;
>       }
> 
>       public int x() { return x; }
>       public int y() { return y; }
>    }
> 
> It is pretty clear from our perspective that the constructor parameters match up with x,y fields and x,y accessors. However, without the names available in the class, reflection doesn't help. If we can prove that constructor parameters are invariant before being written to the field, we can safely match the constructor to the fields/accessors. So doing some deep reflection, we can implement a really simple checking for invariance by finding the following patterns and extracting the parameter and field id.

… and without some sort of signal from the author, guessing that these names describe the same thing is a bit of a leap of faith.  This is something a third-party serialization library could get away with, that the JDK could not.  
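
For illustration only, one shape such a signal from the author could take is an annotation on constructor parameters naming the matching accessor.  @ReadBy and Destructure are hypothetical names invented for this sketch; they are not a JDK or proposed API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.lang.reflect.Parameter;

// Hypothetical author-supplied signal: "this parameter is read back by that accessor".
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.PARAMETER)
@interface ReadBy { String value(); }

final class Point {
    private final int x;
    private final int y;

    Point(@ReadBy("x") int x, @ReadBy("y") int y) { this.x = x; this.y = y; }

    public int x() { return x; }
    public int y() { return y; }
}

final class Destructure {
    // Returns the constructor arguments that would rebuild o, in parameter order.
    static Object[] toData(Object o) {
        for (Constructor<?> ctor : o.getClass().getDeclaredConstructors()) {
            Object[] out = readAll(o, ctor.getParameters());
            if (out != null) return out;
        }
        throw new IllegalArgumentException("no fully annotated constructor on " + o.getClass());
    }

    // Null means "this constructor is not fully annotated"; try the next one.
    private static Object[] readAll(Object o, Parameter[] params) {
        if (params.length == 0) return null;
        Object[] out = new Object[params.length];
        for (int i = 0; i < params.length; i++) {
            ReadBy signal = params[i].getAnnotation(ReadBy.class);
            if (signal == null) return null;
            try {
                Method accessor = o.getClass().getDeclaredMethod(signal.value());
                accessor.setAccessible(true);
                out[i] = accessor.invoke(o);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("cannot read " + signal.value(), e);
            }
        }
        return out;
    }
}
```

With the annotation present, the match between parameter and accessor is declared by the author rather than inferred from names or bytecode patterns.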

Cheers,
-Brian