Exploring Record Serialization [ was: Records: migration compatibility ]

Wed Jul 24 13:36:50 UTC 2019

TL;DR: being able to refactor class-to-record and record-to-class is a
very attractive property, which we should explore further in the
context of serialization.

----

Current state of Serialization in amber/amber ( July 2019 ).

- Middle-of-the-road position on records and Serialization.
- Auto generate a `readResolve()` method that pipes the record's state
   through the canonical constructor.
- Advantages: constructor validation checks.
- Disadvantages: the deserialization process always creates two record
   instances; may leak a "bad" record through back references in the
   serial stream; brittle / fragile.
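
To make the current approach concrete, here is a minimal sketch using a
record-like class as a stand-in for a record. The `Range` class, the
`CONSTRUCTED` counter and the `roundTrip` helper are all made up for
illustration, and the hand-written `readResolve()` only approximates
what the prototype would auto-generate.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.concurrent.atomic.AtomicInteger;

final class ReadResolveSketch {
    // Counts constructor invocations (illustration only).
    static final AtomicInteger CONSTRUCTED = new AtomicInteger();

    // Hypothetical record-like stand-in for: record Range(int lo, int hi)
    static final class Range implements Serializable {
        private static final long serialVersionUID = 1L;
        private final int lo;
        private final int hi;

        Range(int lo, int hi) {
            if (lo > hi) throw new IllegalArgumentException(lo + " > " + hi);
            this.lo = lo;
            this.hi = hi;
            CONSTRUCTED.incrementAndGet();
        }

        int lo() { return lo; }
        int hi() { return hi; }

        // Assumed shape of the auto-generated method: 'this' is the
        // first, field-stuffed instance (no constructor ran for it);
        // the returned object is the second, validated instance.
        private Object readResolve() {
            return new Range(lo, hi);
        }
    }

    static Range roundTrip(Range r) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(r);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Range) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Range copy = roundTrip(new Range(1, 5));
        System.out.println(copy.lo() + ".." + copy.hi()
                + ", constructor calls: " + CONSTRUCTED.get());
    }
}
```

The constructor is counted once for the original and once inside
`readResolve()`; the intermediate field-stuffed instance never passes
through it, which is the instance that could leak via a back reference.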

Doubling down on the current approach.

A record's serial-form should be that of its state descriptor. Prohibit
customization of this. Retain the auto generated `readResolve()`, but
also prohibit specifying other Serialization magic methods.
Specifically, prohibit:
  1. explicit `readResolve` / `writeReplace`
  2. explicit `readObject` / `readObjectNoData` / `writeObject`
  3. explicit `serialPersistentFields`
The canonical constructor defends against "bad" data for both the
front-door and back-door APIs.
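
A rough sketch of the "back-door" defence, again with a record-like
class standing in for a record. `Percent`, `forgeNegativeValue` and the
other helpers are hypothetical; the forging step assumes the single
`int` field occupies the last four bytes of the stream, which holds for
a class with no `writeObject` method.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class BackDoorSketch {
    // Hypothetical record-like stand-in for: record Percent(int value)
    static final class Percent implements Serializable {
        private static final long serialVersionUID = 1L;
        private final int value;

        Percent(int value) {
            if (value < 0 || value > 100) {
                throw new IllegalArgumentException("out of range: " + value);
            }
            this.value = value;
        }

        int value() { return value; }

        // Pipes the stream state through the validating constructor.
        private Object readResolve() {
            return new Percent(value);
        }
    }

    static byte[] serialize(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns the deserialized object, or the exception that was thrown.
    static Object deserialize(byte[] stream) {
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(stream))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException | RuntimeException e) {
            return e;
        }
    }

    // Simulates a hostile stream: overwrite the trailing int with -1.
    static byte[] forgeNegativeValue(byte[] stream) {
        byte[] forged = stream.clone();
        for (int i = forged.length - 4; i < forged.length; i++) {
            forged[i] = (byte) 0xFF;
        }
        return forged;
    }

    public static void main(String[] args) {
        byte[] good = serialize(new Percent(42));
        System.out.println("honest stream: " + deserialize(good));
        System.out.println("forged stream: " + deserialize(forgeNegativeValue(good)));
    }
}
```

The same check that rejects `new Percent(-1)` at the front door also
rejects the forged stream, but only because nothing else in the class
gets a chance to interfere, hence the prohibitions listed above.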

But can we do better? Should records be a first class citizen in the
Serialization Protocol? ( spoiler: possibly, but probably not )

We could update the Java Object Serialization Specification to provide
explicit support for records (in a similar(ish) way to what was done
for the Serialization of Enum Constants).

Direct support within the Object Serialization Stream Protocol avoids
the brittleness of the auto-generated magic methods above, along with
any possible user interaction with, or bypass of, them in the
code/implementation. It ensures that construction is always, and only,
performed through the constructor.
The serial format of a record could be the record descriptor and the
record's state. Possible format:

   record-marker record-class record-descriptor field field field ...

Advantages: Simple and clean, less fragile, prevents leaking a "bad"
   record through a back reference in the stream.
Disadvantages: new format incompatible with pre-record releases
   ( stream failure ), need to consider compatible record evolution
   strategy ( N-1 problem ) - putting records in the stream protocol
   requires this issue to be given serious consideration now
   ( crystal ball! ).

We need to have an evolution story if we're going to put records as a
first class citizen in the Serialization protocol. And that story is
somewhat dependent on the general evolution of records.

The N-1 problem : It should be possible for JDK N-1 to deserialize an
object graph that was serialized with JDK N.

The Serialization specification goes to great lengths to specify how
Serializable classes can be compatibly evolved. That said, there are
pitfalls everywhere, and it is incredibly difficult to guarantee that
evolving a Serializable class has been done safely.

Consider another relatively recent addition to the Serialization
protocol: Enum constants. Surprisingly, their serial format is not all
that sympathetic to evolution. Enum constants have an effective format
of `Enum class + string value`. During deserialization,
`Enum.valueOf(class, String value)` is invoked to retrieve the actual
Enum constant.
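
A small sketch of that effective format. The `Unit` enum and the
helper names are made up for illustration; the check simply looks for
the class name and the constant name in the raw stream bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class EnumSerialFormSketch {
    // Hypothetical enum, used only for this illustration.
    enum Unit { SECONDS, MINUTES }

    static byte[] serialize(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // ISO-8859-1 maps each byte to one char, so a plain substring
    // search is enough to spot ASCII names in the stream.
    static boolean streamMentions(byte[] stream, String text) {
        return new String(stream, StandardCharsets.ISO_8859_1).contains(text);
    }

    public static void main(String[] args) {
        byte[] stream = serialize(Unit.MINUTES);
        System.out.println("class name in stream:    "
                + streamMentions(stream, Unit.class.getName()));
        System.out.println("constant name in stream: "
                + streamMentions(stream, "MINUTES"));
        // The lookup the reading side performs with those two pieces.
        System.out.println("lookup finds constant:   "
                + (Enum.valueOf(Unit.class, "MINUTES") == Unit.MINUTES));
    }
}
```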

JLS 13.4.26. Evolution of Enums: _"Adding or reordering constants in an
enum type will not break compatibility with pre-existing binaries"_.
Take 'adding' for example: `java.util.concurrent.TimeUnit` was
introduced in Java 1.5, and a few constants were added in Java 1.6,
e.g. _MINUTES_.

Ok. If a TimeUnit is part of a class's serial-form then, depending on
its actual value, N-1 compatibility may be broken. For example, a
fictional `Timeout(long value, TimeUnit unit) implements Serializable`,
serialized with Java 1.6 when the unit is _MINUTES_, will fail to
deserialize with Java 1.5. It fails with
`java.io.InvalidObjectException: enum constant MINUTES does not exist
in class java.util.concurrent.TimeUnit`,
`Caused by: java.lang.IllegalArgumentException: No enum const class
java.util.concurrent.TimeUnit.MINUTES`.
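
The failure can be approximated on a single runtime. This is only a
simulation, not a real N-1 test: the hypothetical `renameConstant`
helper patches the constant name in the stream to one the local `Unit`
enum does not declare, standing in for a constant added in a newer
release.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class EnumNMinusOneSketch {
    // Hypothetical enum playing the role of the "older" TimeUnit.
    enum Unit { SECONDS, MINUTES }

    static byte[] serialize(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns the deserialized object, or the exception that was thrown.
    static Object deserialize(byte[] stream) {
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(stream))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException | RuntimeException e) {
            return e;
        }
    }

    // Overwrites the last occurrence of 'from' with the same-length 'to'.
    static byte[] renameConstant(byte[] stream, String from, String to) {
        if (from.length() != to.length()) {
            throw new IllegalArgumentException("names must have equal length");
        }
        int at = new String(stream, StandardCharsets.ISO_8859_1).lastIndexOf(from);
        if (at < 0) {
            throw new IllegalArgumentException(from + " not found in stream");
        }
        byte[] patched = stream.clone();
        byte[] replacement = to.getBytes(StandardCharsets.ISO_8859_1);
        System.arraycopy(replacement, 0, patched, at, replacement.length);
        return patched;
    }

    public static void main(String[] args) {
        byte[] stream = serialize(Unit.MINUTES);
        System.out.println(deserialize(stream));
        // DECADES is unknown here, just as MINUTES was unknown to Java 1.5.
        System.out.println(deserialize(renameConstant(stream, "MINUTES", "DECADES")));
    }
}
```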

This is not great. A lot of care needs to be taken if an enum finds its
way into the serial-form of a class, since that enum may be evolved in
the future to contain additional values. While Enum constants have
direct support in the Serialization protocol, operationally
`ObjectInputStream` doesn't handle the N-1 case very gracefully.

Given this, and the myriad of other minefields that evolving a
Serializable class brings (too many to enumerate here), maybe we can
come up with a Serialization format and evolution policy for records,
that will be _no worse_ than that of Enums, or other existing aspects
(minefields) of Serialization compatibility.

Compatible Record Evolution & Migration

Brian has provided details in a prior post on this thread, but it seems
clear that the higher-order bit is migrating from a record-like class to
a record, and migrating from a record to a record-like class ( as
opposed to evolving a record itself ). Wouldn't it be nice if
serialization of these just worked across refactorings?

Given this, then maybe pushing records down into the serialization
format itself is not the way to go. Instead it should be possible to use
the existing standard serialization format to encode the record class
and its component names + values ( just like any other regular
Serializable class ). But rather than having the serialization framework
create the record instance followed by field stuffing, have it locate
the canonical constructor ( or best match constructor ) and invoke it
with the deserialized stream fields. "Best match" here needs a little
more prototyping to determine how best to allow for possible future
evolution of a record, while still being able to deserialize on an N-1
runtime.
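
One possible shape for this, as a sketch only. Everything here is an
assumption: the descriptor is passed in explicitly as an ordered
name-to-type map, stream fields are a plain map, and the "best match"
policy shown (ignore unknown stream fields, default missing ones) is
just one candidate among those that would need prototyping.

```java
import java.lang.reflect.Array;
import java.lang.reflect.Constructor;
import java.lang.reflect.InvocationTargetException;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class ConstructorDeserializationSketch {
    // Hypothetical record-like stand-in for: record Point(int x, int y)
    static final class Point {
        private final int x;
        private final int y;

        Point(int x, int y) {
            if (x < 0 || y < 0) throw new IllegalArgumentException("negative");
            this.x = x;
            this.y = y;
        }

        int x() { return x; }
        int y() { return y; }
    }

    // Hypothetical state descriptor for Point, in component order.
    static Map<String, Class<?>> pointDescriptor() {
        Map<String, Class<?>> d = new LinkedHashMap<>();
        d.put("x", int.class);
        d.put("y", int.class);
        return d;
    }

    // Default value of a type: 0/false for primitives, null otherwise.
    static Object defaultValue(Class<?> type) {
        return Array.get(Array.newInstance(type, 1), 0);
    }

    // Builds the instance through the constructor matching the
    // descriptor, never by allocating first and stuffing fields.
    static <T> T construct(Class<T> type, Map<String, Class<?>> descriptor,
                           Map<String, Object> streamFields) {
        Class<?>[] paramTypes = descriptor.values().toArray(new Class<?>[0]);
        Object[] args = new Object[paramTypes.length];
        int i = 0;
        for (Map.Entry<String, Class<?>> e : descriptor.entrySet()) {
            Object value = streamFields.get(e.getKey());
            args[i++] = (value != null) ? value : defaultValue(e.getValue());
        }
        try {
            Constructor<T> ctor = type.getDeclaredConstructor(paramTypes);
            ctor.setAccessible(true);
            return ctor.newInstance(args);
        } catch (InvocationTargetException e) {
            throw new IllegalArgumentException("constructor rejected stream data",
                    e.getCause());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("no matching constructor", e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new HashMap<>();
        fields.put("x", 3);
        fields.put("y", 4);
        fields.put("z", 9); // written by a newer version; ignored here
        Point p = construct(Point.class, pointDescriptor(), fields);
        System.out.println(p.x() + "," + p.y());
    }
}
```

Because the stream carries only names and values, the same bytes could
be read back by either a record or a record-like class, which is the
migration property we are after.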

Additionally, some level of prohibition or limitation could optionally
still be applied to the serialization magic methods, to preserve and
restrict, by default, the stream fields to that of just the record's
state. There are various options here ranging from an error/warning
during compilation, to the serialization framework specifying that it
effectively ignores these magic methods for records.
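
As a rough illustration of the "detect and warn" end of that range,
the sketch below scans a class reflectively for the magic members. The
`findMagicMembers` helper and both sample classes are hypothetical, and
matching by name alone is a simplification (a real check would also
look at signatures and modifiers).

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamField;
import java.io.Serializable;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class MagicMethodCheckSketch {
    private static final Set<String> MAGIC_METHODS = new HashSet<>(Arrays.asList(
            "readObject", "readObjectNoData", "writeObject",
            "readResolve", "writeReplace"));

    // Returns the sorted names of serialization magic members declared
    // directly on the given class (name match only).
    static List<String> findMagicMembers(Class<?> type) {
        List<String> found = new ArrayList<>();
        for (Method m : type.getDeclaredMethods()) {
            if (MAGIC_METHODS.contains(m.getName())) {
                found.add(m.getName());
            }
        }
        for (Field f : type.getDeclaredFields()) {
            if (f.getName().equals("serialPersistentFields")) {
                found.add(f.getName());
            }
        }
        Collections.sort(found);
        return found;
    }

    // Record-like class whose serial form is just its state.
    static final class Plain implements Serializable {
        private static final long serialVersionUID = 1L;
        private final int x;
        Plain(int x) { this.x = x; }
        int x() { return x; }
    }

    // Record-like class that customizes its serial form.
    static final class Customized implements Serializable {
        private static final long serialVersionUID = 1L;
        private static final ObjectStreamField[] serialPersistentFields = {};
        private final int x;
        Customized(int x) { this.x = x; }
        int x() { return x; }
        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
        }
        private Object readResolve() { return this; }
    }

    public static void main(String[] args) {
        System.out.println("Plain:      " + findMagicMembers(Plain.class));
        System.out.println("Customized: " + findMagicMembers(Customized.class));
    }
}
```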

-Chris.

On 23/07/2019 19:32, Brian Goetz wrote:
> In the course of exploring serialization support for records, Chris 
> asked about the compatible evolution modes for records.  We have 
> explored this briefly before but let's put this down in one place.
> 
> Since we are saying that records are a lot like enums, let's start with:
> 
>   A. Migrating a record-like class to a record
>   B. Migrating a record to a record-like class
> 
> (which is analogous to refactoring between an enum and a class using the 
> type-safe enum pattern.)
> 
> Migration A should be both source- and binary- compatible, provided the 
> original class has all the members the record would have -- ctor, dtor, 
> accessors.  Which in turn requires being able to declare the members, 
> including dtor, but we'll come back to that.
> 
> What about serialization compatibility?  It depends on our serialization 
> story (Chris will chime in with more here), but it's fair to note that 
> migrating from a TSE to an enum is not serialization compatible 
> either.
> 
> Migration B is slightly more problematic (for both records and enums), 
> as a record will extend Record just as enums extend Enum. Which means 
> casting to, or invoking Record methods on, a migrated record would 
> fail.  (Same is true for enums.)  Again, I'll leave it to Chris to fill 
> in the serialization compatibility story; we have a variety of possible 
> approaches there.
> 
> What about changing the descriptor of a record?
> 
>   C.  Removing components
>   D.  Reordering components
>   E.  Adding components
> 
> Removals of all sorts are generally not source- or binary- compatible; 
> removing components will cause public members to disappear and 
> constructors to change their signatures.  So we should have no 
> compatibility expectations of C.
> 
> D will cause the signature of the canonical ctor and dtor to change.  If 
> the types of the permuted components are different, it may be possible 
> for the author to explicitly implement the old ctor/dtor signature, so 
> that the existing set of members is preserved.  However, I think we 
> should describe this as not being a compatible migration, even if it is 
> possible (in some cases) to make up the difference.
> 
> E is like D, in that it is possible to add back the old ctor/dtor 
> implementations, and rescue existing callsites, but I think it should be 
> put in the same category.
> 
>