One final stab at improving lambda serialization

Wed Aug 21 19:25:39 PDT 2013

I agree. Looks like this proposal strikes just the right balance: it is not being unnecessarily (and unexpectedly) restrictive while avoiding "silent" failures. I am supportive of serialization behavior implemented as described below.

Thank you,
Vlad

The Goldman Sachs Group, Inc. All rights reserved.
See http://www.gs.com/disclaimer/global_email for important risk disclosures, conflicts of interest and other terms and conditions relating to this e-mail and your reliance on information contained in it.  This message may contain confidential or privileged information.  If you are not the intended recipient, please advise us immediately and delete this message.  See http://www.gs.com/disclaimer/email for further information on confidentiality and the risks of non-secure electronic communication.  If you cannot access these links, please notify us by reply message and we will send the contents to you.

-----Original Message-----
From: lambda-libs-spec-experts-bounces at openjdk.java.net [mailto:lambda-libs-spec-experts-bounces at openjdk.java.net] On Behalf Of Doug Lea
Sent: Monday, August 19, 2013 1:44 PM
To: Brian Goetz
Cc: lambda-libs-spec-experts at openjdk.java.net; lambda-spec-experts at openjdk.java.net
Subject: Re: One final stab at improving lambda serialization

This is the sort of scheme I had in mind in my reply to David Lloyd
(that might have in part led to this effort). And the details seem
reasonable to me. Looks good to go, assuming no unexpected
implementation snags.

-Doug

On 08/19/2013 11:37 AM, Brian Goetz wrote:
> *Background*
>
> The fundamental challenge with serialization is that the code that defined a
> class at serialization time may have changed by the time deserialization
> happens.  Serialization is defined to be tolerant of change to a certain extent,
> and admits a degree of customization to allow additional flexibility.
>
> For ordinary classes, there are three lines of defense:
>
>   * serialVersionUID
>   * serialization hooks
>   * default schema evolution
>
> Serial version UID for the target class must match exactly.  By default,
> serialization uses a serial version UID which is a hash of the classes
> signatures.  So this default approach means "any significant change to the
> structure of the class (adding new methods, changing method or field signatures,
> etc) renders serialized forms invalid".  It is a common practice to explicitly
> assign a serial version UID to a class, thereby disabling this mechanism.
>
> Classes that expect to evolve over time may use readObject/writeObject and/or
> readResolve/writeReplace to customize the mapping between object state and
> bytestream.  If classes do not use this mechanism, serialization uses a default
> schema evolution mechanism to adjust for changes in fields between serialization
> and deserialization time; fields that are present in the bytestream but not in
> the target class are ignored, and fields that are present in the target class
> but not the bytestream get default values (zero, null, etc.)
>
> Anonymous classes follow the same approach and have access to the same
> mechanisms (serialVersionUID, read/writeObject, etc), but they have two
> additional sources of instability:
>
>   * The name is generated as EnclosingClass$nnn.  Any change to the set of
>     anonymous classes in the enclosing class may cause sequence numbers to change.
>   * The number and type of fields (appears in bytecode but not source code) are
>     generated based on the set of captured values. Any change to the set or
>     order of captured values can cause these signatures to change (in an
>     unspecified way).
>
> If the signatures remain stable, anonymous classes can use serialization hooks
> to customize the serialized form, just like named classes.
>
> The EG has observed that users have largely learned to deal with the problems of
> serialization of inner classes, either by (a) don't do it, or (b) ensure that
> essentially the same bits are present on both sides of the pipe, preventing skew
> from causing instability in either class names or signatures.
>
> The EG has set, as a minimum bar, that lambda serialization be "at least as good
> as" anonymous class serialization.  (This is not a high bar.)  Further, the EG
> has concluded that gratuitous deviations from anonymous class serialization are
> undesirable, because, if users have to deal with an imperfect scheme, having
> them deal with something that is basically the same as an imperfect scheme
> they've already gotten used to is preferable to dealing with a new and
> different  scheme.
>
> Further, the EG has rejected the idea of arbitrarily restricting access to
> serialization just because it is dangerous; users who have learned to use it
> safely should not be unduly encumbered.
>
> *Failure modes
> *
>
> For anonymous classes, one of two things will happen when attempting to
> deserialize after things have changed "too much":
>
>  1. A deserialization failure due to either the name or signature not matching,
>     resulting in NoSuchMethodError, IncompatibleClassChangeError, etc.
>  2. Deserializing to the wrong thing, without any evidence of error.
>
> Obviously, a type-2 failure is far worse than a type-1 failure, because no error
> is raised and an unintended computation is performed.  Here are two examples of
> changes that are behaviorally compatible but which will result in type-2
> failures.  The first has to do with order-of-declaration.
>
> *Old code**
> *     *New code**
> *     *Result**
> *
> Runnable r1 = new Runnable() {
>      void run() {
>          System.out.println("one");
>      }
> };
> Runnable r2 = new Runnable() {
>      void run() {
>          System.out.println("two");
>      }
> };
>       Runnable r2 = new Runnable() {
>      void run() {
>          System.out.println("two");
>      }
> };
> Runnable r1 = new Runnable() {
>      void run() {
>          System.out.println("one");
>      }
> };
>       Deserialized r1 (across skew) prints "two".
>
> This fails because in both cases, we get classes called Foo$1 and Foo$2, but in
> the old code, these correspond to r1 and r2, but in the new code, these
> correspond to r2 and r1.
>
> The other failure has to do with order-of-capture.
>
> *Old code**
> *     *New code**
> *     *Result**
> *
> String s1 = "foo";
> String s2 = "bar";
> Runnable r = new Runnable() {
>      void run() {
> foo(s1, s2);
>      }
> };
>
>       String s1 = "foo";
> String s2 = "bar";
> Runnable r = new Runnable() {
>      void run() {
>          String s = s2;
> foo(s1, s);
>      }
> };
>       On deserialization, s1 and s2 are effectively swapped.
>
> This fails because the order of arguments in the implicitly generated
> constructor of the inner class changes due to the order in which the compiler
> encounters captured variables.  If the reordered variables were of different
> types, this would cause a type-1 failure, but if they are the same type, it
> causes a type-2 failure.
>
> *User expectations*
>
> While experienced users are quick to state the "same bits on both sides" rule
> for reliable deserialization, a bit of investigation reveals that user
> expectations are actually higher than that.  For example, if the compiler
> generated a /random/ name for each lambda at compile time, then recompiling the
> same source with the same compiler, and using the result for deserialization,
> would fail.  This is too restrictive; user expectations are not tied to "same
> bits", but to a vaguer notion of "I compiled essentially the same source with
> essentially the same compiler, and therefore didn't change anything
> significant."  For example, users would balk if adding a comment or changing
> whitespace were to affect deserialization.  Users likely expect (in part, due to
> behavior of anonymous classes) changes to code that doesn't affect the lambda
> directly or indirectly (e.g., add or remove a debugging println) also would not
> affect the serialized form.
>
> In the absence of the user being able to explicitly name the lambda /and/ its
> captures (as C++ does), there is no perfect solution.  Instead, our goal can
> only be to minimize type-2 failures while not unduly creating type-1 failures
> when "no significant code change" happened.  This means we have to put a stake
> in the ground as to what constitutes "significant" code change.
>
> The de-facto (and likely accidental) definition of "significant" used by inner
> classes here is:
>
>   * Adding, removing, or reordering inner class instances earlier in the source
>     file;
>   * Changes to the number, order, or type of captured arguments
>
> This permits changes to code that has nothing to do with inner classes, and many
> common refactorings as long as they do not affect the order of inner class
> instances or their captures.
>
> *Current Lambda behavior*
>
> Lambda serialization currently behaves very similarly to anonymous class
> serialization.  Where anonymous classes have stable method names but unstable
> class names, lambdas are the dual; unstable method names but stable class
> names.  But since both are used together, the resulting naming stability is
> largely the same.
>
> We do one thing to increase naming stability for lambdas: we hash the name and
> signature of the enclosing method in the lambda name. This insulates lambda
> naming from the addition, removal, or reordering of methods within a class file,
> but naming stability remains sensitive to the order of lambdas within the
> method. Similarly, order-of-capture issues are largely similar to inner classes.
>
> Lambdas bodies are desugared to methods named in the following form:
> lambda$/mmm/$/nnn/, where /mmm/ is a hash of the method name and signature, and
> /nnn/ is a sequence number of lambdas that have the same /mmm/ hash.
>
> Because lambdas are instantiated via invokedynamic rather than invoking a
> constructor directly, there is also slightly more leniency to changes to the
> /types/ of captured argument; changing a captured argument from, say, String to
> Object, would be a breaking change for anonymous classes (it changes the
> constructor signature) but not for lambdas.  This leniency is largely an
> accidental artifact of translation, rather than a deliberate design decision.
>
> *Possible improvements*
>
> We can start by recognizing the role of the hash of the enclosing method in the
> lambda method name.  This reduces the set of lambdas that could collide from
> "all the lambdas in the file" to "all the lambdas in the method."  This reduces
> the set of changes that cause both type-1 and type-2 errors.
>
> An additional observation is that there is a tension between trying to /recover
> from/ skew (rather than simply trying to detect it, and failing deserialization)
> and complexity.  So I think we should focus primarily on detecting skew and
> failing deserialization (turning type-2 failures into type-1) while at the same
> time not unduly increasing the set of changes that cause type-1 errors, with the
> goal of settling on an informal guideline of what constitutes "too much" change.
>
> We can do this by increasing the number of things that affect the /mmm/ hash,
> effectively constructing the lambda-equivalent of the serialization version
> UID.  The more context we add to this hash, the smaller the set of lambdas that
> hash to the same bucket gets, which reduces the space of possible collisions.
> The following table shows possible candidates for inclusion, along with examples
> of code that illustrate dependence on this item.
>
> *Item**
> *     *Old Code**
> ------------------------------
> *     *New Code**
> **------------------------------*
>       *Effect**
> *     *Rationale**
> *
> Names of captured arguments
>       int x = ...
> f(() -> x);
>       int y = ...
> f(() -> y);   Including the names of captured arguments in the hash would cause
> rename-refactors of captured arguments to be considered a serialization-breaking
> change.
>       While alpha-renaming is generally considered to be semantic-preserving,
> serialization has always keyed off of names (such as field names) as being clues
> to developer intent.  It seems reasonable to say "If you change the names
> involved, we have to assume a semantic change occurred."  We cannot tell if a
> name change is a simple alpha-rename or capturing a completely different
> variable, so this is erring on the safe side.
> Types of captured arguments
>       String x = ...
> f(() -> x);   Object x = ...
> f(() -> x);
>       It seems reasonable to say that, if you capture arguments of a different type,
> you've made a semantic change.
> Order of captured arguments
>       () -> {
>      int a = f(x);
>      int b = g(y);
>      return h(a,b);
> };
>       () -> {
>      int b = g(y);
>      int a = f(x);
>      return h(a,b);
> };    Changing the order of capture would become a type-1 failure rather than
> possibly a type-2 failure.
>       Since we cannot detect whether the ordering change is semantically meaningful
> or not, it is best to be conservative and say: change to capture order is likely
> a semantic change.
> Variable assignment target (if present)
>       Runnable r1 = Foo::f;
> Runnable r2 = Foo::g;
>       Runnable r2 = Foo::g;
> Runnable r1 = Foo::f;
>
>       Including variable target name would render this reordering recoverable and correct
>       If the user has gone to the effort of providing a name, we can use this as a
> hint to the meaning of the lambda.
>
>       Runnable r = Foo::f;    Runnable runnable = Foo::f;     Including variable target
> name would render this change (previously recoverable and correct) a
> deserialiation failure
>       If the user has changed the name, it seems reasonable to treat that as possibly
> meaning something else.
> Target type
>       Predicate<String> p = String::isEmpty;
>       Function<String, Boolean> p = String::isEmpty;  Including target type reduces
> the space of potential sequence number collisions.
>       If you've changed the target type, it is a different lambda.
>
> This list is not exhaustive, and there are others we might consider.  (For
> example, for lambdas that appear in method invocation context rather than
> assignment context, we might include the hash of the invoked method name and
> signature, or even the parameter index or name.  This is where it starts to
> exhibit diminishing returns and increasing brittleness.)
>
> Taken in total, the effect is:
>
>   * All order-of-capture issues become type-1 failures, rather than type-2
>     failures (modulo hash collisions).
>   * Order of declaration issues are still present, but they are dramatically
>     reduced, turning many type-2 failures into type-1 failures.
>   * Some new type-1 failures are introduced, mostly those deriving from
>     rename-refactors.
>
> The remaining type-2 failures could be dealt with if we added named lambdas in
> the future.  (They are also prevented if users always assign lambdas to local
> variables whose names are unique within the method; in this way, the
> local-variable trick becomes a sort of poor-man's named lambda.)
>
> We can reduce the probability of collision further by using a different (and
> simpler) scheme for non-serializable lambdas (lambda$nnn), so that serializable
> lambdas can only accidentally collide with each other.
>
> However, there are some transformations which we will still not be able to avoid
> under this scheme.  For example:
>
> *Old code**
> *     *New code**
> *     *Result**
> *
> Supplier<Integer> s =
> foo ? () -> 1
>          : () -> 2;
>       Supplier<Integer> s =
> !foo ? () -> 2
>           : () -> 1;  This change is behaviorally compatible but could result in
> type-2 failure, since both lambdas have the same target type, capture arity, etc.
>
> However^2, we can still detect this risk and warn the user.  If for any /mmm/,
> we issue more than one sequence number /nnn/, we are at risk for a type-2
> failure, and can issue a lint warning in that case, suggesting the user refactor
> to something more stable.  (Who knows what that diagnostic message will look
> like.) With all the hash information above, it seems likely that the number of
> potentially colliding lambdas will be small enough that this warning would not
> come along too often.
>
> The impact of this change in the implementation is surprisingly small.  It does
> not affect the serialized form (java.lang.invoke.SerializedLambda), or the
> generated deserialization code ($deserialize$).  It only affects the code which
> generates the lambda method name, which needs access to a small additional bit
> of information -- the assignment target name.  Similarly, detecting the
> condition required for warning is easy -- "sequence number != 1".
>
> Qualitatively, the result is still similar in feel to inner classes -- you can
> make "irrelevant" changes but we make no heroic attempts to recover from things
> like changes in capture order -- but we do a better job of detecting them (and,
> if you follow some coding discipline, you can avoid them entirely.)
>
>