One final stab at improving lambda serialization

Tue Aug 20 04:04:44 PDT 2013

Simple and strict seem like a good approach here.
On local variable names, developers can at least easily see there was
a change with a local variable name, so I don't think that will be a
big issue.
Stephen

On 19 August 2013 17:58, Paul Benedict <pbenedict at apache.org> wrote:
> Thanks Brian for the write-up.
>
> My only contention is the rationale for "Names of captured arguments". I
> disagree with the rationale. Changing local variable names should always be
> permissible and not affect serialization. Local variables are not
> considered "named fields" (you can't even reflect on them), and the
> since-forever expectation is that local variables are not part of the
> equation.
>
> I would be happy with either of these two changes:
> 1) Remove this item
> 2) Only consider the name if the captured variable is an instance variable.
>
> Paul
>
>
> On Mon, Aug 19, 2013 at 10:37 AM, Brian Goetz <brian.goetz at oracle.com> wrote:
>
>> *Background*
>>
>> The fundamental challenge with serialization is that the code that defined
>> a class at serialization time may have changed by the time deserialization
>> happens.  Serialization is defined to be tolerant of change to a certain
>> extent, and admits a degree of customization to allow additional
>> flexibility.
>>
>> For ordinary classes, there are three lines of defense:
>>
>>  * serialVersionUID
>>  * serialization hooks
>>  * default schema evolution
>>
>> Serial version UID for the target class must match exactly.  By default,
>> serialization uses a serial version UID which is a hash of the class's
>> signatures.  So this default approach means "any significant change to the
>> structure of the class (adding new methods, changing method or field
>> signatures, etc) renders serialized forms invalid".  It is a common
>> practice to explicitly assign a serial version UID to a class, thereby
>> disabling this mechanism.
>>
>> Classes that expect to evolve over time may use readObject/writeObject
>> and/or readResolve/writeReplace to customize the mapping between object
>> state and bytestream.  If classes do not use this mechanism, serialization
>> uses a default schema evolution mechanism to adjust for changes in fields
>> between serialization and deserialization time; fields that are present in
>> the bytestream but not in the target class are ignored, and fields that are
>> present in the target class but not the bytestream get default values
>> (zero, null, etc.)
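A minimal sketch of two of these mechanisms, an explicit serial version UID and a readObject hook, may help make them concrete. The class `Point`, its fields, and the helper methods are hypothetical names chosen for illustration. A single run cannot show default schema evolution itself, since that requires different class versions at write and read time.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerialHooksSketch {
    // Hypothetical example class, not taken from the discussion above.
    static class Point implements Serializable {
        // Explicit UID: replaces the default hash of the class's signatures.
        private static final long serialVersionUID = 1L;

        int x;
        int y;
        transient int cached; // transient fields are not written to the stream

        Point(int x, int y) {
            this.x = x;
            this.y = y;
            this.cached = x + y;
        }

        // Serialization hook: read the default fields, then rebuild derived state.
        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            cached = x + y;
        }
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Point p = (Point) deserialize(serialize(new Point(3, 4)));
        System.out.println(p.x + "," + p.y + " cached=" + p.cached); // 3,4 cached=7
    }
}
```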
>>
>> Anonymous classes follow the same approach and have access to the same
>> mechanisms (serialVersionUID, read/writeObject, etc), but they have two
>> additional sources of instability:
>>
>>  * The name is generated as EnclosingClass$nnn.  Any change to the set
>>    of anonymous classes in the enclosing class may cause sequence
>>    numbers to change.
>>  * The number and type of fields (which appear in bytecode but not in
>>    source code) are generated based on the set of captured values. Any
>>    change to the set or order of captured values can cause these
>>    signatures to change (in an unspecified way).
>>
>> If the signatures remain stable, anonymous classes can use serialization
>> hooks to customize the serialized form, just like named classes.
>>
>> The EG has observed that users have largely learned to deal with the
>> problems of serialization of inner classes, either by (a) not doing it, or
>> (b) ensuring that essentially the same bits are present on both sides of the
>> pipe, preventing skew from causing instability in either class names or
>> signatures.
>>
>> The EG has set, as a minimum bar, that lambda serialization be "at least
>> as good as" anonymous class serialization.  (This is not a high bar.)
>> Further, the EG has concluded that gratuitous deviations from anonymous
>> class serialization are undesirable, because, if users have to deal with an
>> imperfect scheme, having them deal with something that is basically the
>> same as an imperfect scheme they've already gotten used to is preferable to
>> dealing with a new and different scheme.
>>
>> Further, the EG has rejected the idea of arbitrarily restricting access to
>> serialization just because it is dangerous; users who have learned to use
>> it safely should not be unduly encumbered.
>>
>> *Failure modes*
>>
>> For anonymous classes, one of two things will happen when attempting to
>> deserialize after things have changed "too much":
>>
>> 1. A deserialization failure due to either the name or signature not
>>    matching, resulting in NoSuchMethodError,
>>    IncompatibleClassChangeError, etc.
>> 2. Deserializing to the wrong thing, without any evidence of error.
>>
>> Obviously, a type-2 failure is far worse than a type-1 failure, because no
>> error is raised and an unintended computation is performed.  Here are two
>> examples of changes that are behaviorally compatible but which will result
>> in type-2 failures.  The first has to do with order-of-declaration.
>>
>> *Old code*
>>
>>     Runnable r1 = new Runnable() {
>>         public void run() {
>>             System.out.println("one");
>>         }
>>     };
>>     Runnable r2 = new Runnable() {
>>         public void run() {
>>             System.out.println("two");
>>         }
>>     };
>>
>> *New code*
>>
>>     Runnable r2 = new Runnable() {
>>         public void run() {
>>             System.out.println("two");
>>         }
>>     };
>>     Runnable r1 = new Runnable() {
>>         public void run() {
>>             System.out.println("one");
>>         }
>>     };
>>
>> *Result*
>>
>> Deserialized r1 (across skew) prints "two".
>>
>> This fails because both versions produce classes called Foo$1 and Foo$2,
>> but in the old code these correspond to r1 and r2, whereas in the new code
>> they correspond to r2 and r1.
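The numbering behavior is easy to observe directly. The sketch below assumes javac, which names anonymous classes by their order of appearance in the source; the language specification does not fix this scheme, and the class and method names here are made up for the demonstration.

```java
public class NamingSketch {
    // First anonymous class in source order.
    static Runnable first() {
        return new Runnable() {
            public void run() {
                System.out.println("one");
            }
        };
    }

    // Second anonymous class in source order.
    static Runnable second() {
        return new Runnable() {
            public void run() {
                System.out.println("two");
            }
        };
    }

    public static void main(String[] args) {
        // With javac these print NamingSketch$1 and NamingSketch$2.
        // Swapping the two method bodies in the source swaps which
        // behavior each generated class name refers to.
        System.out.println(first().getClass().getName());
        System.out.println(second().getClass().getName());
    }
}
```

Because the serialized form records only the class name, a stream written against one ordering is resolved against whatever class carries that name at read time.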
>>
>> The other failure has to do with order-of-capture.
>>
>> *Old code*
>>
>>     String s1 = "foo";
>>     String s2 = "bar";
>>     Runnable r = new Runnable() {
>>         public void run() {
>>             foo(s1, s2);
>>         }
>>     };
>>
>> *New code*
>>
>>     String s1 = "foo";
>>     String s2 = "bar";
>>     Runnable r = new Runnable() {
>>         public void run() {
>>             String s = s2;
>>             foo(s1, s);
>>         }
>>     };
>>
>> *Result*
>>
>> On deserialization, s1 and s2 are effectively swapped.
>>
>> This fails because the order of arguments in the implicitly generated
>> constructor of the inner class changes due to the order in which the
>> compiler encounters captured variables.  If the reordered variables were of
>> different types, this would cause a type-1 failure, but if they are the
>> same type, it causes a type-2 failure.
>>
>> *User expectations*
>>
>> While experienced users are quick to state the "same bits on both sides"
>> rule for reliable deserialization, a bit of investigation reveals that user
>> expectations are actually higher than that.  For example, if the compiler
>> generated a /random/ name for each lambda at compile time, then recompiling
>> the same source with the same compiler, and using the result for
>> deserialization, would fail.  This is too restrictive; user expectations
>> are not tied to "same bits", but to a vaguer notion of "I compiled
>> essentially the same source with essentially the same compiler, and
>> therefore didn't change anything significant."  For example, users would
>> balk if adding a comment or changing whitespace were to affect
>> deserialization.  Users likely also expect (in part, due to the behavior of
>> anonymous classes) that changes to code that doesn't affect the lambda
>> directly or indirectly (e.g., adding or removing a debugging println) would
>> not affect the serialized form.
>>
>> In the absence of the user being able to explicitly name the lambda /and/
>> its captures (as C++ does), there is no perfect solution.  Instead, our
>> goal can only be to minimize type-2 failures while not unduly creating
>> type-1 failures when "no significant code change" happened.  This means we
>> have to put a stake in the ground as to what constitutes "significant" code
>> change.
>>
>> The de-facto (and likely accidental) definition of "significant" used by
>> inner classes here is:
>>
>>  * Adding, removing, or reordering inner class instances earlier in the
>>    source file;
>>  * Changes to the number, order, or type of captured arguments
>>
>> This permits changes to code that has nothing to do with inner classes,
>> and many common refactorings as long as they do not affect the order of
>> inner class instances or their captures.
>>
>> *Current Lambda behavior*
>>
>> Lambda serialization currently behaves very similarly to anonymous class
>> serialization.  Where anonymous classes have stable method names but
>> unstable class names, lambdas are the dual: unstable method names but
>> stable class names.  But since both are used together, the resulting naming
>> stability is largely the same.
>>
>> We do one thing to increase naming stability for lambdas: we hash the name
>> and signature of the enclosing method in the lambda name. This insulates
>> lambda naming from the addition, removal, or reordering of methods within a
>> class file, but naming stability remains sensitive to the order of lambdas
>> within the method. Similarly, order-of-capture issues are largely similar
>> to inner classes.
>>
>> Lambda bodies are desugared to methods named in the following form:
>> lambda$/mmm/$/nnn/, where /mmm/ is a hash of the method name and signature,
>> and /nnn/ is a sequence number of lambdas that have the same /mmm/ hash.
>>
>> Because lambdas are instantiated via invokedynamic rather than invoking a
>> constructor directly, there is also slightly more leniency to changes to
>> the /types/ of captured arguments; changing a captured argument from, say,
>> String to Object, would be a breaking change for anonymous classes (it
>> changes the constructor signature) but not for lambdas.  This leniency is
>> largely an accidental artifact of translation, rather than a deliberate
>> design decision.
>>
>> *Possible improvements*
>>
>> We can start by recognizing the role of the hash of the enclosing method
>> in the lambda method name.  This reduces the set of lambdas that could
>> collide from "all the lambdas in the file" to "all the lambdas in the
>> method."  This reduces the set of changes that cause both type-1 and type-2
>> errors.
>>
>> An additional observation is that there is a tension between trying to
>> /recover from/ skew (rather than simply trying to detect it, and failing
>> deserialization) and complexity.  So I think we should focus primarily on
>> detecting skew and failing deserialization (turning type-2 failures into
>> type-1) while at the same time not unduly increasing the set of changes
>> that cause type-1 errors, with the goal of settling on an informal
>> guideline of what constitutes "too much" change.
>>
>> We can do this by increasing the number of things that affect the /mmm/
>> hash, effectively constructing the lambda-equivalent of the serialization
>> version UID.  The more context we add to this hash, the smaller the set of
>> lambdas that hash to the same bucket gets, which reduces the space of
>> possible collisions.  The following table shows possible candidates for
>> inclusion, along with examples of code that illustrate dependence on this
>> item.
>>
>> *Item*: Names of captured arguments
>>
>> *Old code*
>>
>>     int x = ...
>>     f(() -> x);
>>
>> *New code*
>>
>>     int y = ...
>>     f(() -> y);
>>
>> *Effect*: Including the names of captured arguments in the hash would
>> cause rename-refactors of captured arguments to be considered a
>> serialization-breaking change.
>>
>> *Rationale*: While alpha-renaming is generally considered to be
>> semantic-preserving, serialization has always keyed off of names (such as
>> field names) as being clues to developer intent.  It seems reasonable to
>> say "If you change the names involved, we have to assume a semantic change
>> occurred."  We cannot tell if a name change is a simple alpha-rename or
>> capturing a completely different variable, so this is erring on the safe
>> side.
>>
>> *Item*: Types of captured arguments
>>
>> *Old code*
>>
>>     String x = ...
>>     f(() -> x);
>>
>> *New code*
>>
>>     Object x = ...
>>     f(() -> x);
>>
>> *Rationale*: It seems reasonable to say that, if you capture arguments of
>> a different type, you've made a semantic change.
>>
>> *Item*: Order of captured arguments
>>
>> *Old code*
>>
>>     () -> {
>>         int a = f(x);
>>         int b = g(y);
>>         return h(a,b);
>>     };
>>
>> *New code*
>>
>>     () -> {
>>         int b = g(y);
>>         int a = f(x);
>>         return h(a,b);
>>     };
>>
>> *Effect*: Changing the order of capture would become a type-1 failure
>> rather than possibly a type-2 failure.
>>
>> *Rationale*: Since we cannot detect whether the ordering change is
>> semantically meaningful or not, it is best to be conservative and say:
>> change to capture order is likely a semantic change.
>>
>> *Item*: Variable assignment target (if present)
>>
>> *Old code*
>>
>>     Runnable r1 = Foo::f;
>>     Runnable r2 = Foo::g;
>>
>> *New code*
>>
>>     Runnable r2 = Foo::g;
>>     Runnable r1 = Foo::f;
>>
>> *Effect*: Including variable target name would render this reordering
>> recoverable and correct.
>>
>> *Rationale*: If the user has gone to the effort of providing a name, we
>> can use this as a hint to the meaning of the lambda.
>>
>> *Old code*
>>
>>     Runnable r = Foo::f;
>>
>> *New code*
>>
>>     Runnable runnable = Foo::f;
>>
>> *Effect*: Including variable target name would render this change
>> (previously recoverable and correct) a deserialization failure.
>>
>> *Rationale*: If the user has changed the name, it seems reasonable to
>> treat that as possibly meaning something else.
>>
>> *Item*: Target type
>>
>> *Old code*
>>
>>     Predicate<String> p = String::isEmpty;
>>
>> *New code*
>>
>>     Function<String, Boolean> p = String::isEmpty;
>>
>> *Effect*: Including target type reduces the space of potential sequence
>> number collisions.
>>
>> *Rationale*: If you've changed the target type, it is a different lambda.
>>
>> This list is not exhaustive, and there are others we might consider.  (For
>> example, for lambdas that appear in method invocation context rather than
>> assignment context, we might include the hash of the invoked method name
>> and signature, or even the parameter index or name.  This is where it
>> starts to exhibit diminishing returns and increasing brittleness.)
>>
>> Taken in total, the effect is:
>>
>>  * All order-of-capture issues become type-1 failures, rather than
>>    type-2 failures (modulo hash collisions).
>>  * Order of declaration issues are still present, but they are
>>    dramatically reduced, turning many type-2 failures into type-1 failures.
>>  * Some new type-1 failures are introduced, mostly those deriving from
>>    rename-refactors.
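To make the combined effect concrete, here is a rough sketch of how such a name could be derived. It is an illustration under stated assumptions, not the compiler's implementation: the class `LambdaNameSketch`, the `nameFor` method, the string encoding of the context, and the use of `String.hashCode` are all hypothetical choices.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LambdaNameSketch {
    // Next sequence number (nnn) per hash bucket (mmm), for one compilation.
    private final Map<String, Integer> sequence = new HashMap<>();

    // Builds lambda$mmm$nnn, where mmm hashes all of the context listed above.
    // Pass an empty string when there is no assignment target.
    String nameFor(String enclosingMethod, String targetType, String assignmentTarget,
                   List<String> capturedTypes, List<String> capturedNames) {
        String context = String.join("|",
                enclosingMethod,
                targetType,
                assignmentTarget,
                String.join(",", capturedTypes),   // order-sensitive
                String.join(",", capturedNames));  // order-sensitive
        String mmm = Integer.toHexString(context.hashCode());
        int nnn = sequence.merge(mmm, 1, Integer::sum);
        return "lambda$" + mmm + "$" + nnn;
    }

    public static void main(String[] args) {
        // Old source captures x then y; new source captures y then x.
        String oldName = new LambdaNameSketch().nameFor(
                "run()V", "Supplier", "s",
                Arrays.asList("int", "int"), Arrays.asList("x", "y"));
        String newName = new LambdaNameSketch().nameFor(
                "run()V", "Supplier", "s",
                Arrays.asList("int", "int"), Arrays.asList("y", "x"));

        // Different names: deserialization across this skew would fail
        // (type-1) instead of silently swapping the captures (type-2).
        System.out.println(oldName.equals(newName)); // false
    }
}
```

Each new instance stands in for one compilation of the class, so the sequence numbers restart just as they would when the source is recompiled.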
>>
>> The remaining type-2 failures could be dealt with if we added named
>> lambdas in the future.  (They are also prevented if users always assign
>> lambdas to local variables whose names are unique within the method; in
>> this way, the local-variable trick becomes a sort of poor-man's named
>> lambda.)
>>
>> We can reduce the probability of collision further by using a different
>> (and simpler) scheme for non-serializable lambdas (lambda$nnn), so that
>> serializable lambdas can only accidentally collide with each other.
>>
>> However, there are some transformations which we will still not be able to
>> avoid under this scheme.  For example:
>>
>> *Old code*
>>
>>     Supplier<Integer> s =
>>         foo ? () -> 1
>>             : () -> 2;
>>
>> *New code*
>>
>>     Supplier<Integer> s =
>>         !foo ? () -> 2
>>              : () -> 1;
>>
>> *Result*
>>
>> This change is behaviorally compatible but could result in a type-2
>> failure, since both lambdas have the same target type, capture arity, etc.
>>
>> However^2, we can still detect this risk and warn the user.  If for any
>> /mmm/, we issue more than one sequence number /nnn/, we are at risk for a
>> type-2 failure, and can issue a lint warning in that case, suggesting the
>> user refactor to something more stable.  (Who knows what that diagnostic
>> message will look like.) With all the hash information above, it seems
>> likely that the number of potentially colliding lambdas will be small
>> enough that this warning would not come along too often.
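The detection described here amounts to counting how many lambdas land in each hash bucket. A small sketch, assuming the /mmm/ hashes for one class are already available as strings (the class name, method name, and sample values are hypothetical):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CollisionLintSketch {
    // Returns the hashes for which more than one sequence number would be
    // issued, i.e. the lambdas that remain at risk of a type-2 failure.
    static Set<String> atRisk(List<String> lambdaHashes) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String hash : lambdaHashes) {
            counts.merge(hash, 1, Integer::sum);
        }
        Set<String> risky = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                risky.add(entry.getKey());
            }
        }
        return risky;
    }

    public static void main(String[] args) {
        // Two lambdas share the hash "a1", so that bucket deserves a warning.
        System.out.println(atRisk(Arrays.asList("a1", "b2", "a1"))); // [a1]
    }
}
```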
>>
>> The impact of this change in the implementation is surprisingly small.  It
>> does not affect the serialized form (java.lang.invoke.SerializedLambda),
>> or the generated deserialization code ($deserialize$).  It only affects the
>> code which generates the lambda method name, which needs access to a small
>> additional bit of information -- the assignment target name.  Similarly,
>> detecting the condition required for warning is easy -- "sequence number !=
>> 1".
>>
>> Qualitatively, the result is still similar in feel to inner classes -- you
>> can make "irrelevant" changes but we make no heroic attempts to recover
>> from things like changes in capture order -- but we do a better job of
>> detecting them (and, if you follow some coding discipline, you can avoid
>> them entirely).
>>
>>
>>
>
>
> --
> Cheers,
> Paul