One final stab at improving lambda serialization
Raab, Donald
Donald.Raab at gs.com
Wed Aug 21 19:29:23 PDT 2013
+1
> -----Original Message-----
> From: Zakharov, Vladimir [Tech]
> Sent: Wednesday, August 21, 2013 10:26 PM
> To: 'Doug Lea'; 'Brian Goetz'
> Cc: 'lambda-libs-spec-experts at openjdk.java.net'; 'lambda-spec-
> experts at openjdk.java.net'; Raab, Donald [Tech]
> Subject: RE: One final stab at improving lambda serialization
>
> I agree. Looks like this proposal strikes just the right balance: it is not being
> unnecessarily (and unexpectedly) restrictive while avoiding "silent" failures. I
> am supportive of serialization behavior implemented as described below.
>
> Thank you,
> Vlad
>
> The Goldman Sachs Group, Inc. All rights reserved.
> See http://www.gs.com/disclaimer/global_email for important risk
> disclosures, conflicts of interest and other terms and conditions relating to
> this e-mail and your reliance on information contained in it. This message
> may contain confidential or privileged information. If you are not the
> intended recipient, please advise us immediately and delete this
> message. See http://www.gs.com/disclaimer/email for further information
> on confidentiality and the risks of non-secure electronic communication. If
> you cannot access these links, please notify us by reply message and we will
> send the contents to you.
>
>
> -----Original Message-----
> From: lambda-libs-spec-experts-bounces at openjdk.java.net [mailto:lambda-
> libs-spec-experts-bounces at openjdk.java.net] On Behalf Of Doug Lea
> Sent: Monday, August 19, 2013 1:44 PM
> To: Brian Goetz
> Cc: lambda-libs-spec-experts at openjdk.java.net; lambda-spec-
> experts at openjdk.java.net
> Subject: Re: One final stab at improving lambda serialization
>
>
> This is the sort of scheme I had in mind in my reply to David Lloyd (that might
> have in part led to this effort). And the details seem reasonable to me. Looks
> good to go, assuming no unexpected implementation snags.
>
> -Doug
>
>
> On 08/19/2013 11:37 AM, Brian Goetz wrote:
> > *Background*
> >
> > The fundamental challenge with serialization is that the code that
> > defined a class at serialization time may have changed by the time
> > deserialization happens. Serialization is defined to be tolerant of
> > change to a certain extent, and admits a degree of customization to allow
> additional flexibility.
> >
> > For ordinary classes, there are three lines of defense:
> >
> > * serialVersionUID
> > * serialization hooks
> > * default schema evolution
> >
> > Serial version UID for the target class must match exactly. By
> > default, serialization uses a serial version UID which is a hash of
> > the classes signatures. So this default approach means "any
> > significant change to the structure of the class (adding new methods,
> > changing method or field signatures,
> > etc) renders serialized forms invalid". It is a common practice to
> > explicitly assign a serial version UID to a class, thereby disabling this
> mechanism.
> >
> > Classes that expect to evolve over time may use readObject/writeObject
> > and/or readResolve/writeReplace to customize the mapping between
> > object state and bytestream. If classes do not use this mechanism,
> > serialization uses a default schema evolution mechanism to adjust for
> > changes in fields between serialization and deserialization time;
> > fields that are present in the bytestream but not in the target class
> > are ignored, and fields that are present in the target class but not
> > the bytestream get default values (zero, null, etc.)
> >
> > Anonymous classes follow the same approach and have access to the same
> > mechanisms (serialVersionUID, read/writeObject, etc), but they have
> > two additional sources of instability:
> >
> > * The name is generated as EnclosingClass$nnn. Any change to the set of
> > anonymous classes in the enclosing class may cause sequence numbers
> to change.
> > * The number and type of fields (appears in bytecode but not source
> code) are
> > generated based on the set of captured values. Any change to the set or
> > order of captured values can cause these signatures to change (in an
> > unspecified way).
> >
> > If the signatures remain stable, anonymous classes can use
> > serialization hooks to customize the serialized form, just like named classes.
> >
> > The EG has observed that users have largely learned to deal with the
> > problems of serialization of inner classes, either by (a) don't do it,
> > or (b) ensure that essentially the same bits are present on both sides
> > of the pipe, preventing skew from causing instability in either class names
> or signatures.
> >
> > The EG has set, as a minimum bar, that lambda serialization be "at
> > least as good as" anonymous class serialization. (This is not a high
> > bar.) Further, the EG has concluded that gratuitous deviations from
> > anonymous class serialization are undesirable, because, if users have
> > to deal with an imperfect scheme, having them deal with something that
> > is basically the same as an imperfect scheme they've already gotten
> > used to is preferable to dealing with a new and different scheme.
> >
> > Further, the EG has rejected the idea of arbitrarily restricting
> > access to serialization just because it is dangerous; users who have
> > learned to use it safely should not be unduly encumbered.
> >
> > *Failure modes
> > *
> >
> > For anonymous classes, one of two things will happen when attempting
> > to deserialize after things have changed "too much":
> >
> > 1. A deserialization failure due to either the name or signature not
> matching,
> > resulting in NoSuchMethodError, IncompatibleClassChangeError, etc.
> > 2. Deserializing to the wrong thing, without any evidence of error.
> >
> > Obviously, a type-2 failure is far worse than a type-1 failure,
> > because no error is raised and an unintended computation is performed.
> > Here are two examples of changes that are behaviorally compatible but
> > which will result in type-2 failures. The first has to do with order-of-
> declaration.
> >
> > *Old code**
> > * *New code**
> > * *Result**
> > *
> > Runnable r1 = new Runnable() {
> > void run() {
> > System.out.println("one");
> > }
> > };
> > Runnable r2 = new Runnable() {
> > void run() {
> > System.out.println("two");
> > }
> > };
> > Runnable r2 = new Runnable() {
> > void run() {
> > System.out.println("two");
> > }
> > };
> > Runnable r1 = new Runnable() {
> > void run() {
> > System.out.println("one");
> > }
> > };
> > Deserialized r1 (across skew) prints "two".
> >
> > This fails because in both cases, we get classes called Foo$1 and
> > Foo$2, but in the old code, these correspond to r1 and r2, but in the
> > new code, these correspond to r2 and r1.
> >
> > The other failure has to do with order-of-capture.
> >
> > *Old code**
> > * *New code**
> > * *Result**
> > *
> > String s1 = "foo";
> > String s2 = "bar";
> > Runnable r = new Runnable() {
> > void run() {
> > foo(s1, s2);
> > }
> > };
> >
> > String s1 = "foo";
> > String s2 = "bar";
> > Runnable r = new Runnable() {
> > void run() {
> > String s = s2;
> > foo(s1, s);
> > }
> > };
> > On deserialization, s1 and s2 are effectively swapped.
> >
> > This fails because the order of arguments in the implicitly generated
> > constructor of the inner class changes due to the order in which the
> > compiler encounters captured variables. If the reordered variables
> > were of different types, this would cause a type-1 failure, but if
> > they are the same type, it causes a type-2 failure.
> >
> > *User expectations*
> >
> > While experienced users are quick to state the "same bits on both
> > sides" rule for reliable deserialization, a bit of investigation
> > reveals that user expectations are actually higher than that. For
> > example, if the compiler generated a /random/ name for each lambda at
> > compile time, then recompiling the same source with the same compiler,
> > and using the result for deserialization, would fail. This is too
> > restrictive; user expectations are not tied to "same bits", but to a
> > vaguer notion of "I compiled essentially the same source with
> > essentially the same compiler, and therefore didn't change anything
> > significant." For example, users would balk if adding a comment or
> > changing whitespace were to affect deserialization. Users likely
> > expect (in part, due to behavior of anonymous classes) changes to code
> > that doesn't affect the lambda directly or indirectly (e.g., add or remove a
> debugging println) also would not affect the serialized form.
> >
> > In the absence of the user being able to explicitly name the lambda
> > /and/ its captures (as C++ does), there is no perfect solution.
> > Instead, our goal can only be to minimize type-2 failures while not
> > unduly creating type-1 failures when "no significant code change"
> > happened. This means we have to put a stake in the ground as to what
> constitutes "significant" code change.
> >
> > The de-facto (and likely accidental) definition of "significant" used
> > by inner classes here is:
> >
> > * Adding, removing, or reordering inner class instances earlier in the
> source
> > file;
> > * Changes to the number, order, or type of captured arguments
> >
> > This permits changes to code that has nothing to do with inner
> > classes, and many common refactorings as long as they do not affect
> > the order of inner class instances or their captures.
> >
> > *Current Lambda behavior*
> >
> > Lambda serialization currently behaves very similarly to anonymous
> > class serialization. Where anonymous classes have stable method names
> > but unstable class names, lambdas are the dual; unstable method names
> > but stable class names. But since both are used together, the
> > resulting naming stability is largely the same.
> >
> > We do one thing to increase naming stability for lambdas: we hash the
> > name and signature of the enclosing method in the lambda name. This
> > insulates lambda naming from the addition, removal, or reordering of
> > methods within a class file, but naming stability remains sensitive to
> > the order of lambdas within the method. Similarly, order-of-capture issues
> are largely similar to inner classes.
> >
> > Lambdas bodies are desugared to methods named in the following form:
> > lambda$/mmm/$/nnn/, where /mmm/ is a hash of the method name and
> > signature, and /nnn/ is a sequence number of lambdas that have the same
> /mmm/ hash.
> >
> > Because lambdas are instantiated via invokedynamic rather than
> > invoking a constructor directly, there is also slightly more leniency
> > to changes to the /types/ of captured argument; changing a captured
> > argument from, say, String to Object, would be a breaking change for
> > anonymous classes (it changes the constructor signature) but not for
> > lambdas. This leniency is largely an accidental artifact of translation, rather
> than a deliberate design decision.
> >
> > *Possible improvements*
> >
> > We can start by recognizing the role of the hash of the enclosing
> > method in the lambda method name. This reduces the set of lambdas
> > that could collide from "all the lambdas in the file" to "all the
> > lambdas in the method." This reduces the set of changes that cause both
> type-1 and type-2 errors.
> >
> > An additional observation is that there is a tension between trying to
> > /recover from/ skew (rather than simply trying to detect it, and
> > failing deserialization) and complexity. So I think we should focus
> > primarily on detecting skew and failing deserialization (turning
> > type-2 failures into type-1) while at the same time not unduly
> > increasing the set of changes that cause type-1 errors, with the goal of
> settling on an informal guideline of what constitutes "too much" change.
> >
> > We can do this by increasing the number of things that affect the
> > /mmm/ hash, effectively constructing the lambda-equivalent of the
> > serialization version UID. The more context we add to this hash, the
> > smaller the set of lambdas that hash to the same bucket gets, which
> reduces the space of possible collisions.
> > The following table shows possible candidates for inclusion, along
> > with examples of code that illustrate dependence on this item.
> >
> > *Item**
> > * *Old Code**
> > ------------------------------
> > * *New Code**
> > **------------------------------*
> > *Effect**
> > * *Rationale**
> > *
> > Names of captured arguments
> > int x = ...
> > f(() -> x);
> > int y = ...
> > f(() -> y); Including the names of captured arguments in the hash
> would cause
> > rename-refactors of captured arguments to be considered a
> > serialization-breaking change.
> > While alpha-renaming is generally considered to be
> > semantic-preserving, serialization has always keyed off of names (such
> > as field names) as being clues to developer intent. It seems
> > reasonable to say "If you change the names involved, we have to assume
> > a semantic change occurred." We cannot tell if a name change is a
> > simple alpha-rename or capturing a completely different variable, so this is
> erring on the safe side.
> > Types of captured arguments
> > String x = ...
> > f(() -> x); Object x = ...
> > f(() -> x);
> > It seems reasonable to say that, if you capture arguments of a
> > different type, you've made a semantic change.
> > Order of captured arguments
> > () -> {
> > int a = f(x);
> > int b = g(y);
> > return h(a,b);
> > };
> > () -> {
> > int b = g(y);
> > int a = f(x);
> > return h(a,b);
> > }; Changing the order of capture would become a type-1 failure rather
> than
> > possibly a type-2 failure.
> > Since we cannot detect whether the ordering change is semantically
> > meaningful or not, it is best to be conservative and say: change to
> > capture order is likely a semantic change.
> > Variable assignment target (if present)
> > Runnable r1 = Foo::f;
> > Runnable r2 = Foo::g;
> > Runnable r2 = Foo::g;
> > Runnable r1 = Foo::f;
> >
> > Including variable target name would render this reordering
> recoverable and correct
> > If the user has gone to the effort of providing a name, we can use
> > this as a hint to the meaning of the lambda.
> >
> > Runnable r = Foo::f; Runnable runnable = Foo::f; Including
> variable target
> > name would render this change (previously recoverable and correct) a
> > deserialiation failure
> > If the user has changed the name, it seems reasonable to treat that
> > as possibly meaning something else.
> > Target type
> > Predicate<String> p = String::isEmpty;
> > Function<String, Boolean> p = String::isEmpty; Including target type
> reduces
> > the space of potential sequence number collisions.
> > If you've changed the target type, it is a different lambda.
> >
> > This list is not exhaustive, and there are others we might consider.
> > (For example, for lambdas that appear in method invocation context
> > rather than assignment context, we might include the hash of the
> > invoked method name and signature, or even the parameter index or
> > name. This is where it starts to exhibit diminishing returns and
> > increasing brittleness.)
> >
> > Taken in total, the effect is:
> >
> > * All order-of-capture issues become type-1 failures, rather than type-2
> > failures (modulo hash collisions).
> > * Order of declaration issues are still present, but they are dramatically
> > reduced, turning many type-2 failures into type-1 failures.
> > * Some new type-1 failures are introduced, mostly those deriving from
> > rename-refactors.
> >
> > The remaining type-2 failures could be dealt with if we added named
> > lambdas in the future. (They are also prevented if users always
> > assign lambdas to local variables whose names are unique within the
> > method; in this way, the local-variable trick becomes a sort of
> > poor-man's named lambda.)
> >
> > We can reduce the probability of collision further by using a
> > different (and
> > simpler) scheme for non-serializable lambdas (lambda$nnn), so that
> > serializable lambdas can only accidentally collide with each other.
> >
> > However, there are some transformations which we will still not be
> > able to avoid under this scheme. For example:
> >
> > *Old code**
> > * *New code**
> > * *Result**
> > *
> > Supplier<Integer> s =
> > foo ? () -> 1
> > : () -> 2;
> > Supplier<Integer> s =
> > !foo ? () -> 2
> > : () -> 1; This change is behaviorally compatible but could result in
> > type-2 failure, since both lambdas have the same target type, capture arity,
> etc.
> >
> > However^2, we can still detect this risk and warn the user. If for
> > any /mmm/, we issue more than one sequence number /nnn/, we are at
> > risk for a type-2 failure, and can issue a lint warning in that case,
> > suggesting the user refactor to something more stable. (Who knows
> > what that diagnostic message will look
> > like.) With all the hash information above, it seems likely that the
> > number of potentially colliding lambdas will be small enough that this
> > warning would not come along too often.
> >
> > The impact of this change in the implementation is surprisingly small.
> > It does not affect the serialized form
> > (java.lang.invoke.SerializedLambda), or the generated deserialization
> > code ($deserialize$). It only affects the code which generates the
> > lambda method name, which needs access to a small additional bit of
> > information -- the assignment target name. Similarly, detecting the
> condition required for warning is easy -- "sequence number != 1".
> >
> > Qualitatively, the result is still similar in feel to inner classes --
> > you can make "irrelevant" changes but we make no heroic attempts to
> > recover from things like changes in capture order -- but we do a
> > better job of detecting them (and, if you follow some coding
> > discipline, you can avoid them entirely.)
> >
> >
More information about the lambda-libs-spec-experts
mailing list