How records would fit into Google's codebase

Wed Apr 3 17:46:15 UTC 2019

Ah, and Liam has kindly hosted that HTML document on cr.openjdk for me,
since I don't have an account there yet.

So, anyone looking through the archives can find a fancier version of the
report at:

 http://cr.openjdk.java.net/~cushon/amalloy/JDKRecordsProposalReport.html

On Wed, Apr 3, 2019 at 10:27 AM Alan Malloy <amalloy at google.com> wrote:

> Hello, amber-spec-experts. I have done some analysis of Google's codebase,
> to see if it can give us any concrete ideas about hoe developers might use
> records, and about how we might design records to make them more useful.
>
> A plain-text version of my report follows below - I had prepared a lovely
> version with formatting and inline links, but was advised that plain text
> is better for this forum. I've enclosed the original HTML document as an
> attachment to this email, in case that works better.
>
> Record Proposal Feedback Based On @AutoValue
>
> The most recent version of the records proposal puts many restrictions on
> records, with the aim of producing a more focused, opinionated tool. Most
> notably: record fields must be final; record classes must be final; field
> accessors must be public. There has broadly been support for these ideas,
> from Google and from other JDK contributors: it appeals to our
> sensibilities. However, there have also been some questions asked about
> whether the restrictions imposed will hinder adoption, and it’s hard to
> estimate that in the abstract.
>
> Google would like to ensure Java gets the best possible version of the
> record feature, and so in addition to thinking abstractly about what
> features sound good for records, I have spent some time collecting concrete
> data from the Google codebase, to determine how well a version of records
> would fit in there, and what changes to the records proposal might make it
> better.
>
> A natural starting point is to compare with @AutoValue, an annotation
> processor Google publishes and uses to auto-generate implementations for
> simple data classes (for details, see
> https://github.com/google/auto/blob/master/value/userguide/index.md). We
> use these internally for roughly the same kinds of things records might be
> used for, and so data about how they are used will help when evaluating the
> impact of various facets of the records proposal. Accordingly, I have
> collected some statistics about all the @AutoValue classes in Google’s
> codebase.
>
> Below I share this data, and present some recommendations based on it.
> When it is useful and possible, I also include simplified and fictionalized
> code samples based on the real code in Google’s codebase, to clarify what
> patterns I am talking about.
>
> One additional thing I would have liked to do is to somehow find classes
> which were “almost” records, but which ended up not using @AutoValue. A
> survey of these might help us decide what changes we could make to increase
> adoption, or simply confirm for us, “Yes, it’s a good thing we included
> restriction X, because this class is a bad candidate for a record, but
> without restriction X it might have become a record”. Alas, unsurprisingly,
> it is much easier to find actual @AutoValue classes than to design a
> heuristic for “almost an @AutoValue”, and so I have not done this.
>
> Summary and Recommendations (TL;DR)
>
> For those who don’t care to slog through the whole document, I present
> here a summary of recommendations. Many of these recommendations simply
> echo what is already in the proposal, because the data I found supports the
> proposal.
>
> * Records can expect to be about as common as enums
> * Records should be allowed to implement interfaces, and perhaps to extend
> a superclass, but should not allow their own subclasses (except perhaps
> abstract records)
> * Records should have a very lightweight syntax for the common case,
> preferably one line. @AutoValue’s clunkier syntax may be reducing adoption
> for some use cases.
> * Records should be immutable. The language should enforce shallow
> mutability (by making all fields final), and style guides should recommend
> that deep immutability.
> * We should consider adding an alternative way to construct records
> besides a constructor with positional parameters. Builders are popular for
> @AutoValue, but perhaps something better could be done for records as a
> language feature.
> * Records do not need language-level support for withFoo methods, or
> toBuilder, even if builder support is included.
> * Records do not need a way to include private/hidden state, or to memoize
> derived properties
> * Records should allow implementing Object methods by hand, rejecting the
> auto-generated implementation, but expect this to be done rarely
>
> Popularity
>
> One simple question to ask is: how often would records be used? If records
> end up being used very rarely, we may regret spending too much time
> designing and implementing them, or may wish we had made them more flexible
> instead of more restrictive. To measure popularity, I simply asked “what
> percentage of all named, non-local classes are @AutoValue classes?” I limit
> to named, non-local classes because these are the scopes in which
> @AutoValue is able to operate; a record integrated more tightly with the
> language could replace local classes, but @AutoValue can’t. The answer is:
> 3.0% of Google’s named, concrete classes are @AutoValue classes. Is that a
> lot, or a little? In isolation it’s hard to say. For comparison, I asked
> how many of Google’s classes fall into other interesting categories. At two
> opposite ends of the spectrum, 8.9% of classes are anonymous, while just
> 0.05% of classes are method-local named classes. A particularly promising
> comparison is to enums, a feature which similarly gives up some flexibility
> for increased expressive power, and to which Brian refers in the records
> proposal. Of Google’s named, concrete classes, 3.7% of them are enums.
>
> So, it seems we are past the first hurdle: there is a healthy appetite for
> “simple immutable data carriers”, if we make a compelling offering. We can
> expect them to be less popular than anonymous classes, but roughly as
> popular as enums. Or perhaps more popular: quite possibly, developers would
> be more keen to use records if they were a language feature instead of a
> Google library. It’s hard to know for sure.
>
> Superclasses
>
> How often do @AutoValue classes make use of inheritance? 77% of @AutoValue
> classes are “islands” in the inheritance graph: they extend Object, and
> implement no interfaces. 15% of @AutoValue classes extend Object and
> implement exactly one interface. A mere 4% extend some class other than
> Object, and implement no interfaces. Very few do anything else (implement
> 2+ interfaces, or implement interface(s) while also extending a class).
>
> Just like @AutoValue, the current proposal plans to allow implementing
> interfaces and/or extending a class. That seems reasonable, but extending a
> class is rare enough that we could consider forbidding it if that fits
> better onto the semantic goals of “simple data carriers”: it won’t harm too
> large a percentage of the usages. Perhaps, for example, we could reserve
> the extends syntax for the future possibility of extending “abstract
> records” that Brian suggested. Still, it would be reasonable to allow
> normal subclassing, if we judge that it helps achieve the semantic goals of
> records.
>
> One possible source of skew in this data: since @AutoValue determines its
> fields by identifying abstract 0-argument methods, rather than by
> explicitly listing them, some @AutoValue classes “inherit their API” by
> extending an abstract class, or implementing an interface, containing some
> abstract accessor methods. I don’t have a good estimate for how common this
> is. Such uses are perhaps arguments in favor of the “extending abstract
> records” feature. However, it is also interesting for a non-record class to
> conform to the same interface. Consider, for example, this pattern I have
> seen a few times:
>
> public interface JobInfo {
>   String session();
>   boolean privileged();
>   Instant startTime();
> }
>
> @AutoValue public abstract class FakeJobInfo implements JobInfo {
>   // Note no fields specified: they are inherited from JobInfo!
>   Builder builder() {return new AutoValue_FakeJobInfo.Builder();}
>   @AutoValue.Builder public interface Builder {
>     Builder setSession(String session);
>     Builder setPrivileged(boolean privileged);
>     Builder setStartTime(Instant startTime);
>     FakeJobInfo build();
>   }
> }
>
> public class JobRegistry {
>   private Database database;
>   // ...
>   private class JobInfoImpl implements JobInfo {
>     private JobId id;
>     public JobInfoImpl(JobId id) {this.id = id;}
>     public String session() {return database.lookupSession(id);}
>     public boolean privileged() {return database.isPrivileged(id);}
>     public Instant startTime() {
>       return Instant.ofEpochSecond(database.startTime(id));
>     }
>   }
> }
>
> Of course I have simplified the logic in JobInfoImpl, but the idea is that
> there is an interface for looking things up, a fake implementation (used in
> tests) that is a simple record, and a non-record implementation for
> production use that gets its data from some other source.
>
> I think the records proposal as written already supports this use case
> semantically: we can define the interface first, then define a record that
> implements it. However, we don’t get any code for free: we have to repeat
> the list of properties, defining them once in the interface and then again
> in the record’s list of fields. One interesting possibility would be some
> connection between records and interfaces. Perhaps a record definition
> could, upon request, also produce an interface that can be conformed to by
> some non-record implementation. Alternatively, a record could inherit
> fields based on the interfaces it implements, as @AutoValue does; however,
> since records do not normally have their fields defined via abstract
> methods, I think this approach fits less well for records than it does for
> @AutoValue.
>
> Subclasses
>
> Another inheritance-related question: should subclasses of records be
> allowed? Google’s @AutoValue documentation strongly discourages this, but
> we cannot make it illegal because @AutoValue uses subclassing as an
> implementation detail (it generates a subclass of your abstract class). So,
> we can ask how many authors decided to write additional subclasses despite
> the warnings. Just 0.26% of @AutoValue classes have subclasses (aside from
> the auto-generated one that we expect). An inspection of some of these
> instances doesn’t suggest a compelling case for subclassing of records.
> These subclasses are mostly stubs for testing: for example, overriding
> accessors to throw an exception, or to return a value that could have just
> been specified in the constructor. The authors do not seem to be
> intentionally working around some restriction of @AutoValue.
>
> One example:
>
> @AutoValue public abstract class Document {
>   public abstract String text();
>   public abstract Language language();
>   // and 4 more fields...
> }
>
> public class DocumentTestHelper {
>   public static Document instance() {return INSTANCE;}
>   private static final Document INSTANCE = new ThrowingDocument();
>   private static class ThrowingPoint extends Point {
>     private UnsupportedOperationException cannotDoThis() {
>       return new UnsupportedOperationException("Cannot use this!");
>     }
>     @Override public String text() {throw cannotDoThis();}
>     @Override public Language language() {throw cannotDoThis();}
>     // and 4 more fields doing the same thing...
>   }
> }
>
> The rarity of subclasses (and lack of any convincing subclasses) argues in
> favor of the restriction that all records must be final (except, perhaps,
> some kind of abstract record).
>
> Visibility and Scoping
>
> I wanted to know whether people are using @AutoValue mostly for public API
> “contracts”, or for internal implementation details. To answer this
> question, I asked of each @AutoValue class whether it a nested class or
> top-level, and what visibility modifier it declares. This misses some
> subtlety: I didn’t pay attention to effective visibility, so a public
> @AutoValue nested inside a package-private class would look public in this
> analysis.
>
> 62% of @AutoValue classes are top-level classes, of which 84% are public;
> the rest are necessarily package-private. The remaining 38% of @AutoValue
> classes are nested classes, divided evenly between public and
> package-private (none are protected or private, because @AutoValue does not
> support such visibilities). Thus, almost 75% of @AutoValue classes are
> public, suggesting they are used as part of some contract between two or
> more classes.
>
> This is a bit higher than I would have guessed. I suspect it is because
> while @AutoValue does a good job of meeting the “semantic goals” of
> records, it still has a fair amount of boilerplate, and developers do not
> like to go to the trouble of defining one for one-off data types used
> within a method. Consider Brian’s topThreePeople example. I have reproduced
> it below for convenience, and included an alternate implementation using
> @AutoValue:
>
> public class PersonDatabase {
>     List<Person> topThreePeopleUsingRecord(List<Person> list) {
>         record PersonX(Person p, int hash) {
>             PersonX(Person p) {
>                 this(p, p.name().toUpperCase().hashCode());
>             }
>         }
>
>         return list.stream()
>                    .map(PersonX::new)
>                    .sorted(Comparator.comparingInt(PersonX::hash))
>                    .limit(3)
>                    .map(PersonX::p)
>                    .collect(toList());
>     }
>
>     @AutoValue abstract static class HashedPerson {
>         public abstract Person p();
>         public abstract int hash();
>         public static HashedPerson create(Person p) {
>             return new AutoValue_PersonDatabase_HashedPerson(p, p.name
> ().toUpperCase().hashCode());
>         }
>     }
>
>     List<Person> topThreePeopleUsingAutoValue(List<Person> list) {
>         return list.stream()
>                    .map(HashedPerson::create)
>                    .sorted(Comparator.comparingInt(HashedPerson::hash))
>                    .limit(3)
>                    .map(HashedPerson::p)
>                    .collect(toList());
>     }
> }
>
> The @AutoValue “record” takes 7 lines to declare instead of 5, requires
> you to look up or remember the special naming scheme it uses, and also
> “leaks” into the enclosing class from the method where it really belongs.
> If we want records to be more useful as implementation details, we should
> ensure there is a very low-overhead way of defining them. Brian’s current
> proposal is promising in this regard, allowing a simple one-line definition
> for lightweight records.
>
> Mutability
>
> The most recent restriction proposed for records is that all fields must
> be final. Google broadly encourages immutability, and so we support this
> idea, but can we prove that developers agree? It’s hard to collect unbiased
> data on this: since @AutoValue doesn’t define fields, but rather defines
> named accessors and hides the fields from you, there is no way for a
> developer to say “hey wait, I wanted a mutable field”, except by defining
> the field themselves...but even this is hard to do! @AutoValue allows you
> to define fields independently, but it will only call the no-argument
> constructor of your class, so there’s no way to initialize those fields
> except by relying on side effects. So, developers who really want a mutable
> field won’t be using @AutoValue, and won’t appear in the data I collected.
>
> However, we have static analysis tools that issue compiler warnings if you
> put an array or other mutable object (e.g. collection) in an @AutoValue. So
> instead of looking at the current state of the codebase, I looked at data
> about how developers reacted to the static analysis warnings. I  sampled
> 304 instances of warnings where someone felt strongly enough to point them
> out during code review: 272 of these actions were to say “this is a good
> warning, and you should fix your class”, and 32 were to say “This warning
> is not useful in this case.” I do not have data for developers who saw the
> warning and fixed it on their own before getting to code review.
>
> This warning is a relatively recent addition to our static analysis
> tooling, so there are some committed instances of @AutoValue classes with
> array fields from before that time, and additionally some cases where
> developers have reacted to the new warning by simply adding
> @SuppressWarnings to their existing @AutoValue class. 0.49% of @AutoValue
> classes have an array member, and 0.06% of @AutoValueclasses contain a
> @SuppressWarnings annotation for this warning.
>
> So, broadly it seems that developers agree that it’s better for records to
> be deeply immutable, but a small percentage of rebels yearn to mutate their
> data carriers, or at any rate don’t want to refactor their legacy code to
> hew closer to the semantics of records.
>
> Construction
>
> When defining an @AutoValue, you don’t get a public constructor for free,
> the way a proposed record would. Instead, you get a private generated
> constructor for free, and must either define a static factory method that
> delegates to the constructor, or define an abstract class to act as a
> builder for you; in the latter case, @AutoValue implements the builder, but
> there is still a fair bit more code to write in defining the methods that
> the builder class should have.
>
> Half of @AutoValue classes do the simplest thing: they define a single
> static factory which delegates to the generated constructor. 10% define two
> different factories. These groups, totaling 60% of @AutoValue classes, map
> well onto records as currently proposed. However, a third of @AutoValue
> classes think it’s worth the trouble to define a builder instead of, or in
> addition to, the static factory, even though they have to write a bunch
> more code to support it. This is an area where records could serve
> developers’ needs better, by offering some kind of opt-in support for
> generating a builder to go with your data carrier.
>
> But why do people want a builder? How do they use it? Perhaps what they
> really want is named arguments, or default arguments. A builder may just be
> the best @AutoValue can do with the language as-is, but a new feature can
> try something bolder. In some use cases I see, every use of the builder
> sets every field explicitly, so the main advantage of a builder is that the
> field names are associated with their values at the construction call site.
> Such classes would probably be happy with a constructor with named
> arguments.
>
> On the other hand, I also see use cases where the builder is used to avoid
> specifying values for Optional<T> fields. Consider:
> public record Response(Optional<ServiceId> provider,
>                        Optional<ResponseType> responseType,
>                        Optional<Action> action,
>                        Optional<String> referenceUrl) {
>   // Empty. This is a very bland record.
> }
>
> // ...
>
> public Response respond(Action action) {
>   return new Response(Optional.empty(), Optional.of(ResponseType.ACTION),
> Optional.of(action), Optional.empty());
> }
>
> // ...
>
> public Response redirect(String url) {
>   return new Response(Optional.empty(), Optional.empty(),
> Optional.empty(), Optional.of(url));
> }
>
> All the Optional wrappers have muddied up the call sites a lot, and the
> use of positional constructor parameters makes it hard to tell what is
> being specified in each usage. The @AutoValue version of this record uses a
> builder, and so replaces redirect with:
>
> public Response redirect(String url) {
>   return Response.builder().referenceUrl(url).build();
> }
>
> “Modification”
>
> Of course with records being immutable, you can’t modify an existing
> record. But is it common to ask for a “modified version” of a record,
> copying a subset of fields but changing others? An often-suggested feature
> for records is support for “wither methods”: methods like
>
> MyRecord withFoo(int newFoo) {return new MyRecord(newFoo, this.bar);}
>
> As it turns out, defining methods like these is not very common. 3% of
> @AutoValue classes C have at least one instance method returning C -
> probably not all of these are “wither” methods, but many of them are. This
> is a small enough percentage of classes that we could reasonably exclude
> this feature from records: “if you want it that badly, you can do it
> yourself”.
>
> @AutoValue supports another kind of “modification” that I expected to be
> more popular: toBuilder(). If your data carrier uses a builder for its
> construction, you can ask @AutoValue to generate a toBuilder() method,
> which converts an existing value into a builder, so that you can ask for a
> subset of fields to be changed before solidifying back down into an
> immutable value. But it turns out this feature is used very rarely: only
> 1.5% of @AutoValue classes with builders use this feature, which is less
> than 1% of all @AutoValue classes. So even considering wither methods and
> toBuilder together, less than 5% of @AutoValue classes use this feature.
>
> Perhaps if records could define builders and withers for you automatically
> and with very little boilerplate, these features would be used more often,
> but they don’t seem to fill a need so common that developers feel compelled
> to write them by hand. It doesn’t seem like a high priority to support
> wither methods, or toBuilder(), even if support for builders is added.
>
> Hidden State
>
> How will developers feel about the restriction that each field corresponds
> to a constructor parameter and a public accessor? Will they wish they could
> have some local state? We can look at two things in @AutoValue classes to
> identify developers who fit into these categories. First, they may define
> private fields which do not participate in @AutoValue generation. This
> turns out to be quite rare: less than 1% of @AutoValue classes have such
> properties. It makes sense to not support this, as hidden state both goes
> against the semantic goals of records and would go unused by most
> developers.
>
> However, there is a more restricted notion of private “state” that may be
> more suitable, and which @AutoValue supports directly: memoization of
> derived properties. Developers can tag any nullary method with @Memoize,
> and the generated @AutoValue class will cache the return value of that
> method in a private field. This seems reasonably compatible with the
> semantic goals of records, and could be worth supporting if it is used
> regularly.
>
> However, despite being very easy to use, @Memoize is not very popular.
> Only 1.4% of @AutoValue classes memoize any properties. The most obvious
> things to memoize are hashCode and toString, and those are indeed the two
> most-memoized methods, but in total it is still pretty rare. Of @AutoValue
> classes which memoize something, only 14% memoize these methods: most have
> some other derived property that they want to cache.
>
> So, while it might be nice to offer support for lazy/cached methods,
> leaving it out will likely not have a significant impact on record
> adoption. If lazy instance fields ever make it into the language, we can
> retrofit them into records at that time. If memoization support is
> included, it should cover all properties, not just Object overrides.
> Manually Written Methods
> Both records and @AutoValue will automatically provide correct
> implementations of equals(Object) and hashCode(), as well as a reasonable
> toString(). How often do developers feel the need to override these methods?
>
> toString(), it turns out, is most common by a landslide, but still rare:
> 3% of @AutoValue classes have a manual implementation of toString(). Some
> examples:
>
> @AutoValue public abstract class Constraint {
>     // ...
>     @Override public final String toString() {
>       return String.format(
>           "Constraint_%s_%s_%s_%s_%s",
>           cluster().name(), machine().name(),
>           machineIntent(), subinterval(), constraint());
>     }
> }
>
> @AutoValue public abstract class SensitiveString {
>   public abstract String getValue();
>
>   public static SensitiveString of(String value) {
>     return new AutoValue_SensitiveString(value);
>   }
>
>   // Prevents sensitive strings accidentally being rendered.
>   @Override public final String toString() {
>     return "*";
>   }
> }
>
> equals(Object) and hashCode() are only overridden around 0.5% of the time.
> Developers are generally happy with auto-generated value semantics for
> their simple data carriers. I looked at some of the overriding
> implementations of these methods - they often just wanted a hashCode that
> was faster, at the expense of having more collisions. In one case I found,
> one of the fields being wrapped was of a class with an incorrect hashCode
> implementation, and so the @AutoValue author hashed it externally. To allow
> workarounds like this, allowing overrides is a good idea, but we can expect
> it to be used rarely if the automatic implementations of Object methods are
> generally suitable.
>
> In addition to overriding Object methods, there are other method
> signatures that crop up multiple times. Most common, although still less
> common than toString(), are conversion functions like toJson, toProto(), or
> toBuilder() (see Construction section). Much more rare, at around 0.1%, are
> methods like iterator() and size(): some @AutoValue classes wrap a single
> ImmutableCollection of some kind, and implement methods that delegate to
> this field. This could be an argument in favor of the recent
> method-delegation proposal, but it is a pretty rare thing to do, and many
> of these cases are really not a great idea: they should just call
> foo.coll().iterator() instead of foo.iterator(), and having Foo implement
> Iterable brings relatively little benefit.
>
> Footnote: Google’s Codebase
>
> A brief reminder about the value of using Google’s codebase to answer
> questions like these. Our codebase is large, easy to analyze, and highly
> cultivated, through static-analysis tools, enforced code-review, etc. In
> some ways, it does represent “what good Java code looks like”, but it also
> has some peculiarities, such as a weird fascination with protobufs. So,
> keep in mind that when I make claims about how code looks, I am talking
> specifically about Google’s codebase, and not about all Java code in the
> universe.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/amber-spec-experts/attachments/20190403/c710166e/attachment-0001.html>