How records would fit into Google's codebase

Wed Apr 3 17:27:54 UTC 2019

Hello, amber-spec-experts. I have done some analysis of Google's codebase,
to see if it can give us any concrete ideas about hoe developers might use
records, and about how we might design records to make them more useful.

A plain-text version of my report follows below - I had prepared a lovely
version with formatting and inline links, but was advised that plain text
is better for this forum. I've enclosed the original HTML document as an
attachment to this email, in case that works better.

Record Proposal Feedback Based On @AutoValue

The most recent version of the records proposal puts many restrictions on
records, with the aim of producing a more focused, opinionated tool. Most
notably: record fields must be final; record classes must be final; field
accessors must be public. There has broadly been support for these ideas,
from Google and from other JDK contributors: it appeals to our
sensibilities. However, there have also been some questions asked about
whether the restrictions imposed will hinder adoption, and it’s hard to
estimate that in the abstract.

Google would like to ensure Java gets the best possible version of the
record feature, and so in addition to thinking abstractly about what
features sound good for records, I have spent some time collecting concrete
data from the Google codebase, to determine how well a version of records
would fit in there, and what changes to the records proposal might make it
better.

A natural starting point is to compare with @AutoValue, an annotation
processor Google publishes and uses to auto-generate implementations for
simple data classes (for details, see
https://github.com/google/auto/blob/master/value/userguide/index.md). We
use these internally for roughly the same kinds of things records might be
used for, and so data about how they are used will help when evaluating the
impact of various facets of the records proposal. Accordingly, I have
collected some statistics about all the @AutoValue classes in Google’s
codebase.

Below I share this data, and present some recommendations based on it. When
it is useful and possible, I also include simplified and fictionalized code
samples based on the real code in Google’s codebase, to clarify what
patterns I am talking about.

One additional thing I would have liked to do is to somehow find classes
which were “almost” records, but which ended up not using @AutoValue. A
survey of these might help us decide what changes we could make to increase
adoption, or simply confirm for us, “Yes, it’s a good thing we included
restriction X, because this class is a bad candidate for a record, but
without restriction X it might have become a record”. Alas, unsurprisingly,
it is much easier to find actual @AutoValue classes than to design a
heuristic for “almost an @AutoValue”, and so I have not done this.

Summary and Recommendations (TL;DR)

For those who don’t care to slog through the whole document, I present here
a summary of recommendations. Many of these recommendations simply echo
what is already in the proposal, because the data I found supports the
proposal.

* Records can expect to be about as common as enums
* Records should be allowed to implement interfaces, and perhaps to extend
a superclass, but should not allow their own subclasses (except perhaps
abstract records)
* Records should have a very lightweight syntax for the common case,
preferably one line. @AutoValue’s clunkier syntax may be reducing adoption
for some use cases.
* Records should be immutable. The language should enforce shallow
mutability (by making all fields final), and style guides should recommend
that deep immutability.
* We should consider adding an alternative way to construct records besides
a constructor with positional parameters. Builders are popular for
@AutoValue, but perhaps something better could be done for records as a
language feature.
* Records do not need language-level support for withFoo methods, or
toBuilder, even if builder support is included.
* Records do not need a way to include private/hidden state, or to memoize
derived properties
* Records should allow implementing Object methods by hand, rejecting the
auto-generated implementation, but expect this to be done rarely

Popularity

One simple question to ask is: how often would records be used? If records
end up being used very rarely, we may regret spending too much time
designing and implementing them, or may wish we had made them more flexible
instead of more restrictive. To measure popularity, I simply asked “what
percentage of all named, non-local classes are @AutoValue classes?” I limit
to named, non-local classes because these are the scopes in which
@AutoValue is able to operate; a record integrated more tightly with the
language could replace local classes, but @AutoValue can’t. The answer is:
3.0% of Google’s named, concrete classes are @AutoValue classes. Is that a
lot, or a little? In isolation it’s hard to say. For comparison, I asked
how many of Google’s classes fall into other interesting categories. At two
opposite ends of the spectrum, 8.9% of classes are anonymous, while just
0.05% of classes are method-local named classes. A particularly promising
comparison is to enums, a feature which similarly gives up some flexibility
for increased expressive power, and to which Brian refers in the records
proposal. Of Google’s named, concrete classes, 3.7% of them are enums.

So, it seems we are past the first hurdle: there is a healthy appetite for
“simple immutable data carriers”, if we make a compelling offering. We can
expect them to be less popular than anonymous classes, but roughly as
popular as enums. Or perhaps more popular: quite possibly, developers would
be more keen to use records if they were a language feature instead of a
Google library. It’s hard to know for sure.

Superclasses

How often do @AutoValue classes make use of inheritance? 77% of @AutoValue
classes are “islands” in the inheritance graph: they extend Object, and
implement no interfaces. 15% of @AutoValue classes extend Object and
implement exactly one interface. A mere 4% extend some class other than
Object, and implement no interfaces. Very few do anything else (implement
2+ interfaces, or implement interface(s) while also extending a class).

Just like @AutoValue, the current proposal plans to allow implementing
interfaces and/or extending a class. That seems reasonable, but extending a
class is rare enough that we could consider forbidding it if that fits
better onto the semantic goals of “simple data carriers”: it won’t harm too
large a percentage of the usages. Perhaps, for example, we could reserve
the extends syntax for the future possibility of extending “abstract
records” that Brian suggested. Still, it would be reasonable to allow
normal subclassing, if we judge that it helps achieve the semantic goals of
records.

One possible source of skew in this data: since @AutoValue determines its
fields by identifying abstract 0-argument methods, rather than by
explicitly listing them, some @AutoValue classes “inherit their API” by
extending an abstract class, or implementing an interface, containing some
abstract accessor methods. I don’t have a good estimate for how common this
is. Such uses are perhaps arguments in favor of the “extending abstract
records” feature. However, it is also interesting for a non-record class to
conform to the same interface. Consider, for example, this pattern I have
seen a few times:

public interface JobInfo {
  String session();
  boolean privileged();
  Instant startTime();
}

@AutoValue public abstract class FakeJobInfo implements JobInfo {
  // Note no fields specified: they are inherited from JobInfo!
  Builder builder() {return new AutoValue_FakeJobInfo.Builder();}
  @AutoValue.Builder public interface Builder {
    Builder setSession(String session);
    Builder setPrivileged(boolean privileged);
    Builder setStartTime(Instant startTime);
    FakeJobInfo build();
  }
}

public class JobRegistry {
  private Database database;
  // ...
  private class JobInfoImpl implements JobInfo {
    private JobId id;
    public JobInfoImpl(JobId id) {this.id = id;}
    public String session() {return database.lookupSession(id);}
    public boolean privileged() {return database.isPrivileged(id);}
    public Instant startTime() {
      return Instant.ofEpochSecond(database.startTime(id));
    }
  }
}

Of course I have simplified the logic in JobInfoImpl, but the idea is that
there is an interface for looking things up, a fake implementation (used in
tests) that is a simple record, and a non-record implementation for
production use that gets its data from some other source.

I think the records proposal as written already supports this use case
semantically: we can define the interface first, then define a record that
implements it. However, we don’t get any code for free: we have to repeat
the list of properties, defining them once in the interface and then again
in the record’s list of fields. One interesting possibility would be some
connection between records and interfaces. Perhaps a record definition
could, upon request, also produce an interface that can be conformed to by
some non-record implementation. Alternatively, a record could inherit
fields based on the interfaces it implements, as @AutoValue does; however,
since records do not normally have their fields defined via abstract
methods, I think this approach fits less well for records than it does for
@AutoValue.

Subclasses

Another inheritance-related question: should subclasses of records be
allowed? Google’s @AutoValue documentation strongly discourages this, but
we cannot make it illegal because @AutoValue uses subclassing as an
implementation detail (it generates a subclass of your abstract class). So,
we can ask how many authors decided to write additional subclasses despite
the warnings. Just 0.26% of @AutoValue classes have subclasses (aside from
the auto-generated one that we expect). An inspection of some of these
instances doesn’t suggest a compelling case for subclassing of records.
These subclasses are mostly stubs for testing: for example, overriding
accessors to throw an exception, or to return a value that could have just
been specified in the constructor. The authors do not seem to be
intentionally working around some restriction of @AutoValue.

One example:

@AutoValue public abstract class Document {
  public abstract String text();
  public abstract Language language();
  // and 4 more fields...
}

public class DocumentTestHelper {
  public static Document instance() {return INSTANCE;}
  private static final Document INSTANCE = new ThrowingDocument();
  private static class ThrowingPoint extends Point {
    private UnsupportedOperationException cannotDoThis() {
      return new UnsupportedOperationException("Cannot use this!");
    }
    @Override public String text() {throw cannotDoThis();}
    @Override public Language language() {throw cannotDoThis();}
    // and 4 more fields doing the same thing...
  }
}

The rarity of subclasses (and lack of any convincing subclasses) argues in
favor of the restriction that all records must be final (except, perhaps,
some kind of abstract record).

Visibility and Scoping

I wanted to know whether people are using @AutoValue mostly for public API
“contracts”, or for internal implementation details. To answer this
question, I asked of each @AutoValue class whether it a nested class or
top-level, and what visibility modifier it declares. This misses some
subtlety: I didn’t pay attention to effective visibility, so a public
@AutoValue nested inside a package-private class would look public in this
analysis.

62% of @AutoValue classes are top-level classes, of which 84% are public;
the rest are necessarily package-private. The remaining 38% of @AutoValue
classes are nested classes, divided evenly between public and
package-private (none are protected or private, because @AutoValue does not
support such visibilities). Thus, almost 75% of @AutoValue classes are
public, suggesting they are used as part of some contract between two or
more classes.

This is a bit higher than I would have guessed. I suspect it is because
while @AutoValue does a good job of meeting the “semantic goals” of
records, it still has a fair amount of boilerplate, and developers do not
like to go to the trouble of defining one for one-off data types used
within a method. Consider Brian’s topThreePeople example. I have reproduced
it below for convenience, and included an alternate implementation using
@AutoValue:

public class PersonDatabase {
    List<Person> topThreePeopleUsingRecord(List<Person> list) {
        record PersonX(Person p, int hash) {
            PersonX(Person p) {
                this(p, p.name().toUpperCase().hashCode());
            }
        }

        return list.stream()
                   .map(PersonX::new)
                   .sorted(Comparator.comparingInt(PersonX::hash))
                   .limit(3)
                   .map(PersonX::p)
                   .collect(toList());
    }

    @AutoValue abstract static class HashedPerson {
        public abstract Person p();
        public abstract int hash();
        public static HashedPerson create(Person p) {
            return new AutoValue_PersonDatabase_HashedPerson(p, p.name
().toUpperCase().hashCode());
        }
    }

    List<Person> topThreePeopleUsingAutoValue(List<Person> list) {
        return list.stream()
                   .map(HashedPerson::create)
                   .sorted(Comparator.comparingInt(HashedPerson::hash))
                   .limit(3)
                   .map(HashedPerson::p)
                   .collect(toList());
    }
}

The @AutoValue “record” takes 7 lines to declare instead of 5, requires you
to look up or remember the special naming scheme it uses, and also “leaks”
into the enclosing class from the method where it really belongs. If we
want records to be more useful as implementation details, we should ensure
there is a very low-overhead way of defining them. Brian’s current proposal
is promising in this regard, allowing a simple one-line definition for
lightweight records.

Mutability

The most recent restriction proposed for records is that all fields must be
final. Google broadly encourages immutability, and so we support this idea,
but can we prove that developers agree? It’s hard to collect unbiased data
on this: since @AutoValue doesn’t define fields, but rather defines named
accessors and hides the fields from you, there is no way for a developer to
say “hey wait, I wanted a mutable field”, except by defining the field
themselves...but even this is hard to do! @AutoValue allows you to define
fields independently, but it will only call the no-argument constructor of
your class, so there’s no way to initialize those fields except by relying
on side effects. So, developers who really want a mutable field won’t be
using @AutoValue, and won’t appear in the data I collected.

However, we have static analysis tools that issue compiler warnings if you
put an array or other mutable object (e.g. collection) in an @AutoValue. So
instead of looking at the current state of the codebase, I looked at data
about how developers reacted to the static analysis warnings. I  sampled
304 instances of warnings where someone felt strongly enough to point them
out during code review: 272 of these actions were to say “this is a good
warning, and you should fix your class”, and 32 were to say “This warning
is not useful in this case.” I do not have data for developers who saw the
warning and fixed it on their own before getting to code review.

This warning is a relatively recent addition to our static analysis
tooling, so there are some committed instances of @AutoValue classes with
array fields from before that time, and additionally some cases where
developers have reacted to the new warning by simply adding
@SuppressWarnings to their existing @AutoValue class. 0.49% of @AutoValue
classes have an array member, and 0.06% of @AutoValueclasses contain a
@SuppressWarnings annotation for this warning.

So, broadly it seems that developers agree that it’s better for records to
be deeply immutable, but a small percentage of rebels yearn to mutate their
data carriers, or at any rate don’t want to refactor their legacy code to
hew closer to the semantics of records.

Construction

When defining an @AutoValue, you don’t get a public constructor for free,
the way a proposed record would. Instead, you get a private generated
constructor for free, and must either define a static factory method that
delegates to the constructor, or define an abstract class to act as a
builder for you; in the latter case, @AutoValue implements the builder, but
there is still a fair bit more code to write in defining the methods that
the builder class should have.

Half of @AutoValue classes do the simplest thing: they define a single
static factory which delegates to the generated constructor. 10% define two
different factories. These groups, totaling 60% of @AutoValue classes, map
well onto records as currently proposed. However, a third of @AutoValue
classes think it’s worth the trouble to define a builder instead of, or in
addition to, the static factory, even though they have to write a bunch
more code to support it. This is an area where records could serve
developers’ needs better, by offering some kind of opt-in support for
generating a builder to go with your data carrier.

But why do people want a builder? How do they use it? Perhaps what they
really want is named arguments, or default arguments. A builder may just be
the best @AutoValue can do with the language as-is, but a new feature can
try something bolder. In some use cases I see, every use of the builder
sets every field explicitly, so the main advantage of a builder is that the
field names are associated with their values at the construction call site.
Such classes would probably be happy with a constructor with named
arguments.

On the other hand, I also see use cases where the builder is used to avoid
specifying values for Optional<T> fields. Consider:
public record Response(Optional<ServiceId> provider,
                       Optional<ResponseType> responseType,
                       Optional<Action> action,
                       Optional<String> referenceUrl) {
  // Empty. This is a very bland record.
}

// ...

public Response respond(Action action) {
  return new Response(Optional.empty(), Optional.of(ResponseType.ACTION),
Optional.of(action), Optional.empty());
}

// ...

public Response redirect(String url) {
  return new Response(Optional.empty(), Optional.empty(), Optional.empty(),
Optional.of(url));
}

All the Optional wrappers have muddied up the call sites a lot, and the use
of positional constructor parameters makes it hard to tell what is being
specified in each usage. The @AutoValue version of this record uses a
builder, and so replaces redirect with:

public Response redirect(String url) {
  return Response.builder().referenceUrl(url).build();
}

“Modification”

Of course with records being immutable, you can’t modify an existing
record. But is it common to ask for a “modified version” of a record,
copying a subset of fields but changing others? An often-suggested feature
for records is support for “wither methods”: methods like

MyRecord withFoo(int newFoo) {return new MyRecord(newFoo, this.bar);}

As it turns out, defining methods like these is not very common. 3% of
@AutoValue classes C have at least one instance method returning C -
probably not all of these are “wither” methods, but many of them are. This
is a small enough percentage of classes that we could reasonably exclude
this feature from records: “if you want it that badly, you can do it
yourself”.

@AutoValue supports another kind of “modification” that I expected to be
more popular: toBuilder(). If your data carrier uses a builder for its
construction, you can ask @AutoValue to generate a toBuilder() method,
which converts an existing value into a builder, so that you can ask for a
subset of fields to be changed before solidifying back down into an
immutable value. But it turns out this feature is used very rarely: only
1.5% of @AutoValue classes with builders use this feature, which is less
than 1% of all @AutoValue classes. So even considering wither methods and
toBuilder together, less than 5% of @AutoValue classes use this feature.

Perhaps if records could define builders and withers for you automatically
and with very little boilerplate, these features would be used more often,
but they don’t seem to fill a need so common that developers feel compelled
to write them by hand. It doesn’t seem like a high priority to support
wither methods, or toBuilder(), even if support for builders is added.

Hidden State

How will developers feel about the restriction that each field corresponds
to a constructor parameter and a public accessor? Will they wish they could
have some local state? We can look at two things in @AutoValue classes to
identify developers who fit into these categories. First, they may define
private fields which do not participate in @AutoValue generation. This
turns out to be quite rare: less than 1% of @AutoValue classes have such
properties. It makes sense to not support this, as hidden state both goes
against the semantic goals of records and would go unused by most
developers.

However, there is a more restricted notion of private “state” that may be
more suitable, and which @AutoValue supports directly: memoization of
derived properties. Developers can tag any nullary method with @Memoize,
and the generated @AutoValue class will cache the return value of that
method in a private field. This seems reasonably compatible with the
semantic goals of records, and could be worth supporting if it is used
regularly.

However, despite being very easy to use, @Memoize is not very popular. Only
1.4% of @AutoValue classes memoize any properties. The most obvious things
to memoize are hashCode and toString, and those are indeed the two
most-memoized methods, but in total it is still pretty rare. Of @AutoValue
classes which memoize something, only 14% memoize these methods: most have
some other derived property that they want to cache.

So, while it might be nice to offer support for lazy/cached methods,
leaving it out will likely not have a significant impact on record
adoption. If lazy instance fields ever make it into the language, we can
retrofit them into records at that time. If memoization support is
included, it should cover all properties, not just Object overrides.
Manually Written Methods
Both records and @AutoValue will automatically provide correct
implementations of equals(Object) and hashCode(), as well as a reasonable
toString(). How often do developers feel the need to override these methods?

toString(), it turns out, is most common by a landslide, but still rare: 3%
of @AutoValue classes have a manual implementation of toString(). Some
examples:

@AutoValue public abstract class Constraint {
    // ...
    @Override public final String toString() {
      return String.format(
          "Constraint_%s_%s_%s_%s_%s",
          cluster().name(), machine().name(),
          machineIntent(), subinterval(), constraint());
    }
}

@AutoValue public abstract class SensitiveString {
  public abstract String getValue();

  public static SensitiveString of(String value) {
    return new AutoValue_SensitiveString(value);
  }

  // Prevents sensitive strings accidentally being rendered.
  @Override public final String toString() {
    return "*";
  }
}

equals(Object) and hashCode() are only overridden around 0.5% of the time.
Developers are generally happy with auto-generated value semantics for
their simple data carriers. I looked at some of the overriding
implementations of these methods - they often just wanted a hashCode that
was faster, at the expense of having more collisions. In one case I found,
one of the fields being wrapped was of a class with an incorrect hashCode
implementation, and so the @AutoValue author hashed it externally. To allow
workarounds like this, allowing overrides is a good idea, but we can expect
it to be used rarely if the automatic implementations of Object methods are
generally suitable.

In addition to overriding Object methods, there are other method signatures
that crop up multiple times. Most common, although still less common than
toString(), are conversion functions like toJson, toProto(), or toBuilder()
(see Construction section). Much more rare, at around 0.1%, are methods
like iterator() and size(): some @AutoValue classes wrap a single
ImmutableCollection of some kind, and implement methods that delegate to
this field. This could be an argument in favor of the recent
method-delegation proposal, but it is a pretty rare thing to do, and many
of these cases are really not a great idea: they should just call
foo.coll().iterator() instead of foo.iterator(), and having Foo implement
Iterable brings relatively little benefit.

Footnote: Google’s Codebase

A brief reminder about the value of using Google’s codebase to answer
questions like these. Our codebase is large, easy to analyze, and highly
cultivated, through static-analysis tools, enforced code-review, etc. In
some ways, it does represent “what good Java code looks like”, but it also
has some peculiarities, such as a weird fascination with protobufs. So,
keep in mind that when I make claims about how code looks, I am talking
specifically about Google’s codebase, and not about all Java code in the
universe.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/amber-spec-experts/attachments/20190403/213c72f0/attachment-0001.html>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.java.net/pipermail/amber-spec-experts/attachments/20190403/213c72f0/JDKRecordsProposalReport-0001.html>