Addressing the full range of use cases

Mon Oct 4 23:34:37 UTC 2021

When we talk about use cases for Valhalla, we've often considered a very broad set of class abstractions that represent immutable, identity-free data. JEP 401 mentions varieties of integers and floats, points, dates and times, tuples, records, subarrays, cursors, etc. However, as shorthand this broad set often gets reduced to an example like Point or Int128, and these latter examples are not necessarily representative of all candidate value types.  

Specifically, our favorite example classes have a property that doesn't generalize: they'll happily accept any combination of field values as a valid instance. (In fact, they're even happy to accept any combination of *bits* of the appropriate length.) Many candidate primitive classes don't have this property—the constructors do important validation work, and only certain combinations of fields are allowed to represent valid instances.
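
To make the contrast concrete, here is a minimal sketch in today's Java, using ordinary records as stand-ins (nothing here uses Valhalla syntax). The names `ValidationDemo`, `Point`, `Fraction`, and `accepts` are hypothetical and chosen only for illustration: `Point` accepts every pair of ints, while `Fraction` relies on its constructor to reject a zero denominator.

```java
class ValidationDemo {
    // "Works like an int": every combination of field values is a valid Point.
    record Point(int x, int y) {}

    // Hypothetical encapsulated class: only some field combinations are valid.
    record Fraction(int num, int den) {
        Fraction {
            // The constructor is the only thing guarding the invariant.
            if (den == 0) {
                throw new IllegalArgumentException("denominator must be nonzero");
            }
        }
    }

    // Returns true if the Fraction constructor accepts the given fields.
    static boolean accepts(int num, int den) {
        try {
            new Fraction(num, den);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(new Point(0, 0));   // all zeros is a fine Point
        System.out.println(accepts(1, 2));     // a valid Fraction
        System.out.println(accepts(0, 0));     // all zeros is rejected
    }
}
```

Any mechanism that produces an instance without running the constructor (a zero-initialized default, or a torn write) can hand out a `Fraction` that the class author never agreed to.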

Related areas of concern that we've had on the radar for a while:

- The "all zeros is your default value" strategy forces an all-zero instance into the class's value set, even if that doesn't make sense for the class. Many candidate classes have no reasonable default at all, leading naturally to a wish for "null is your default value" (or other, more exotic, strategies involving revisiting the idea that every type has a default value). We've provided 'P.ref' for those use sites that *need* null, but haven't provided a complete story for value types that want it to be *their* default value, too.

- Non-atomic heap updates can be used to create new instances that arbitrarily combine previously-validated instances' fields. There is no guarantee that the new combination of fields is semantically valid. Again, while there's precedent for this with 'double' and 'long' (JLS 17.7), those are special cases that don't generalize: any combination of a double's bits is *still a valid double*. (This is usually described as "tearing", although JLS 17.6 has something else in mind when it uses that word...) The language provides 'volatile' as a use-site opt-in to atomicity, and we've toyed with a declaration-site opt-in as well. But object integrity being "off" by default may not be ideal.

- Existing class types like LocalDate are both nullable and atomic. These are useful properties to preserve during migration; nullability, in particular, is essential for source compatibility. We've provided reference-default declarations as a mechanism to make reference types (which have these properties) the default, with 'P.val' as an opt-in to value types. But in doing so we take away the many benefits of value types by default, and force new code to work with the "bad name".
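
The first two concerns can be illustrated with a deterministic, single-threaded simulation in today's Java. This is only a sketch under stated assumptions: `Interval`, `FlatSlot`, and their methods are hypothetical names, the two-field `FlatSlot` merely stands in for a flattened heap location, and the "interleaving" is staged by hand rather than produced by racing threads. It is not a description of how any actual Valhalla implementation lays out or updates values.

```java
class FlatteningSketch {
    // Hypothetical encapsulated class: a non-empty interval [lo, hi), so lo < hi.
    record Interval(long lo, long hi) {
        Interval {
            if (!isValid(lo, hi)) {
                throw new IllegalArgumentException("need lo < hi");
            }
        }

        static boolean isValid(long lo, long hi) {
            return lo < hi;
        }
    }

    // Stand-in for a flattened location: two raw slots, initially all zeros.
    static final class FlatSlot {
        long lo, hi;

        // A non-atomic store modeled as two separate steps.
        void writeLo(Interval v) { lo = v.lo(); }
        void writeHi(Interval v) { hi = v.hi(); }

        // What a reader would find if it looked at the slot right now.
        boolean holdsValidInterval() { return Interval.isValid(lo, hi); }
    }

    public static void main(String[] args) {
        FlatSlot slot = new FlatSlot();
        // All-zero default: (0, 0) never went through the constructor.
        System.out.println("default valid?   " + slot.holdsValidInterval());

        Interval a = new Interval(1, 2);
        slot.writeLo(a);
        slot.writeHi(a);
        System.out.println("after storing a: " + slot.holdsValidInterval());

        Interval b = new Interval(5, 6);
        slot.writeLo(b);
        // A reader observing the slot here sees (5, 2): b's lo with a's hi.
        System.out.println("mid-store of b:  " + slot.holdsValidInterval());
        slot.writeHi(b);
        System.out.println("after storing b: " + slot.holdsValidInterval());
    }
}
```

Both `a` and `b` were validated, yet the slot passes through two states (the initial zeros and the half-written store) that the constructor would have rejected. For a class like Point, the same intermediate states would simply be other valid points.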

While we can provide enough knobs to accommodate all of these special cases, we're left with a complex user model which asks class authors to make n different choices they may not immediately grasp the consequences of, and class users to keep 2^n different categories straight in their heads.

As an alternative, we've been exploring whether a simpler model is workable. It is becoming clear that there are (at least) two clusters of uses for value types.  The "classic" value types are like numerics -- they'll happily accept any combination of field values as a valid instance, and the zero value is a sensible (often the best possible) default value.  They make relatively little use of encapsulation.  These are the ones that best "work like an int."  The "encapsulated" value types are those that are more like typical aggregates ("codes like a class") -- their constructors do important validation work, and only certain combinations of fields are allowed to represent valid instances.  These are more likely to not have valid zero values (and hence want to be nullable).  

Some questions to consider for this approach:

- How do we group features into clusters so that they meet the sweet spot of user expectations and use cases while minimizing complexity? Is two clusters the right number? Is two already too many? (And what do we call them? What keywords best convey the intended intuitions?)

- If there are knobs within the clusters, what are the right defaults? E.g., should atomicity be opt-in or opt-out?

- What are the performance costs (or, in the other direction, performance gains) associated with each feature? For certain feature combinations, have we canceled out the performance gains over identity classes? (And at that point, is that combination even worth supporting?)