Revisiting default values

Brian Goetz brian.goetz at oracle.com
Wed Mar 17 15:14:26 UTC 2021


Let me propose another strategy for Bucket 3.  It could be implemented 
at either the VM or language level, but the latter probably needs some 
help from the VM anyway.  The idea is that the default value is 
_indistinguishable from null_.  Strawman:

  - Classes can be marked as default-hostile (e.g., `primitive class X 
implements NoGoodDefault`);
  - Prior to dereferencing a default-hostile class, a check is made 
against the default value, and an NPE is thrown if it is the default value;
  - When widening to a reference type, a check is made if it is the 
default value, and if so, is converted to null;
  - When narrowing from a reference type, a check is made for null, and 
if so, converted to the default value;
  - It is allowable to compare `x == null`, which is interpreted as 
"widen x to X.ref, and compare";
  - (optional) the interface NoGoodDefault could have a method that 
optimizes the check, such as by using a pivot field, or the language/VM 
could try to automatically pick a pivot field.
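
The conversions in the strawman can be approximated in today's Java. This is only a rough emulation under stated assumptions: `UserId`, its pivot field `id`, and the `widen`/`narrow` helpers are hypothetical names, and an ordinary identity class stands in for both the X.val and X.ref projections.

```java
// Plain-Java emulation of the strawman; no Valhalla features are used.
// UserId, ZERO, widen and narrow are hypothetical names for illustration.
final class UserId {
    // Stand-in for the all-zero-bits default instance of a primitive class.
    static final UserId ZERO = new UserId(0);

    final int id; // pivot field: 0 marks the default value

    private UserId(int id) { this.id = id; }

    static UserId of(int id) {
        if (id == 0) throw new IllegalArgumentException("0 is reserved for the default");
        return new UserId(id);
    }

    boolean isDefault() { return id == 0; }

    // Widening (X.val to X.ref): the default value is converted to null.
    static UserId widen(UserId val) { return val.isDefault() ? null : val; }

    // Narrowing (X.ref to X.val): null is converted to the default value.
    static UserId narrow(UserId ref) { return ref == null ? ZERO : ref; }

    // Dereference check: using the default value throws an NPE.
    int value() {
        if (isDefault()) throw new NullPointerException("default UserId");
        return id;
    }
}
```

With this emulation, `UserId.widen(x) == null` plays the role of `x == null`, and only `value()` (the dereference) can fail.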

Classes which opt for NoGoodDefault will be slower than those that do 
not due to the check, but they will flatten. Essentially, this lets 
authors choose between "zero means default" and "zero means null", at 
some cost.

A risk here is that ignorant users who don't understand the tradeoffs 
will say "oh, great, there's my nullable primitive types", overuse them, 
and then say "primitive types are slow, java sucks."  The goal here 
would be to provide _safety_ for primitive types for which the default 
is dangerous.


On 3/15/2021 11:52 AM, Brian Goetz wrote:
> Picking this issue up again.  To summarize Dan's buckets:
>
> Bucket 1 -- the zero default is in the domain, and is a sensible 
> default value.  Zero for numerics, empty optionals.
>
> Bucket 2 -- there is a sensible default value, but all-zero-bits isn't 
> it.
>
> Bucket 3 -- there simply is no sensible default value.
>
>
> Ultimately, though, this is not about defaults; it is about 
> _uninitialized variables_.  The default only comes into play when the 
> user uses an uninitialized variable, which usually means (a) 
> uninitialized fields or (b) uninitialized array elements.  It is 
> possible that the language could give us seat belts to dramatically 
> narrow the chance of uninitialized fields, but uninitialized array 
> elements are much harder to stamp out.
>
> It is an attractive distraction to get caught up in designing 
> mechanisms for supplying an alternate default ("just let the user 
> declare a no-arg constructor"), but this is focusing on the "writing 
> code" part of the problem, not the "keeping code safe" part of the 
> problem.
>
> In some sense, it is the existence (and size) of Bucket 1 that causes 
> the problem; Bucket 1 is what gives us our sense that it is safe to 
> use uninitialized variables.  In the current language, uninitialized 
> reference variables are also safe in that if you use them before they 
> are initialized, you get an exception before anything bad can happen.  
> Uninitialized primitives in today's language are more dangerous, 
> because we may interpret the uninitialized value, but this has been a 
> problem we've been able to live with because today's primitives are 
> pretty limited and zero is usually a good-enough default in most 
> domains.  As we extend primitives to look more like objects, with 
> behavior, this gets harder.
>
>
> Both buckets 2 and 3 can be remediated without help from the language 
> or VM, perhaps inconveniently, by careful coding on the part of the 
> author of the primitive class:
>
>  - don't expose fields to users (a good practice anyway)
>  - check for zero on entry to each method
>
> These are options A and E.  The difference between Buckets 2 (A) and 3 
> (E) in this model is what do we do when we find a zero; for bucket 2, 
> we substitute some pre-baked value and use that, and for bucket 3, we 
> throw something (what we throw is a separate discussion.)  The various 
> remediation techniques Dan offers represent a menu which allows us to 
> trade off reliability/cost/intrusiveness.
>
> I think we should lean on the model currently implemented by reference 
> types, where _accessing_ an uninitialized field is OK, but _using_ the 
> value in the field is not. If we have:
>
>     String s;
>
> All of the following are fine:
>
>     String t = s;
>     if (s == null) { ... }
>     if (s == t) { ... }
>
> The thing that is not fine is s-dot-something.  These are the E/F/G 
> options, not the H/I options.
>
> Secondarily, H/I, which attempt to hide the default, create another 
> problem down the road: when we get to specialized generics, 
> `T.default` would become partial.
>
> Some of the solutions for Bucket 3 generalize well enough to Bucket 2 
> that we might consider merging them (though there are still messy 
> details).  Option F, for example, injects code at the top of each 
> method body:
>
>     int m() {
>         if (this == <zero-value>)
>             throw new NullPointerException();
>         /* body of m */
>     }
>
> A corresponding feature for Bucket 2 might inject slightly different 
> code:
>
>     int m() {
>         if (this == <zero-value>)
>             return <better-default>.m();
>         /* body of m */
>     }
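
The Bucket 2 variant of the injected guard can be written by hand in ordinary Java. The sketch below is an emulation only: `Scale` and `ONE` are hypothetical names, the guard is written manually rather than injected, and the zero value is detected by inspecting the single field.

```java
// Hand-written version of the guard a Bucket 2 feature might inject.
// Scale and ONE are hypothetical; factor == 0 models the all-zero value.
final class Scale {
    static final Scale ONE = new Scale(1); // stand-in for <better-default>

    private final int factor; // 0 only in the zero value

    Scale(int factor) { this.factor = factor; }

    private boolean isZeroValue() { return factor == 0; }

    int apply(int x) {
        if (isZeroValue())
            return ONE.apply(x); // delegate to the better default
        return x * factor;       // body of the method
    }
}
```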
>
>
> Another thing that has evolved since we started this discussion is 
> recognizing the difference between .val and .ref projections.  Imagine 
> you could declare your membership in bucket 3:
>
>     __bucket_3 primitive class NGD { ... }
>
> If, in addition to some way of generating an NPE on dereference (F, G, 
> etc), we mucked with the conversion of NGD.val to NGD.ref (which the 
> compiler can inject code on), we could actually put a null on top of 
> the stack.  Then, code like:
>
>     if (ngd == null) { ... }
>
> would actually work, because to do the comparison, we'd first promote 
> ngd to a reference type (null is already a reference), and we'd 
> compare two nulls.
>
>
>
> On 7/10/2020 2:23 PM, Dan Smith wrote:
>> Brian pointed out that my list of candidate inline classes in the Identity Warnings JEP (JDK-8249100) includes a number of classes that, despite being "value-based classes" and disavowing their identity, might not end up as inline classes. The problem? Default values.
>>
>> This might be a good time to revisit the open design issues surrounding default values and see if we can make some progress.
>>
>> Background/status quo: every inline class has a default instance, which provides the initial value of fields and array components that have the inline type (e.g., in 'new Point[10]'). It's also the prototype instance used to create all other instances (start with 'vdefault', then apply 'withfield' as needed). The default value is, by fiat, the class instance produced by setting all fields to *their* default values. Often, but not always, this means field/array initialization amounts to setting all the bits to 0. Importantly, no user code is involved in creating a default instance.
>>
>> Real code is always useful for grounding design discussions, so let's start there. Among the classes I listed as inline class candidates, we can put them in three buckets:
>>
>> Bucket #1: Have a reasonable default, as declared.
>> - wrapper classes (the primitive zeros)
>> - Optional & friends (empty)
>> - From java.time: Instant (start of 1970-01-01), LocalTime (midnight), Duration (0s), Period (0d), Year (1 BC, if that's acceptable)
>>
>> Bucket #2: Could have a reasonable default after re-interpreting fields.
>> - From java.time: LocalDate, YearMonth, MonthDay, LocalDateTime, ZonedDateTime, OffsetTime, OffsetDateTime, ZoneOffset, ZoneRegion, MinguoDate, HijrahDate, JapaneseDate, ThaiBuddhistDate (months and days should be nonzero; null Strings, ZoneIds, HijrahChronologies, and JapaneseEras require special handling)
>> - ListN, SetN, MapN (null array interpreted as empty)
>>
>> Bucket #3: No good default.
>> - Runtime.Version (need a non-null List<Integer>)
>> - ProcessHandleImpl (need a valid process ID)
>> - List12, Set12, Map1 (need a non-null value)
>> - All ConstantDesc implementations (need real class & method names, etc.)
>>
>> There's some subjectivity between the 2nd and 3rd buckets, but the idea behind the 2nd is that, with some translation layer between physical fields and interpretation of those fields, we can come up with an intuitive default (e.g., "0 means January"; "a null String means time zone 'UTC'"). In contrast, in the third bucket, any attempt to define a default value is going to be pretty unintuitive ("A null method name means 'toString'").
>>
>> The question here is how much work the JVM and language are willing to do, or how much work we're willing to ask clients to do, in order to support use cases that don't fall into Bucket #1.
>>
>> I don't think totally excluding Buckets #2 and #3 is a very good outcome. It means that, in many cases, inline classes need to be built up exclusively from primitives or other inline types, because if you use reference types, your default value will have a null field. (Sometimes, as in Optional, null fields have straightforward interpretations, but most of the time programs are designed to prevent them.)
>>
>> Whether we support Bucket #2 but not Bucket #3 is a harder question. It wouldn't be so bad if none of the examples above in Bucket #3 become inline classes—for the most part they're handled via interfaces, anyway. (Counterpoint: inline class instances that are immediately typed with interface types still potentially provide a performance boost.) But I'm also not sure this is representative. We've noted before that many use cases, like database records or data structure cursors, don't have meaningful defaults (what's a default mailing address?). The ConstantDesc classes really illustrate this, even though they happen to not be public.
>>
>> Another observation is that if we support Bucket #3 but not Bucket #2, that's probably not a big deal—I'm not sure anybody really *wants* to deal with the default instance; it's just the price you pay for being an inline class. If there's a way to opt out of that extra weirdness and move from Bucket #2 to Bucket #3, great.
>>
>> With that discussion in mind, here are some summaries of approaches we've considered, or that I think we ought to consider, for supporting buckets #2 and #3. (This is as best as I recall. If there's something I've missed, add it to the list!)
>>
>> [Weighing in for myself: my current preference is to do one of F, G, or I. I'm not that interested in supporting Bucket #2, for reasons given above, although Option A works for programmers who really want it.]
>>
>>
>>
>> === Solutions to support Bucket #2 ===
>>
>> Two broad strategies here: re-interpreting fields (A, B), and re-interpreting the default instance (C, D).
>>
>> ---
>>
>> Option A: Encourage programmers to re-interpret fields
>>
>> Guidance to programmers: when you declare an inline class, identify any fields for which the default instance should hold something other than zero/null; define a mapping for your implementation from zero/null to the value you want.
>>
>> One way to do this is to define a (possibly private) getter for each field, and include logic like 'return month + 1' or 'return id == null ? "UTC" : id'. Or maybe you inline that logic, as long as you're careful to do so everywhere. Importantly, you also need to reverse the logic in your constructor: for the sake of '==', if somebody manually creates the default instance, you should set fields to zero/null.
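
A minimal sketch of Option A in today's Java, assuming a made-up class: `ZonedMonth`, `DEFAULT`, and `sameState` are hypothetical names, and an ordinary class with a private no-arg constructor stands in for an inline class and its default instance.

```java
// Option A by hand: fields are stored so that zero/null is a sensible value.
// ZonedMonth is a hypothetical class, not part of java.time.
final class ZonedMonth {
    // Stand-in for the default instance (all fields zero/null).
    static final ZonedMonth DEFAULT = new ZonedMonth();

    private final int monthIndex; // zero-based, so 0 means January
    private final String zoneId;  // null means "UTC"

    private ZonedMonth() { this.monthIndex = 0; this.zoneId = null; }

    ZonedMonth(int month, String zone) {
        if (month < 1 || month > 12) throw new IllegalArgumentException("month: " + month);
        // Reverse the mapping so (1, "UTC") has the same fields as the default.
        this.monthIndex = month - 1;
        this.zoneId = "UTC".equals(zone) ? null : zone;
    }

    int month() { return monthIndex + 1; }

    String zone() { return zoneId == null ? "UTC" : zoneId; }

    // Field-by-field comparison, modeling '==' on inline class instances.
    boolean sameState(ZonedMonth o) {
        return monthIndex == o.monthIndex && java.util.Objects.equals(zoneId, o.zoneId);
    }
}
```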
>>
>> This doesn't work if you want public fields, but that's life as an OO programmer.
>>
>> In this approach, it would be important that inline classes be expected to document their default instance in Javadoc (perhaps with a new Javadoc tag)—the interpretation of the default instance is less apparent to users than "all zeros".
>>
>> Limitations:
>>
>> - It's a fairly error-prone approach. Programmers will absolutely forget to apply the mapping in one place, and everything will be fine until somebody tries to invoke a particular method on the default instance. Put that bug in a security-sensitive context, and maybe you have an exploit. (Something that could help some is choosing good names—call your field 'monthIndex', not plain 'month', to remind yourself that it's zero-based.)
>>
>> - Performance impact of an extra layer of computation on all field accesses. Probably not a big deal in general, but all those null checks, etc., could have a negative impact in certain contexts. And the *appearance* of extra cost might scare programmers away from doing the right thing ("eh, I probably won't use the default value anyway, I'll just ignore it to make my code faster").
>>
>> ---
>>
>> Option B: Language support for field re-interpretation
>>
>> The language allows inline classes to declare fields with mappings to/from an internal representation. Just like Option A, but with guarantees that the internal representation isn't inappropriately accessed directly.
>>
>> This pulls on a thread we explored a bit for Amber awhile back, some form of "abstract fields" or "virtual fields". Maybe there's something there, but it seems like a general-purpose feature, and one we're not likely to reach a final solution on anytime soon.
>>
>> ---
>>
>> Option C: Language support for a designated default
>>
>> The language provides some way for programmers to declare the "logical" default instance (something like a special static field). The compiler inserts a test for the "physical" default on any field/array access, and replaces it with the logical default.
>>
>> That is:
>>
>> Point p = points[3];
>>
>> compiles to
>>
>> Point p$0 = points[3];
>> Point p = (p$0 == [vdefault Point]) ? Point.DEFAULT : p$0;
>>
>> This is much less bug-prone than Option A—the compiler does all the work—and much more achievable in the short/medium term than Option B.
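
The translation above can be emulated in plain Java, with the check written as a helper rather than compiler-inserted. In this sketch `PHYSICAL_DEFAULT`, `newArray`, `read`, and `sameState` are hypothetical names, and the array is filled explicitly because an ordinary `Point[]` starts out holding nulls rather than flattened zero values.

```java
// Emulation of Option C's read translation; no inline classes are used.
final class Point {
    static final Point PHYSICAL_DEFAULT = new Point(0, 0); // stand-in for [vdefault Point]
    static final Point DEFAULT = new Point(1, 1);          // hypothetical logical default

    final int x, y;

    Point(int x, int y) { this.x = x; this.y = y; }

    boolean sameState(Point o) { return x == o.x && y == o.y; }

    // Models 'new Point[n]' for a flattened array: every element is the physical default.
    static Point[] newArray(int n) {
        Point[] points = new Point[n];
        java.util.Arrays.fill(points, PHYSICAL_DEFAULT);
        return points;
    }

    // The check the compiler would wrap around 'points[i]'.
    static Point read(Point[] points, int i) {
        Point p0 = points[i];
        return p0.sameState(PHYSICAL_DEFAULT) ? DEFAULT : p0;
    }
}
```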
>>
>> Compared to Option B, this pushes the computation overhead from inline class field accesses to reads of the inline type from fields/arrays. I don't know if that's good or bad—maybe a wash, heavily dependent on the use case.
>>
>> A few big problems:
>>
>> - The physical default still exists, and malicious bytecode can use it. If programmers want strong guarantees, they'll have to check and throw wherever an untrusted instance is provided. (Clients with access to the inline class's fields have to do so, too.)
>>
>> - Covariant arrays mean every read from any array type that might be flattened (Object[], Runnable[], ConstantDesc[], ...) has to go through translation logic.
>>
>> - There's an assumption here that the programmer doesn't intend to use the physical default as a valid non-default instance. That's hard for the compiler to enforce, and weird stuff happens in fields/arrays if the programmer doesn't prevent it. (Could be mitigated with extra implicit logic on field/array writes or in constructors.)
>>
>> ---
>>
>> Option D: JVM support for a designated default
>>
>> The VM allows inline classes to designate a logical default instance, and the field/array access instructions map from the physical default to the logical default. The 'vdefault' instruction produces the logical default instance; something else is used by the class's factories to build from the physical default.
>>
>> This addresses the first two problems with Option C—the VM gives strong guarantees, and can make the translation a virtual operation of certain arrays.
>>
>> To address the third problem, it seems like we'd need the more complex logic I hinted at: on writes, map the physical default to the logical default, and map the logical default to the physical default. Do the reverse on reads.
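
The write/read mapping described here is a swap that is its own inverse. A rough model, assuming an `int` stands in for an inline value and 0 for the physical default; `SwappingStore` and its methods are hypothetical names, not a proposed VM interface:

```java
// Model of the two-way mapping: ints stand in for inline values,
// 0 is the physical default, logicalDefault is chosen by the class author.
final class SwappingStore {
    private final int[] cells; // physical storage; fresh cells hold 0
    private final int logicalDefault;

    SwappingStore(int size, int logicalDefault) {
        this.cells = new int[size];
        this.logicalDefault = logicalDefault;
    }

    // Exchanges the physical and logical defaults; applying it twice is the identity.
    private int swap(int v) {
        if (v == 0) return logicalDefault;
        if (v == logicalDefault) return 0;
        return v;
    }

    void write(int i, int v) { cells[i] = swap(v); }

    int read(int i) { return swap(cells[i]); }
}
```

Because the same swap is applied on both sides, an untouched cell reads as the logical default, while an explicitly written physical default still reads back as itself.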
>>
>> The problem here is bytecode complexity/slowdowns. We've already added some complexity to 'aaload'/'aastore' (covariant flattened arrays), and anticipate similar changes to 'putfield'/'getfield' (specialized fields), so maybe that means we might as well do more. Or maybe it means we're already over budget. :-)
>>
>> From the users' perspective, if any performance reduction on reads/writes can be limited to the inline classes in Bucket #2, *all* the options have a similar cost, whether imposed by the programmer, language, or VM. So, to a first approximation, slower opcode execution is fine.
>>
>>
>>
>> === Solutions to support Bucket #3 ===
>>
>> Two broad strategies here: rejecting member accesses on the default instance (E, F, G), and preventing programs from ever seeing the default instance (H, I).
>>
>> ---
>>
>> Option E: Encourage programmers to guard against default instances
>>
>> Guidance to programmers: if you don't like your class's default instance, check for it in your methods and throw. Maybe Java SE defines a new RuntimeException to encourage this.
>>
>> The simple way to do this is with some boilerplate at the start of all your methods:
>>
>> if (this == MyClass.default) throw new InvalidDefaultException();
>>
>> More permissive classes could just do some validation on the fields that are relevant to a particular operation. (E.g., 'getMonth' doesn't care if 'zoneId' is null.)
>>
>> This doesn't work if you want public fields, but that's life as an OO programmer.
>>
>> It's not ideal that an invalid instance can float around a program until somebody trips on one of these checks, rather than detecting the invalid value earlier—we're propagating the NPE problem. And it takes some getting used to that there are two null-like values in the reference type's domain.
>>
>> ---
>>
>> Option F: Language support for default instance guards
>>
>> An inline class declaration can indicate that the default instance is invalid. The compiler generates guards, as in Option E, at the start of all instance method bodies, and perhaps on all field accesses outside of those methods.
>>
>> Programmers give up finer-grained control, but get more safety. I'm sure most would be happy with that trade.
>>
>> Improper/separately-compiled bytecode can skip the field access checks, but that's a minor concern.
>>
>> Same issues as Option E regarding adding a "new NPE" to the platform.
>>
>> ---
>>
>> Option G: JVM support for default instance guards
>>
>> Inline class files can indicate that their default instance is invalid. All attempts to operate on that instance (via field/method accesses, other than 'withfield') result in an exception.
>>
>> This tightens up Option F, making it just as impossible to access members of the default instance as it is to access members of 'null'.
>>
>> Same issues as Option E regarding adding a "new NPE" to the platform.
>>
>> ---
>>
>> Option H: Language checks on field/array reads
>>
>> An inline class declaration can indicate that the default instance is invalid. Every field and array access that may involve an uninitialized field/array component of that inline type gets augmented with a check that rejects reads of the default value (treating it as "you forgot to initialize this variable").
>>
>> That is:
>>
>> Point p = points[3];
>>
>> compiles to
>>
>> Point p$0 = points[3];
>> if (p$0 == [vdefault Point]) throw new UninitializedVariableException();
>> Point p = p$0;
>>
>> This is much like Option C, and has roughly the same advantages/problems. There's not a strong guarantee that the default value won't pop up from untrusted bytecode (or unreliable inline class authors), and lots of array types need guards.
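
The same read check can be demonstrated with a `long[]`, whose elements really are zero-initialized in today's Java. This is a sketch under assumptions: a `long` process ID stands in for a Bucket 3 inline value, `CheckedReads` is a hypothetical helper, and `UninitializedVariableException` is defined locally because no such class exists in Java SE.

```java
// Emulation of Option H's guarded read, using long as the "inline" type.
final class CheckedReads {
    // Hypothetical exception type, mirroring the name used in the example above.
    static final class UninitializedVariableException extends RuntimeException {
        UninitializedVariableException(String message) { super(message); }
    }

    // The check the compiler would insert around 'pids[i]'.
    static long read(long[] pids, int i) {
        long p0 = pids[i];
        if (p0 == 0L) // 0L is the default value of an uninitialized element
            throw new UninitializedVariableException("element " + i + " was never initialized");
        return p0;
    }
}
```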
>>
>> ---
>>
>> Option I: JVM checks on field/array reads
>>
>> Inline class files can indicate that their default instance is invalid. When reading from a field/array component of the inline type ('getfield'/'getstatic'/'aaload'), an exception is thrown if the default value is found (treating it as "you forgot to initialize this variable"). The 'vdefault' instruction, like 'withfield', is illegal outside of the inline class's nest.
>>
>> Better than Option H in that it can be optimized to occur on only certain reads, and in that it provides strong guarantees—only the inline class can ever "see" the default instance.
>>
>> Well, unless the inline class chooses to share that instance with the world. Not sure how we prevent that. But maybe at that point, anything bad/weird that happens is the author's own fault. (E.g., putting the default value in an array will make that component effectively "uninitialized" again.)
>>
>> Like Option D, there's a question of whether we're willing to add this complexity to the 'getfield'/'getstatic'/'aaload' instructions. My sense is that at least it's less complexity than you have in Option D.
>>
>


