Evolving CONSTANT_Class

Tue Jun 2 22:13:40 UTC 2020

We've had multiple discussions over the course of the project about how to evolve the CONSTANT_Class JVM constant pool entry to support new kinds of types, and perhaps address its existing shortcomings. For prototyping, we settled on sticking "QVal;" into CONSTANT_Class, with plans to revisit it later. I think it would be helpful at this point to develop a consensus plan.

[NB: this is a rich text email with some tables. If it doesn't come through properly, let me know and I'll post the tables somewhere.]

---

Some background reference material (if you want to do extra homework):

https://mail.openjdk.java.net/pipermail/amber-spec-experts/2020-May/002175.html
I discuss some of our thinking on the distinction between classes and class types.

https://mail.openjdk.java.net/pipermail/valhalla-spec-experts/2020-April/001288.html
Fred proposes migrating to an approach in which a CONSTANT_Class is always a descriptor string.

https://mail.openjdk.java.net/pipermail/valhalla-spec-experts/2019-March/000907.html
John describes CONSTANT_Descriptor entries, both for specialization and as an alternative target that a CONSTANT_Class can reference.

http://cr.openjdk.java.net/~dlsmith/lw2/lw2-20190628/specs/inline-classes-jvms.html#jvms-4.4.1
I specify the status quo treatment of CONSTANT_Classes that use Q descriptors.

https://mail.openjdk.java.net/pipermail/valhalla-spec-experts/2018-October/000770.html
Brian describes the constant pool entries used by Model 3 for specialization.

We've also had many internal discussions at Oracle about various string-like and tree-like descriptor encodings.

---

Uses of CONSTANT_Class

In Java SE 14, a CONSTANT_Class is a constant pool entry that wraps a string. Syntactically, the string can be one of two things: a binary name (encoded with slash separators), or an array type descriptor.
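To make the two syntactic forms concrete, here is a minimal Java sketch that tells them apart. The class and method names (ClassConstantSyntax, classify) are hypothetical and not part of any JDK API; the sketch relies only on the fact that a binary name cannot contain '[', so the first character is enough to decide.

```java
// Hypothetical helper, for illustration only: classifies the string wrapped
// by a Java SE 14 CONSTANT_Class entry.
final class ClassConstantSyntax {
    enum Kind { BINARY_NAME, ARRAY_DESCRIPTOR }

    static Kind classify(String s) {
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty CONSTANT_Class string");
        }
        // Array type descriptors start with '['; a binary name never does.
        return s.charAt(0) == '[' ? Kind.ARRAY_DESCRIPTOR : Kind.BINARY_NAME;
    }
}
```

For example, classify("java/lang/String") yields BINARY_NAME, while classify("[Ljava/lang/String;") and classify("[I") yield ARRAY_DESCRIPTOR.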

A Class entry can be referenced by a ClassFile structure, field/method references, a handful of instructions, and a few different attributes.

Conceptually, one of two different things is being modeled: sometimes, we want to represent a class or interface; other times, we want to represent a type. It's easy to conflate these two concepts, but it will be beneficial as we evolve the type system if we more carefully think of classes and interfaces as "declared entities" or "symbols" that can be referenced by certain types, rather than being types themselves.

Whether or not an array type is allowed in a certain context is a good clue about whether the CONSTANT_Class represents a class/interface or a type. When we contemplate other types that might be allowed in the contexts in which a CONSTANT_Class can appear, this becomes clearer.

(What other types are we adding? Inline (Q) types, obviously. Probably species types to represent particular specializations of classes. Primitive types might be nice. And, with specializable constant pools, type variables and array/species types with nested type variables.)

Here's a table listing all the type-flavored uses (where "X" means "allowed here" and "~" means "maybe not essential, but the semantics would be clear"):

                Class name  Array type   Inline type  Species  Primitive  Type Var  Nested tvar
Fieldref        X           X (useless)  ~            X        ~          ?         X
Methodref       X           X            ~            X        ~          ?         X

instanceof      X           X            X            X        ~          X         X
checkcast       X           X            X            X        ~          X         X
anewarray       X           X            X            X        ~          X         X
multianewarray              X                                                       X
ldc             X           X            X            X        ~          X         X
defaultvalue    ~           ~            X            ~        ~          X         ~

exception_table X           X (VE)       ~            X        ~          X         X
StackMapTable   X           X            X            X        ~          X         X

Exceptions      X           ~            ~            X        ~          X         X

In this context, whenever there's a class name, we implicitly treat it as the 'L' type of that class or interface.
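The implicit envelope can be sketched as a small conversion from the string in a CONSTANT_Class to an ordinary type descriptor. This is only an illustration of the rule stated above; ImplicitEnvelope and asTypeDescriptor are hypothetical names, and the sketch covers only the two Java SE 14 forms (bare name and array descriptor).

```java
// Hypothetical helper, for illustration only.
final class ImplicitEnvelope {
    // In a type-flavored context, a bare class name stands for the 'L' type
    // of that class or interface; an array descriptor is already a type.
    static String asTypeDescriptor(String classConstant) {
        if (classConstant.startsWith("[")) {
            return classConstant;
        }
        return "L" + classConstant + ";";
    }
}
```

So asTypeDescriptor("java/lang/Object") gives "Ljava/lang/Object;", and asTypeDescriptor("[I") is returned unchanged.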

Fieldref/Methodref are an anomaly: the current Valhalla design tries to maintain that their class_index is a class/interface reference, not a type (e.g., we prohibit inline types here), even though array types can also be used. We might be happier embracing the idea that the class_index of a Fieldref/Methodref is both the type of the first argument and a type to search for members. (I'm on the fence about this.)

multianewarray is an interesting case, where the only thing allowed is an array type.

And here are the class/interface flavored uses:

                Class name  Array type   Inline type  Species  Primitive  Type Var  Nested tvar
this_class      X
super_class     X                                     X                   ?         X
interfaces      X                                     X                   ?         X

new             X                                     X                             X

NestHost        X           X (useless)
NestMembers     X           X (useless)
InnerClasses    X
EnclosingMethod X
Module          X
ModuleMainClass X

Species are interesting, because they appear in both lists. One way to handle that is to treat 'super_class', 'interfaces', and 'new' as heavily-restricted type uses. Another way to handle it is to distinguish between a *species*, which is a class-like entity, and a *species type*. It's helpful to remember that there may be inline types of species (that is, a "Q envelope" of a species).

I didn't check the "Species" box for "this_class", but I could, depending on how we encode type parameters.

NestHost/NestMembers: The fact that an array type can appear as a meaningless NestHost, while we probably wouldn't want to support any other kind of type there, indicates to me that it's a bug that the array type doesn't cause a ClassFormatError.

In addition to all of the above, it's worth noting that types also appear frequently as field and method descriptor Utf8 constants, principally via field_info, method_info, and Fieldrefs/Methodrefs. The main difference between Class constants representing types and "naked" descriptors representing types is that the Class constants will typically be resolved.
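For readers less familiar with "naked" descriptors, the following sketch parses a method descriptor string with the standard java.lang.invoke.MethodType API. The wrapper class NakedDescriptorDemo is hypothetical. Note that this library call looks up the classes named in the descriptor, so it illustrates only the descriptor syntax, not how the JVM treats descriptor Utf8 constants.

```java
import java.lang.invoke.MethodType;

// Hypothetical wrapper, for illustration only.
final class NakedDescriptorDemo {
    static MethodType parse(String descriptor) {
        // Passing null as the loader means the system class loader is used
        // to find the classes named in the descriptor.
        return MethodType.fromMethodDescriptorString(descriptor, null);
    }
}
```

With the descriptor "(ILjava/lang/String;)V", the resulting MethodType has two parameters (int and String) and a void return type.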

So, summarizing the status quo: we have Class constants representing classes and interfaces (and maybe species), Class constants representing types, and naked descriptor strings representing types.

---

Lumping strategies

One way to encode class/interface and type references in the constant pool is to lump everything together under the CONSTANT_Class heading.

This was the strategy used initially in the design of the JVM: sometimes you might want to talk about a class *or* an array type, so we'll let CONSTANT_Class represent them both.

I'm not sure that was a good choice—we end up with some awkward overloading, illustrated by the tables above, and lots of reference-site restrictions saying "but you can't use an array type here"—but it's the legacy we've got.

I see two directions we can take this:

1) Treat everything in the class/interface table as a degenerate use of a type. A class name is always interpreted as an L type.

You end up with some degrees of freedom that don't really make sense (if an inline class is a nest host, do we require a Q type? do we just map whatever type we get, maybe even a species, to a class, and then work with that class?). But you also end up with a constant pool where there's never any ambiguity about what a Class constant represents (it's a type!).

2) Continue with the class vs. type overloading of Class constants.

This eliminates unwanted degrees of freedom, although there's a certain amount of complexity in managing the dual nature of Class constants.

In this space, a bare class name should be viewed as representing a class/interface. In type-flavored contexts, we interpret that class/interface as if it were an L type. In contexts that require a class/interface, any form of Class constant *other than* a bare name is a ClassFormatError.

It might be useful here to formally put Class constants in two buckets, based on the syntax: a "Class constant representing a type" or a "Class constant representing a class or interface".
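A rough sketch of what the format check for (2) might look like, assuming the two buckets are distinguished purely by syntax. Everything here (Strategy2Check, Context, isBareName) is hypothetical; the test for a "bare name" is an assumption based on the characters that cannot appear in a class name, and the prototype's "QVal;" spelling is used as the example of a non-bare form.

```java
// Hypothetical format check for strategy (2), for illustration only.
final class Strategy2Check {
    enum Context { CLASS_OR_INTERFACE, TYPE }

    // Assumption: a bare name is any non-empty string without '[', ';' or '.',
    // which rules out array descriptors and envelope forms such as "QVal;".
    static boolean isBareName(String s) {
        return !s.isEmpty()
                && s.indexOf('[') < 0
                && s.indexOf(';') < 0
                && s.indexOf('.') < 0;
    }

    static void check(Context context, String classConstant) {
        // Type-flavored contexts accept either bucket; class/interface
        // contexts accept only a bare name.
        if (context == Context.CLASS_OR_INTERFACE && !isBareName(classConstant)) {
            throw new ClassFormatError(
                    "expected a class or interface name: " + classConstant);
        }
    }
}
```

Under these assumptions, check(Context.TYPE, "QVal;") passes, check(Context.CLASS_OR_INTERFACE, "Val") passes, and check(Context.CLASS_OR_INTERFACE, "QVal;") throws ClassFormatError.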

Some concrete issues for lumping strategies (both (1) and (2)):

- When a Class constant is viewed as a type (for (1) that's always, for (2) that's for type-flavored references), the implicit L envelope is a historical wart. Do we also support explicit L descriptors? Do we try to migrate the world away from the implicit envelopes?

- Should we add primitive types? How are they spelled? (The standard descriptor syntax for primitives is already interpreted as a bare class name.)

- How do we handle type variables, both top-level and nested? Either we embed constant pool pointers in Utf8 entries (yuck!), or we need to extend Class constants to support references both to Utf8 entries and to [some new thing].

- Should we revisit "naked" descriptor references, allowing them to point to either bare Utf8 entries or Class constants and MethodType/[something else] constants? Do we try to migrate the world away from naked descriptor references?

---

Splitting strategies

Another direction we can go is to view Class constants as representing classes/interfaces, and introduce new constants to represent types.

This gives us a clean discipline for distinguishing between type uses and class/interface uses, and reduces the burden on format checking to have lots of constraints of the form "the CONSTANT_Class must refer to a string that represents ____." It uses the "type system" of the class file grammar to implicitly enforce those rules.

A couple of possible approaches in this space:

3) Introduce multiple constant structures to represent different kinds of types, and call them all "type constants".

So there's CONSTANT_ArrayType, CONSTANT_InlineClassType, CONSTANT_SpeciesType, CONSTANT_PrimitiveType, CONSTANT_TypeVariable (or CONSTANT_Hole), and probably CONSTANT_ReferenceClassType. All are referred to as "type constants" and can be referenced where types are expected.

A nice bonus is that, in defining these constants, we've also provided an encoding for types with nested type variables.
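To illustrate why (3) gives nested type variables for free, here is a hypothetical Java model of the type constants (assuming Java 16+ for records). The record names loosely mirror the constants listed above, but the fields are invented for this sketch and say nothing about an actual class file layout. Because each constant refers to other constants rather than to a string, a type variable can sit anywhere in the tree.

```java
import java.util.List;

// Hypothetical model of strategy (3), for illustration only.
interface TypeConstant {
    // True if a type variable appears anywhere in this type.
    boolean mentionsTypeVariable();
}

record ReferenceClassType(String className) implements TypeConstant {
    public boolean mentionsTypeVariable() { return false; }
}

record InlineClassType(String className) implements TypeConstant {
    public boolean mentionsTypeVariable() { return false; }
}

record PrimitiveType(char descriptor) implements TypeConstant {
    public boolean mentionsTypeVariable() { return false; }
}

record TypeVariable(int index) implements TypeConstant {
    public boolean mentionsTypeVariable() { return true; }
}

record ArrayType(TypeConstant component) implements TypeConstant {
    public boolean mentionsTypeVariable() {
        return component.mentionsTypeVariable();
    }
}

record SpeciesType(String className, List<TypeConstant> arguments)
        implements TypeConstant {
    public boolean mentionsTypeVariable() {
        return arguments.stream().anyMatch(TypeConstant::mentionsTypeVariable);
    }
}
```

For instance, new ArrayType(new SpeciesType("java/util/List", List.of(new TypeVariable(0)))) models an array of a species whose argument is a type variable, and mentionsTypeVariable() reports true for it without any string parsing.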

4) Introduce a CONSTANT_Type that mimics Class constants by pointing to a descriptor string.

This looks a lot like (2), above, other than the fact that we're explicitly distinguishing between type uses and class/interface uses, so we're not adding a bunch of new syntax to CONSTANT_Class, and CONSTANT_Type doesn't have to deal with the legacy of non-descriptor strings (e.g., 'I' means 'int').
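As a sketch of how a CONSTANT_Type string might be read under (4), assuming it is always a descriptor: the single letter 'I' is unambiguously int, and L and Q envelopes are always explicit. TypeDescriptors and describe are hypothetical names, the Q form follows the prototype's "QVal;" spelling, and type variables are deliberately left out because their encoding is an open question.

```java
// Hypothetical reader for CONSTANT_Type descriptor strings, for illustration only.
final class TypeDescriptors {
    static String describe(String d) {
        if (d.isEmpty()) {
            throw new IllegalArgumentException("empty descriptor");
        }
        char c = d.charAt(0);
        if (c == '[') {
            return "array of " + describe(d.substring(1));
        }
        if ((c == 'L' || c == 'Q') && d.length() > 2 && d.endsWith(";")) {
            String name = d.substring(1, d.length() - 1);
            return (c == 'L' ? "reference type of " : "inline type of ") + name;
        }
        if (d.length() == 1) {
            switch (c) {
                case 'B': return "byte";
                case 'C': return "char";
                case 'D': return "double";
                case 'F': return "float";
                case 'I': return "int";
                case 'J': return "long";
                case 'S': return "short";
                case 'Z': return "boolean";
            }
        }
        // Bare names such as "java/lang/String" are not descriptors.
        throw new IllegalArgumentException("not a type descriptor: " + d);
    }
}
```

Here describe("I") is "int" and describe("[QVal;") is "array of inline type of Val", while a bare name like "java/lang/String" is rejected rather than being given an implicit L envelope.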

Like (2), it requires something new to deal with type variables. We can use the same strategies (constant pool pointers in Utf8 entries, Type constants that point to *either* Utf8 entries or [some new thing]), or we can make Type the "new thing", supporting some sort of tree structure accompanying a descriptor string.

Some concrete issues for splitting strategies (both (3) and (4)):

- We've still got a legacy of Class constant references from type-flavored contexts. These represent either L types or array types. Do we try to migrate the world to use type constants instead?

- Again, we might want to revisit "naked" descriptor references, allowing them to point to either bare Utf8 entries or type constants and MethodType/[something else] constants. Do we try to migrate the world away from naked descriptor references?

---

I think that pretty well covers the design space. I'm interested in opinions on which strategy seems like the best fit, a sense of our appetite (as a community) for sweeping changes that affect how every class file gets compiled, any technical constraints that might push us in a particular direction, and any ideas/problems I left out.
