What's in a CONSTANT_Class?
    John Rose 
    john.r.rose at oracle.com
       
    Wed Jun  7 19:53:19 UTC 2017
    
    
  
(From today's discussion internally and with IBM.)
Dan Smith and Maurizio point out that a C_Class CP entry has many uses.
Some of them are type-like, and some are file-like.
Example file-like uses are this_class, InnerClasses, EnclosingMethod (each refers to a class-file).
Example type-like uses are ldc (for an arbitrary jl.Class mirror) and Fieldref (for now, assume no distinct vgetfield).
Definitions: A _verifier type_ is the ordered pair of a class name and a usage mode (or kind).
The class name is a C_Utf8 such as underlies a C_Class.  The mode is one of {L,Q,U} aka {ref,val,any}.
A _class-file_ is a singular body of bytecodes and metadata that translates (a portion of) a source file.
Two or three types with the same name might be loaded from one file, because of mode distinctions.
(Also in the future, many param-type species from a single template file.)
(Also, we could refactor array types and/or method descriptors as derived from nested types.  See Pack200.)
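To make the definitions concrete, here is a minimal Java sketch of a verifier type as a (name, mode) pair. All of the names here (VerifierTypeSketch, Mode, VerifierType, typesFor) are invented for illustration and are not part of any JVM spec or proposal; the "Q-Foo" print format just follows the notation used later in this note.

```java
import java.util.List;

// Hypothetical model of the definitions above; not taken from any spec.
final class VerifierTypeSketch {
    /** Usage mode: L = ref, Q = val, U = any. */
    enum Mode { L, Q, U }

    /** A verifier type: the ordered pair of a class name and a mode. */
    record VerifierType(String className, Mode mode) {
        @Override public String toString() { return mode + "-" + className; }
    }

    /** The distinct types that a single class-file name could give rise to. */
    static List<VerifierType> typesFor(String className) {
        return List.of(
            new VerifierType(className, Mode.L),
            new VerifierType(className, Mode.Q),
            new VerifierType(className, Mode.U));
    }

    public static void main(String[] args) {
        // One file name, up to three types.
        System.out.println(typesFor("Foo"));  // prints [L-Foo, Q-Foo, U-Foo]
    }
}
```

The point of the sketch is only that equality of types depends on both components, while the file is identified by the name alone.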
More file-like uses:  The head (template name) of a param-type species.
More type-like uses:  super_class/interfaces (extending a param-type species), annotation,
  catch_type, Exceptions, new, instanceof/checkcast.
Many of the type-like uses only make sense with L-mode (reference).  Obviously, since
today's JVM does not support Q/U modes, the question doesn't even come up.  In L-only cases,
we *could* say that the type-like use can refer to a file-like CP constant node, with a proviso
that the CP node has a default mode of L-mode.  That's not clean but might be desirable to ease
adoption and backward compatibility.
Which type-like uses extend to other modes?  The poster child is getfield[FieldRef[recv,NT[…]]]
where the mode of the getfield (Q/L/U) determines the type of the receiver on the stack, so must
be fully explicit (long before any class-file is loaded).  The "recv" substructure of this getfield must
carry the mode information.
(Quick aside:  We could carry the mode information in the bytecode only.  There are two
objections to this:  First, bytecode points are scarce and so we prefer overloading existing
code points.  Second, CP nodes are important cache points for information which quickens
bytecode execution.  If the mode information is *only* in the bytecode, it follows that the
quickening resources on the Fieldref node must serve *all modes*, which is potentially
an implementation challenge.  This is true even if we end up aligning the per-mode
layouts as much as possible, which for other reasons seems desirable.)
Another poster child for multi-mode type-like uses is ldc-of-jl.Class.  Plan of record is
to have one jl.Class mirror per mode, even though that means more than one per file.
(See forthcoming note about "secondary class mirrors".)  Given that jl.Class has this
type-oriented structure (not file-oriented), shouldn't CONSTANT_Class have the
same structure?  Maybe.  But this might also be the tail wagging the dog:  ldc-of-class
is less fundamental to JVM operation than Fieldref; it is a relatively recent introduction.
Going back to mode representation, there are several ways to make the mode
information available to the JVM bytecode that performs a getfield:
 1. Wrap a new CP node (a "mode node") around the file-oriented C_Class node - Q[Class["Foo"]]
 2. Insert a new CP node inside the type-oriented C_Class node - Class[Q["Foo"]] or Class[Q[File["Foo"]]]
 3. Use a different C_Class node per mode, distinguished by name mangling - Class[";QFoo;"]
 4. Use a different modal bytecode with the same CP node - vgetfield(some F) vs. getfield(the same F)
 5. Use a different file per mode, with the mode available after the file is loaded
These are approximately in order of preference.  The last one is a no-go because
it requires the verifier to load class files before verifying, which leads to vicious
bootstrapping loops.  Option 4 burns more code-points and not enough CP nodes.
Options 1-3 are the options where the CP structure contains the modal information.
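As a rough illustration of how options 1-3 differ structurally, the following Java sketch builds the three CP shapes for the same class and mode and prints them in the bracket notation used above. The node types (Utf8, ClassNode, ModeNode) and the ";<mode><name>;" mangling are stand-ins chosen for this sketch, not proposed tag names or a settled syntax.

```java
// Hypothetical CP node shapes for options 1-3; names are illustrative only.
final class ModeEncodingSketch {
    enum Mode { L, Q, U }

    interface Node {}
    record Utf8(String text) implements Node {}
    record ClassNode(Node name) implements Node {}            // Class[...]
    record ModeNode(Mode mode, Node body) implements Node {}  // Q[...], L[...], U[...]

    /** Option 1: the mode node wraps a file-oriented Class node. */
    static Node option1(Mode m, String name) {
        return new ModeNode(m, new ClassNode(new Utf8(name)));
    }

    /** Option 2: the mode node sits inside a type-oriented Class node. */
    static Node option2(Mode m, String name) {
        return new ClassNode(new ModeNode(m, new Utf8(name)));
    }

    /** Option 3: the mode is mangled into the name string (assumed syntax). */
    static Node option3(Mode m, String name) {
        return new ClassNode(new Utf8(";" + m + name + ";"));
    }

    /** Renders a node tree in the bracket notation of this note. */
    static String show(Node n) {
        if (n instanceof Utf8 u) return "\"" + u.text() + "\"";
        if (n instanceof ClassNode c) return "Class[" + show(c.name()) + "]";
        if (n instanceof ModeNode m) return m.mode() + "[" + show(m.body()) + "]";
        throw new IllegalArgumentException("unknown node: " + n);
    }

    public static void main(String[] args) {
        System.out.println(show(option1(Mode.Q, "Foo")));  // Q[Class["Foo"]]
        System.out.println(show(option2(Mode.Q, "Foo")));  // Class[Q["Foo"]]
        System.out.println(show(option3(Mode.Q, "Foo")));  // Class[";QFoo;"]
    }
}
```

In options 1 and 2 the mode is a separate, tagged piece of structure; in option 3 it can only be recovered by parsing the string.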
There are two problems with option 3, using mangled names.  First, it means that
a single class file might have two CP nodes that equally refer to it, which leads to
potential resolution bugs (and extra resolution work).  As a principle of CP design,
VM engineers would prefer that each resolvable reference to a named class file
reside in a unique CP node, which then provides the cache point for resolution
that other nodes derive their needed information from.  This is a desirable property
of the current JVM design we wish to keep.  Second, requiring the system to
demangle strings (";L…") to derive mode information will make it a little slower
and buggier; CP tags are a more central way of carrying mode information.
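A small Java sketch of both objections, under the assumption that a mangled name has the shape ";<mode><name>;" with mode one of L, Q, U (the note only gives ";QFoo;" as an example, so the exact syntax here is a guess). The helper names are invented. It counts how many Class nodes in a pool end up resolving the same file name.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of option 3 (mangled names); syntax is assumed.
final class ResolutionSiteSketch {
    /** Extracts the file name from an assumed mangled form such as ";QFoo;". */
    static String demangleName(String mangled) {
        boolean ok = mangled.length() >= 4
            && mangled.charAt(0) == ';'
            && mangled.endsWith(";")
            && "LQU".indexOf(mangled.charAt(1)) >= 0;
        if (!ok) throw new IllegalArgumentException("not a mangled name: " + mangled);
        return mangled.substring(2, mangled.length() - 1);
    }

    /** Extracts the mode character; this string parsing is the extra work tags avoid. */
    static char demangleMode(String mangled) {
        demangleName(mangled);  // reuse the shape check
        return mangled.charAt(1);
    }

    /** Counts, per file name, how many Class nodes would each have to resolve it. */
    static Map<String, Integer> resolutionSites(List<String> classNodeNames) {
        Map<String, Integer> sites = new TreeMap<>();
        for (String s : classNodeNames) {
            String file = s.startsWith(";") ? demangleName(s) : s;
            sites.merge(file, 1, Integer::sum);
        }
        return sites;
    }

    public static void main(String[] args) {
        // Option 3: two Class nodes, both resolving the file "Foo".
        System.out.println(resolutionSites(List.of(";QFoo;", ";LFoo;")));  // {Foo=2}
        // Option 1: the only Class node is Class["Foo"]; mode nodes point at it.
        System.out.println(resolutionSites(List.of("Foo")));               // {Foo=1}
    }
}
```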
(There are two major counter-examples to the "one resolution site" principle:
An array type constant of the form Class["[LFoo;"] resolves the class name "Foo".
Something like ArrayClass[Class["Foo"]] would be closer to the one-site principle.
Even worse, MethodType["(LFoo;)LBar;"] can have many class references.  Again,
it could be something like MethodType["(L)L", Class[Foo], Class[Bar]].  Maybe we
can get closer to this ideal later, as Pack200 does.  For now it is enough to note that
precedent against one-site resolution exists but need not drive future design.)
(Arrays are a counterexample to the principle of "use tags not mangling".  Again,
that choice need not drive the future design.  Arguably it caused bugs; take a look
at the toString method on an array or the getName method on an array class,
and VM engineers could tell stories of struggling with arrays in the early days.)
The remaining question is whether 1 or 2 is better:  Should we wrap a mode
node around a CONSTANT_Class, or should the Utf8 string of a Class be
replaced with a different node type that carries mode information?  From a
CP-centric point of view, the first option (Q[Class["Foo"]]) seems more natural.
But this pushes C_Class to the "file" role rather than the "type" role, which
causes problems for "ldc" and perhaps other use cases.
If someone has "dual citizenship" between the CP world and the reflective
world (in the JDK) then surely the cognitive dissonance between
CONSTANT_Class and java.lang.Class will grow.  There are two reasons
this is not a primary design-driving consideration:  First, only a few folks
are aware of both "worlds".  Second, we are dealing with the original choice
to use the word "class" for many concepts that in hindsight are distinct.
(As I like to say, "lumping" is a more Java-like design move than "splitting".)
Bringing in new modes at the reflective level has a natural fix in terms
of pseudo-classes like int.class (vs. Integer.class, a real class), which
is already out of phase with CONSTANT_Class.
So the option 1 proposal looks something like this:
1. Add a new CP node type to wrap around Class[Utf8["Foo"]] to denote Q-Foo.
Straw men: CONSTANT_Value[Class], CONSTANT_QMode[Class],
CONSTANT_Mode['Q', Class], CONSTANT_Type[Utf8["Q"], Class].
2. Anticipate U-mode and param-type as likely siblings to this design.
3. Anticipate the possibility of an L-mode sibling for symmetry.
4. Use a QMode node where a naked Class would otherwise imply L-mode.
5. Continue to use a naked Class where L-mode is unambiguous.
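The five points above can be sketched in Java roughly as follows. This is a toy model, not an implementation: ModeNode stands in for whichever straw-man tag is chosen (it is closest to CONSTANT_Mode['Q', Class]), and modeOf/fileOf are invented helper names.

```java
// Hypothetical model of option 1; all names are illustrative.
final class Option1Sketch {
    enum Mode { L, Q, U }

    interface Node {}
    record Utf8(String text) implements Node {}
    record ClassNode(Utf8 name) implements Node {}                    // Class[Utf8["Foo"]]
    record ModeNode(Mode mode, ClassNode target) implements Node {}   // e.g. Mode['Q', Class]

    /** The mode of a node used as a type: a naked Class implies L-mode. */
    static Mode modeOf(Node typeUse) {
        if (typeUse instanceof ModeNode m) return m.mode();
        if (typeUse instanceof ClassNode) return Mode.L;
        throw new IllegalArgumentException("not usable as a type: " + typeUse);
    }

    /** The single Class node that serves as the file resolution site. */
    static ClassNode fileOf(Node typeUse) {
        if (typeUse instanceof ModeNode m) return m.target();
        if (typeUse instanceof ClassNode c) return c;
        throw new IllegalArgumentException("not usable as a type: " + typeUse);
    }

    public static void main(String[] args) {
        ClassNode foo = new ClassNode(new Utf8("Foo"));
        Node qFoo = new ModeNode(Mode.Q, foo);
        System.out.println(modeOf(foo));          // L
        System.out.println(modeOf(qFoo));         // Q
        System.out.println(fileOf(qFoo) == foo);  // true
    }
}
```

Note that the Q-mode node and the naked node share one ClassNode instance, which is the "one resolution site" property in miniature.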
This seems like a reasonable short-term experiment.  It is likely there
are downsides to it which we will encounter as we experiment with it.
The implication is that a naked Class node means mainly the file, but
if you press it into service as a type, it sprouts the L-mode.
What about option 2, where Class nodes are *always* types, and only
secondarily refer to files?  It is more tricky than in option 1 to preserve
the "one file resolution site" design feature, since there would be several
Class nodes for one file.  We could address this straight-on by adding a
CONSTANT_ClassFile node, and deprecating the CONSTANT_Class
node as a carrier for a file-only reference.  (This would impact this_class,
InnerClasses, and other file-only uses of constants.)  In the general
non-legacy case, the substructure of CONSTANT_Class would have
to include both a ClassFile and some mode information.  A proposal
for that might look like this:
1. Add a new CP node type File[Utf8] to be the resolution of a class file.
2. Add a new binary CP node ModeAndFile (like NameAndType) to
carry both a mode and a file reference.
3. Allow (eventually require?) the Utf8 of a Class node to be replaced
by a ModeAndFile:  Class[ModeAndFile[Utf8["Q"],File[Utf8[""]]]].
4. Anticipate a variety of modes (Q/L/U) in the ModeAndFile structure,
including perhaps array and param-type species syntaxes.
5. Use Class nodes wherever types are required.
6. For compatibility, allow an abbreviated form Class[Utf8], at least
when it is the only reference to a class file in a CP.
(7. Bonus:  Maybe File is also useful as a reference to a resource
file?  We do need a way to import blocks of bit-data into CPs.
Current thinking is that inlining the bits like Utf8 is good enough.)
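For comparison with option 1, here is a similarly hedged Java sketch of the option 2 shapes in points 1-6. FileNode, ModeAndFile, and the helper methods are invented names. Treating the abbreviated Class[Utf8] form as L-mode is an assumption made for this sketch, based on the compatibility motivation in point 6.

```java
// Hypothetical model of option 2; all names are illustrative.
final class Option2Sketch {
    enum Mode { L, Q, U }

    interface Node {}
    record Utf8(String text) implements Node {}
    record FileNode(Utf8 name) implements Node {}                    // File[Utf8]
    record ModeAndFile(Utf8 mode, FileNode file) implements Node {}  // like NameAndType
    record ClassNode(Node body) implements Node {}                   // Class[ModeAndFile] or Class[Utf8]

    /** The mode of a Class node; the abbreviated form is assumed to mean L-mode. */
    static Mode modeOf(ClassNode c) {
        if (c.body() instanceof ModeAndFile mf) return Mode.valueOf(mf.mode().text());
        if (c.body() instanceof Utf8) return Mode.L;
        throw new IllegalArgumentException("malformed Class node: " + c);
    }

    /** The class-file name a Class node ultimately refers to. */
    static String fileNameOf(ClassNode c) {
        if (c.body() instanceof ModeAndFile mf) return mf.file().name().text();
        if (c.body() instanceof Utf8 u) return u.text();
        throw new IllegalArgumentException("malformed Class node: " + c);
    }

    public static void main(String[] args) {
        FileNode file = new FileNode(new Utf8("Foo"));  // the one resolution site
        ClassNode qFoo = new ClassNode(new ModeAndFile(new Utf8("Q"), file));
        ClassNode uFoo = new ClassNode(new ModeAndFile(new Utf8("U"), file));
        ClassNode legacy = new ClassNode(new Utf8("Foo"));
        System.out.println(modeOf(qFoo) + " " + fileNameOf(qFoo));      // Q Foo
        System.out.println(modeOf(uFoo) + " " + fileNameOf(uFoo));      // U Foo
        System.out.println(modeOf(legacy) + " " + fileNameOf(legacy));  // L Foo
    }
}
```

Here several Class nodes exist for one file, and the shared FileNode is what keeps a single resolution site.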
Comparing these options in detail makes me comfortable with
declaring that a CONSTANT_Class is *mainly* a file reference,
and *also* an L-mode type.  That is, it seems OK to go with
option 1 in the Minimal Value Type time frame, and even the
long term, until or unless we realize that option 2 (or some
undiscovered option) is better.
I should also say that Dan Smith, in writing the JVM spec. for this,
is producing additional evidence that points toward option 2, "Class
is a type not a file".  So we may well pivot in that direction after
our MVT experience is done.
— John
    
    