Improving the format of type annotation attributes
Alex Buckley
alex.buckley at oracle.com
Thu Sep 13 12:31:04 PDT 2012
Experts,
John Rose at Oracle owns the Pack200 spec and has generously given of
his time to analyze JSR 308's ClassFile attributes from a Pack200 point
of view. His analysis is below. Please review it carefully.
With regard to target_type, the change from u1 to u2 was my fault. I was
concerned that the number of contexts in the Java language where
annotations may appear will grow significantly in future, e.g. for
annotations on expressions. However, it seems likely that a u1 will
suffice to classify contexts, as indeed a u1 suffices for JVM bytecodes.
target_type will therefore be changed back to a u1.
Then we must make a choice. The low-risk approach is simply to move the
location information earlier in the type_annotation structure. The
high-risk approach is to drop the "counted" approach to location
information and adopt the "nested" location approach.
In either case, the physical content in type_annotations will be hidden
by the reflection APIs I proposed recently. Annotation consumers will
never see a location array _or_ a type_location tree. OTOH, it is fairly
clear that a strongly-typed type_location tree is more "beautiful" than
an untyped location array. Readability and ease of understanding are as
important for ClassFile attributes as for Java language features.
Your comments on which approach to take will be most welcome.
Alex
--- Mail from John Rose follows ---
I just read the updated type annotation spec (dated 7/03/2012) here:
http://types.cs.washington.edu/jsr308/specification/java-annotation-design.html
I also read the change log.
On the whole, the project is clearly converging toward a good design.
The annotation format is cleaner, as is the type annotation syntax. The
new receiver syntax is a good simplifying stroke, in my opinion, and the
detailed disambiguation rules about whole-method vs. return type
annotations feel clean and user-friendly.
In the light of the updated design, I'd like adjust the comments I made
last November.
1. I am still not certain whether Pack200 can handle the new attribute
layout proposed by JSR 308. In order to test this in a timely manner,
we will need a good sample of class files that exercises most or all the
degrees of freedom in the new attribute format. I think the best way
for us to get this is to piggy-back on whatever unit tests you have,
perhaps in the jsr308-langtools repository. Please advise.
2. The change of target_type from u1 to u2 seems like premature
optimization to me. The class file format is so relentlessly
byte-oriented that it is very hard to make use of any larger alignment.
In my 15 years of working with class files, I think I have seen only a
few cases where a regular u2 type was technically helpful to class file
parsing. I don't think JSR 308 will be one of those cases, because not
every element of the layout has an even byte count (parameter_index and
typearg_target are examples). Regarding future extensions, I don't
think the extra 8 bits matters even slightly: You are using 18*2 code
points out of a possible 256, which leaves 220 remaining, and in the
unlikely event that you need up to 50,000 new code points, you could
define a secondary target_type_2 field of 8 bits, present only when the
original target_type is not one of the original allowed values.
It seems to me that adding those extra 8 bits to every type annotation
does not buy any tangible benefit. (Pack200 will make them disappear,
but they will make class files slightly larger.)
3. I see that you still use the odd/even trick to handle nesting;
essentially the low bit of the target_type field gates whether or not
there is a trailing location record.
I would like to critique the location record format, which records a
tree branch within an occurrence of a complex type expression tree:
1. The length (tree depth) field is absurdly long at 16 bits. No type
expression tree has ever been so deep.
2. It is a "design smell" that the location_length field is never zero.
The natural "zero condition" is redundantly encoded elsewhere, making
for a confusing design.
3. The nodes in the tree branch are unlabeled. This means that the
source code must be inspected to determine whether the location is
within a wildcard, array type, etc.
4. Because the nodes in the branch are unlabeled, the 8-bit location
index must carry *either* branch depth *or* parameter index information;
in the case of nested types it refers to *both* kinds of information.
This leads to the fragile rule about imputing type parameters to erased
types.
5. The location array and its length are orphan fields, not belonging to
any particular structure. This requires an informal "hand wavey"
account of where they occur, which makes the design harder to understand
and implement.
Using a nested instead of a counted branch location format would remedy
flaws 1, 2, and 5, and would also remove the strange (from 2) asymmetry
by which zero-length branches are denoted by the even/odd hack instead
of by a zero count. By "nested" I mean the format called
"nested_target" below, where each node in the type tree branch is
introduced by a label (wildcard, array, nested type, type bound) and an
index (which could be elided for arrays).
I understand that even if a nested location format is theoretically the
best, the weight of existing implementations is committed to the counted
location format, and it would be awkward to change. In this case, I
would like to suggest a more modest change that would be slightly more
compact than your present design, and would be simpler to describe and
to parse by tools, including Pack200.
Specifically: Reduce the target_type field to 8 bits. Add the location
array immediately after the target_type field, with an 8-bit count
(usually zero). Put the polymorphic part of the target after both. The
result would be to move the low bit of the target_type into its own
byte, which would serve as the location array count (if non-zero, of
course). The new JSR 308 fields would then look like this:
u1 target_type; // the type of the targeted program element, see
Section 3.2
u1 location_length; // ++ADDED
u1 location[location_length]; // ++ADDED
union {
typeparam_target;
supertype_target;
typeparam_bound_target;
empty_target;
methodparam_target;
exception_target;
localvar_target;
catch_target;
offset_target;
typearg_target;
} target_info; // uniquely identifies the targeted program
element, see Section 3.3
The odd/even hack would be unnecessary in this case, although it could
be present redundantly in the target_type field. In that case, the
invariant would apply that the low bit of target_type is zero if and
only if the location_length byte is zero.
Note that in this design the location_length and location array are no
longer disembodied fields tacked onto a bunch of other layouts, but are
first-class members of the type_annotation structure. They could also
be moved after the target_info structure (which is where they are now,
de facto), although I think they are slightly easier to work with in the
place suggested above.
I guess my minimum request, from a viewpoint centered only on Pack200,
is to consider making the location_length field be either non-optional,
or else to have it gated by a smaller number of target_type tag values
(not the full 18, more like 1-4, as described below). But I believe
that, given the importance of type annotations as a technology for
enhancing static type checkers, it is worth all of our time to remove as
many design smells as possible.
Best wishes,
— John Rose
P.S. For completeness, here is what a fully nested (non-counted) layout
would look like, fixing problems 3 and 4 also:
// New field in JSR 308:
type_location location;
}
struct type_location { // ++ADDED recursive structure
u1 target_type; // the type of the targeted program element, see
Section 3.2
union {
typeparam_target;
supertype_target;
typeparam_bound_target;
empty_target;
methodparam_target;
exception_target;
localvar_target;
catch_target;
offset_target;
typearg_target;
// ++ADDED four more target_type values
// (but deleted the 18 odd ones from the July spec):
array_component_target = struct {
type_location outer_array_type; };
type_argument_target = struct {
u1 parameter_index; type_location outer_type_expression; };
nested_type_qualifier_target = struct {
u1 qualifier_index; type_location qualified_type; };
wildcard_bound_target = struct {
type_location outer_wildcard_expression; };
} target_info; // uniquely identifies the targeted program
// element, see Section 3.3
}
The type_location structure is recursive in a way that naturally
corresponds to the various location syntaxes. Note that only type
parameter edges absolutely require an extra index. The others can
select the proper level of structure (in array rank or nested type
pathname) simply by repeating edges; I kept qualifier index to show the
other option. The ordering of data in this scheme is the reverse of
that with a counted location, since the outermost location (such as a
field type) must come last, since its target_info is non-recursive.
Each occurrence of array_component_target selects a nested component
type (of exactly one less rank) from the enclosing array type. Each
occurrence of nested_type_qualifier selects one enclosing (qualifying)
type, selected by the qualifier index (origin and direction not
specified here).
More information about the type-annotations-spec-observers
mailing list