Javac Tree API suggestions to more closely model source code

Tue Sep 30 12:10:19 UTC 2014

Hello Harry,

Thanks for your feedback!

On 30.9.2014 03:19, Harry Terkelsen wrote:
> Hi Javac compiler devs,
>
> I am writing a Java formatter that makes use of the com.sun.source.tree
> API as well as the Javac lexer. I see that there is some interest in
> making the AST more closely model the actual source code
> (https://bugs.openjdk.java.net/browse/JDK-8024098). My formatter only
> changes the whitespace in between tokens, and so must have a completely
> accurate view of the original source code. Since I've written a pretty
> large application with the Tree API that must have an accurate picture
> of the original code, I have compiled a list of difficulties I had with
> the AST/lexer:
>
>
> Bugs:
> The lexer com.sun.tools.javac.parser.JavaTokenizer has very useful
> protected methods processComment(), processWhitespace(), and
> processLineTerminator() which can be overridden for lexers that care
> about preserving comments and whitespace. However, none of these 3
> useful methods are called immediately before the EOF. For example:
>
> class A {} //comment
>
> processComment() will not be called in this case (nor will
> processWhitespace() for the space before the comment).

I see that processComment() is not called when the (line) comment is
the very last token of the input, but processWhitespace() and
processLineTerminator() do seem to be called?

The processComment() call is presumably skipped to avoid unnecessary
processing: JavaTokenizer attaches comments to the following
significant token, and for a comment at the very end of the input there
will never be a token to which it could be attached. I'll investigate
whether this can be changed.

Please note that JavaTokenizer is not a (supported/public) API.

>
> Major pain points:
> The AST represents multivariable declarations ("int x, y;") as two
> separate variable declaration statements. This forced me to preprocess
> all statement lists, and combine consecutive variable declarations into
> one if SourcePositions#getStartPosition() was the same for the type node
> of the consecutive variable declarations.

Yes, I believe this is a known problem. I've added a note into JDK-8024098.
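
For reference, the workaround described above can be sketched roughly
as follows. This is only an illustration using the public
com.sun.source API, not the formatter's actual code: the class name
DeclarationGroups, the helper source(), and the file name Test.java
are made up for the example, and it groups the fields of top-level
classes rather than statements in a block.

```java
import com.sun.source.tree.ClassTree;
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.Tree;
import com.sun.source.tree.VariableTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.SourcePositions;
import com.sun.source.util.Trees;

import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class DeclarationGroups {
    /** Wraps a string as an in-memory source file (hypothetical helper). */
    static JavaFileObject source(final String code) {
        return new SimpleJavaFileObject(URI.create("string:///Test.java"),
                                        JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return code;
            }
        };
    }

    /** Groups consecutive fields whose type nodes start at the same offset. */
    static List<List<String>> group(String code) {
        JavacTask task = (JavacTask) ToolProvider.getSystemJavaCompiler()
                .getTask(null, null, null, null, null,
                         Collections.singletonList(source(code)));
        SourcePositions positions = Trees.instance(task).getSourcePositions();
        List<List<String>> groups = new ArrayList<>();
        try {
            for (CompilationUnitTree unit : task.parse()) {
                for (Tree decl : unit.getTypeDecls()) {
                    if (decl.getKind() != Tree.Kind.CLASS) continue;
                    long previousTypeStart = -1; // -1 means "no position"
                    for (Tree member : ((ClassTree) decl).getMembers()) {
                        if (member.getKind() != Tree.Kind.VARIABLE) {
                            previousTypeStart = -1;
                            continue;
                        }
                        VariableTree var = (VariableTree) member;
                        long typeStart =
                                positions.getStartPosition(unit, var.getType());
                        if (typeStart != -1 && typeStart == previousTypeStart) {
                            // Same type node as the previous declarator.
                            groups.get(groups.size() - 1)
                                  .add(var.getName().toString());
                        } else {
                            List<String> names = new ArrayList<>();
                            names.add(var.getName().toString());
                            groups.add(names);
                        }
                        previousTypeStart = typeStart;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return groups;
    }
}
```

With the input "class A { int x, y; int z; }" this should put x and y
into one group and z into another.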

>
> The ModifiersTree has pretty much no relationship to the original source
> code. Implicit non-annotation modifiers are added to the tree by the
> parser. Repeated modifiers are ignored. There is no way to tell the
> original order of the modifiers. For annotations, you know the original
> order they came in, but you don't know how they are interspersed with
> the modifiers.

While I understand why you need this information, my personal 
inclination would be that this should not be part of the AST as such, 
but should ideally be inferred from other sources (lexer tokens).

>
> Enum constants are desugared into variable declarations (with added
> implicit modifiers). Enum constants are represented as a VariableTree
> with the RHS being a NewClassTree, but you have to know which parts of
> the NewClassTree to pick out to reconstruct the original source of the
> enum constant. The only way to differentiate between an enum constant
> and an actual variable declaration inside the enum body is to cast to
> JCTree, which feels very hacky:
>        JCVariableDecl v = (JCVariableDecl) node;
>        return (v.mods.flags & ENUM) != 0;

Yes, too early desugaring of enums is a known problem (see JDK-8024098).

>
> Unary expressions involving literals are sometimes combined into just a
> literal, and sometimes they aren't. I haven't looked into this very
> deeply, but sometimes unary expressions with numeric literals will be
> represented as just a numeric literal, for instance "-1" will sometimes
> be UnaryTree and sometimes be LiteralTree. If it is LiteralTree the
> source position will be wrong, it will not include the "-". I guess this
> is to overcome the problem with -2147483648.

Yes, -2147483648 is the problem. Sorry, but I don't expect that
-<decimal-integer-number> would be changed to a UnaryTree+LiteralTree.
I checked the positions for "int i = -1" (and a few similar cases), and
the positions seemed reasonable to me. Is there sample code where the
positions are wrong?
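
As background on the -2147483648 case: the digits 2147483648 do not fit
into an int on their own, so the sign cannot simply be split off into a
separate unary minus applied to an int literal. The small sketch below
only illustrates that range issue with Integer.parseInt; it does not
use the compiler at all, and the class name MinIntLiteral is made up.

```java
class MinIntLiteral {
    /** Returns true if the given decimal text can be parsed as an int. */
    static boolean fitsInInt(String decimalText) {
        try {
            Integer.parseInt(decimalText);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The magnitude alone is one larger than Integer.MAX_VALUE ...
        System.out.println(fitsInInt("2147483648"));
        // ... but together with the sign it is exactly Integer.MIN_VALUE.
        System.out.println(fitsInInt("-2147483648"));
    }
}
```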

>
>
> Moderate pain points:
> Class, Enum, Interface, and annotation type declarations are all
> represented with ClassTree. Since the Kind is set for these it turns out
> not to be that big of a problem to disambiguate.

Sorry, I don't expect this would be changed. The Kind is the correct way 
to distinguish between the class, enum, interface and annotation type. 
(As a side note, instanceof is not a reliable way to check the actual 
type of the tree - Tree.Kind should be used to correctly detect the 
nature of the given tree).
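
A minimal sketch of checking Tree.Kind, using only the public
com.sun.source API. The class name KindDemo, the helper source(), and
the file name Test.java are made up for the example.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.Tree;
import com.sun.source.util.JavacTask;

import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class KindDemo {
    /** Wraps a string as an in-memory source file (hypothetical helper). */
    static JavaFileObject source(final String code) {
        return new SimpleJavaFileObject(URI.create("string:///Test.java"),
                                        JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return code;
            }
        };
    }

    /** Parses the code and returns the Kind of each top-level declaration. */
    static List<Tree.Kind> topLevelKinds(String code) {
        JavacTask task = (JavacTask) ToolProvider.getSystemJavaCompiler()
                .getTask(null, null, null, null, null,
                         Collections.singletonList(source(code)));
        List<Tree.Kind> kinds = new ArrayList<>();
        try {
            for (CompilationUnitTree unit : task.parse()) {
                for (Tree decl : unit.getTypeDecls()) {
                    // All four declaration forms are ClassTree instances;
                    // only getKind() tells them apart.
                    kinds.add(decl.getKind());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return kinds;
    }
}
```

For "class A {} enum E {} interface I {} @interface N {}" the kinds
should come out as CLASS, ENUM, INTERFACE and ANNOTATION_TYPE.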

>
> Cannot tell if an argument VariableTree is varargs. This required
> casting to JCTree, which feels hacky:
>        JCModifiers mods = (JCModifiers) node.getModifiers();
>        return (mods.flags & VARARGS) != 0;
>
>
> Minor pain points (small concrete syntax things):

Overall, I would be inclined not to include things like this in the AST
as such. Ideally, there would be a way to infer them from other sources.

> No difference between @Annotation() and @Annotation

Comparing the end position of the whole annotation ("@Annotation()")
with the end position of its type name ("Annotation") may make it
possible to distinguish between "@Annotation()" and "@Annotation".
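
A rough sketch of that comparison, assuming the end positions behave as
described (I have not checked every annotation form). The class name
AnnotationParens, the helper source(), and the file name Test.java are
made up for the example.

```java
import com.sun.source.tree.AnnotationTree;
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.SourcePositions;
import com.sun.source.util.TreeScanner;
import com.sun.source.util.Trees;

import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class AnnotationParens {
    /** Wraps a string as an in-memory source file (hypothetical helper). */
    static JavaFileObject source(final String code) {
        return new SimpleJavaFileObject(URI.create("string:///Test.java"),
                                        JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return code;
            }
        };
    }

    /** For each annotation in source order: does text follow the type name? */
    static List<Boolean> hasArgumentList(String code) {
        JavacTask task = (JavacTask) ToolProvider.getSystemJavaCompiler()
                .getTask(null, null, null, null, null,
                         Collections.singletonList(source(code)));
        final SourcePositions positions =
                Trees.instance(task).getSourcePositions();
        final List<Boolean> result = new ArrayList<>();
        try {
            for (final CompilationUnitTree unit : task.parse()) {
                new TreeScanner<Void, Void>() {
                    @Override
                    public Void visitAnnotation(AnnotationTree node, Void p) {
                        long annotationEnd =
                                positions.getEndPosition(unit, node);
                        long typeEnd = positions.getEndPosition(
                                unit, node.getAnnotationType());
                        // If the annotation ends where its type name ends,
                        // nothing (not even "()") follows the name.
                        result.add(annotationEnd != typeEnd);
                        return super.visitAnnotation(node, p);
                    }
                }.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

For "@Deprecated class A {} @Deprecated() class B {}" the first
annotation should be reported without an argument list and the second
one with.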

> Cannot tell if there is a trailing comma in array literal
> Cannot tell if there is trailing comma or semicolon after enum constants
> Cannot tell where array brackets are on variable declarations: int[] x
> vs int x[], int[] foo() vs int foo()[]
>
>
> Hope this helps!

Thanks!

Jan

> Harry