Javac Tree API suggestions to more closely model source code

Wed Oct 1 00:57:25 UTC 2014

Harry, Jan,

It is true that the AST as presented through the com.sun.source API does 
not fully represent the original input stream -- even allowing for the 
transformations that occur within the parser, such as noted in 
JDK-8024098. This is unlikely to change any time soon, and it is a 
standard response that some of the missing information can be inferred 
from other sources, like the original source text, or lexer tokens.

But it is also true that neither javac nor the com.sun.source API 
provides anything in between source text and an AST.

It would be interesting to explore what more could be provided in this 
general area.

-- Jon

On 09/30/2014 05:10 AM, Jan Lahoda wrote:
> Hello Harry,
>
> Thanks for your feedback!
>
> On 30.9.2014 03:19, Harry Terkelsen wrote:
>> Hi Javac compiler devs,
>>
>> I am writing a Java formatter that makes use of the com.sun.source.tree
>> API as well as the Javac lexer. I see that there is some interest in
>> making the AST more closely model the actual source code
>> (https://bugs.openjdk.java.net/browse/JDK-8024098). My formatter only
>> changes the whitespace in between tokens, and so must have a completely
>> accurate view of the original source code. Since I've written a pretty
>> large application with the Tree API that must have an accurate picture
>> of the original code, I have compiled a list of difficulties I had with
>> the AST/lexer:
>>
>>
>> Bugs:
>> The lexer com.sun.tools.javac.parser.JavaTokenizer has very useful
>> protected methods processComment(), processWhitespace(), and
>> processLineTerminator() which can be overridden for lexers that care
>> about preserving comments and whitespace. However, none of these 3
>> useful methods are called immediately before the EOF. For example:
>>
>> class A {} //comment
>>
>> processComment() will not be called in this case (nor will
>> processWhitespace() for the space before the comment).
>
> I see that processComment() is not called when the (line) comment is 
> the very last token of the input, but processWhitespace() and 
> processLineTerminator() seem to be called?
>
> The processComment() call is presumably skipped to avoid unnecessary 
> processing, as the JavaTokenizer attaches comments to the 
> following important token, and for a comment at the very end of the 
> input there will never be a token to which it should be attached. I'll 
> investigate if this can be changed.
>
> Please note that JavaTokenizer is not a (supported/public) API.
>
>>
>> Major pain points:
>> The AST represents multivariable declarations ("int x, y;") as two
>> separate variable declaration statements. This forced me to preprocess
>> all statement lists, and combine consecutive variable declarations into
>> one if SourcePositions#getStartPosition() was the same for the type node
>> of the consecutive variable declarations.
>
> Yes, I believe this is a known problem. I've added a note into 
> JDK-8024098.
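
The workaround Harry describes could be sketched roughly as below, using only the public com.sun.source API. This is a minimal sketch, not the formatter's actual code: the class name MultiVarGrouping, the method group(), and the in-memory "string:///A.java" file object are assumptions made for illustration. It groups declarations by the start offset of their type node, on the assumption that "int x, y;" yields two VariableTrees whose type nodes report the same start position.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.VariableTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.SourcePositions;
import com.sun.source.util.TreeScanner;
import com.sun.source.util.Trees;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

public class MultiVarGrouping {

    /** Parses the task's single compilation unit (hypothetical helper). */
    static CompilationUnitTree parse(JavacTask task) {
        try {
            return task.parse().iterator().next();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Groups VariableTrees whose type nodes start at the same source offset. */
    static List<List<VariableTree>> group(String code) {
        // In-memory source file; the URI is an arbitrary placeholder.
        JavaFileObject file = new SimpleJavaFileObject(
                URI.create("string:///A.java"), JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return code;
            }
        };
        JavacTask task = (JavacTask) ToolProvider.getSystemJavaCompiler()
                .getTask(null, null, null, null, null, List.of(file));
        SourcePositions positions = Trees.instance(task).getSourcePositions();
        CompilationUnitTree unit = parse(task);

        // Keyed by type start offset; insertion order follows source order.
        Map<Long, List<VariableTree>> byTypeStart = new LinkedHashMap<>();
        new TreeScanner<Void, Void>() {
            @Override
            public Void visitVariable(VariableTree decl, Void unused) {
                // Implicitly typed lambda parameters have no type node.
                if (decl.getType() != null) {
                    long start = positions.getStartPosition(unit, decl.getType());
                    if (start >= 0) {
                        byTypeStart.computeIfAbsent(start, k -> new ArrayList<>())
                                .add(decl);
                    }
                }
                return super.visitVariable(decl, unused);
            }
        }.scan(unit, null);
        return new ArrayList<>(byTypeStart.values());
    }
}
```

For "class A { void m() { int x, y; int z; } }" this should give two groups, the first holding x and y. Keying on the offset rather than tracking only the previous declaration avoids being confused by variables declared inside an initializer.
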
>
>>
>> The ModifiersTree has pretty much no relationship to the original source
>> code. Implicit non-annotation modifiers are added to the tree by the
>> parser. Repeated modifiers are ignored. There is no way to tell the
>> original order of the modifiers. For annotations, you know the original
>> order they came in, but you don't know how they are interspersed with
>> the modifiers.
>
> While I understand why you need this information, my personal 
> inclination would be that this should not be part of the AST as such, 
> but should ideally be inferred from other sources (lexer tokens).
>
>>
>> Enum constants are desugared into variable declarations (with added
>> implicit modifiers). Enum constants are represented as a VariableTree
>> with the RHS being a NewClassTree, but you have to know which parts of
>> the NewClassTree to pick out to reconstruct the original source of the
>> enum constant. The only way to differentiate between an enum constant
>> and an actual variable declaration inside the enum body is to cast to
>> JCTree, which feels very hacky:
>>        JCVariableDecl v = (JCVariableDecl) node;
>>        return (v.mods.flags & ENUM) != 0;
>
> Yes, too early desugaring of enums is a known problem (see JDK-8024098).
>
>>
>> Unary expressions involving literals are sometimes combined into just a
>> literal, and sometimes they aren't. I haven't looked into this very
>> deeply, but sometimes unary expressions with numeric literals will be
>> represented as just a numeric literal, for instance "-1" will sometimes
>> be UnaryTree and sometimes be LiteralTree. If it is LiteralTree the
>> source position will be wrong, it will not include the "-". I guess this
>> is to overcome the problem with -2147483648.
>
> Yes, -2147483648 is the problem. Sorry, but I don't expect that 
> -<decimal-integer-number> would be changed to a 
> UnaryTree+LiteralTree. I checked the positions for "int i = -1" (and a 
> few similar cases), and the positions seemed reasonable to me. Is 
> there sample code where the positions are wrong?
>
>>
>>
>> Moderate pain points:
>> Class, Enum, Interface, and annotation type declarations are all
>> represented with ClassTree. Since the Kind is set for these it turns out
>> not to be that big of a problem to disambiguate.
>
> Sorry, I don't expect this would be changed. The Kind is the correct 
> way to distinguish between the class, enum, interface and annotation 
> type. (As a side note, instanceof is not a reliable way to check the 
> actual type of the tree - Tree.Kind should be used to correctly detect 
> the nature of the given tree).
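
As an illustration of the Tree.Kind check Jan recommends, a minimal sketch follows. The class name ClassKinds, the methods describe() and describeTypes(), and the in-memory file object are hypothetical and exist only to make the example self-contained.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.Tree;
import com.sun.source.util.JavacTask;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

public class ClassKinds {

    /** Names the declaration using Tree.Kind rather than instanceof. */
    static String describe(Tree tree) {
        switch (tree.getKind()) {
            case CLASS:           return "class";
            case INTERFACE:       return "interface";
            case ENUM:            return "enum";
            case ANNOTATION_TYPE: return "annotation type";
            default:              return "other"; // anything else at top level
        }
    }

    /** Describes every top-level declaration in the given source text. */
    static List<String> describeTypes(String code) {
        // In-memory source file; the URI is an arbitrary placeholder.
        JavaFileObject file = new SimpleJavaFileObject(
                URI.create("string:///A.java"), JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return code;
            }
        };
        JavacTask task = (JavacTask) ToolProvider.getSystemJavaCompiler()
                .getTask(null, null, null, null, null, List.of(file));
        List<String> result = new ArrayList<>();
        try {
            for (CompilationUnitTree unit : task.parse()) {
                for (Tree decl : unit.getTypeDecls()) {
                    result.add(describe(decl));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```
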
>
>>
>> Cannot tell if an argument VariableTree is varargs. This required
>> casting to JCTree, which feels hacky:
>>        JCModifiers mods = (JCModifiers) node.getModifiers();
>>        return (mods.flags & VARARGS) != 0;
>>
>>
>> Minor pain points (small concrete syntax things):
>
> Overall, I would be inclined not to include things like this in the 
> AST as such. Ideally, there would be a way to infer them from other 
> sources.
>
>> No difference between @Annotation() and @Annotation
>
> Comparing the end positions of "@Annotation()" and "Annotation" may 
> make it possible to distinguish between "@Annotation()" and 
> "@Annotation".
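
Jan's suggestion might look something like the sketch below, which compares the end position of each AnnotationTree with the end position of its type name. It assumes that end positions are recorded for trees obtained through JavacTask.parse(), and that an annotation written without parentheses ends exactly where its type name ends. The names AnnotationParens and hasParens() are hypothetical.

```java
import com.sun.source.tree.AnnotationTree;
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.SourcePositions;
import com.sun.source.util.TreeScanner;
import com.sun.source.util.Trees;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

public class AnnotationParens {

    /** Parses the task's single compilation unit (hypothetical helper). */
    static CompilationUnitTree parse(JavacTask task) {
        try {
            return task.parse().iterator().next();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** For each annotation in source order: true if text follows its type name. */
    static List<Boolean> hasParens(String code) {
        // In-memory source file; the URI is an arbitrary placeholder.
        JavaFileObject file = new SimpleJavaFileObject(
                URI.create("string:///A.java"), JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return code;
            }
        };
        JavacTask task = (JavacTask) ToolProvider.getSystemJavaCompiler()
                .getTask(null, null, null, null, null, List.of(file));
        SourcePositions positions = Trees.instance(task).getSourcePositions();
        CompilationUnitTree unit = parse(task);

        List<Boolean> result = new ArrayList<>();
        new TreeScanner<Void, Void>() {
            @Override
            public Void visitAnnotation(AnnotationTree ann, Void unused) {
                long annEnd = positions.getEndPosition(unit, ann);
                long typeEnd = positions.getEndPosition(unit, ann.getAnnotationType());
                // "@Foo()" extends past the end of "Foo"; "@Foo" does not.
                result.add(annEnd > typeEnd);
                return super.visitAnnotation(ann, unused);
            }
        }.scan(unit, null);
        return result;
    }
}
```

The same comparison also reports true for annotations with arguments, so a caller that cares specifically about empty parentheses would additionally check that ann.getArguments() is empty.
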
>
>> Cannot tell if there is a trailing comma in array literal
>> Cannot tell if there is trailing comma or semicolon after enum constants
>> Cannot tell where array brackets are on variable declarations: int[] x
>> vs int x[], int[] foo() vs int foo()[]
>>
>>
>> Hope this helps!
>
> Thanks!
>
> Jan
>
>> Harry