Javac Tree API suggestions to more closely model source code

Tue Sep 30 01:19:36 UTC 2014

Hi Javac compiler devs,

I am writing a Java formatter that makes use of the com.sun.source.tree API
as well as the Javac lexer. I see that there is some interest in making the
AST more closely model the actual source code (
https://bugs.openjdk.java.net/browse/JDK-8024098). My formatter only
changes the whitespace in between tokens, and so must have a completely
accurate view of the original source code. Since I've written a pretty
large application with the Tree API that must have an accurate picture of
the original code, I have compiled a list of difficulties I had with the
AST/lexer:

Bugs:
The lexer com.sun.tools.javac.parser.JavaTokenizer has very useful
protected methods processComment(), processWhitespace(), and
processLineTerminator(), which can be overridden by lexers that care about
preserving comments and whitespace. However, none of these three methods is
called for comments or whitespace that immediately precede EOF. For example:

class A {} //comment

processComment() will not be called in this case (nor will
processWhitespace() for the space before the comment).

Major pain points:
The AST represents multivariable declarations ("int x, y;") as two separate
variable declaration statements. This forced me to preprocess all statement
lists and combine consecutive variable declarations into one whenever
SourcePositions#getStartPosition() returned the same value for their type
nodes.
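
A minimal sketch of the merging step described above, assuming a JDK whose
system compiler is available at runtime. The class name DeclarationGrouper
and the method groupLocals are hypothetical, and only local declarations
inside blocks are handled; fields would need the same treatment over
ClassTree#getMembers().

```java
import com.sun.source.tree.BlockTree;
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.StatementTree;
import com.sun.source.tree.Tree;
import com.sun.source.tree.VariableTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.SourcePositions;
import com.sun.source.util.TreeScanner;
import com.sun.source.util.Trees;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

// Hypothetical helper: regroups "int x, y;" after the parser has split it.
public class DeclarationGrouper {

    /** Returns the local variable names of each block, grouped by shared type position. */
    static List<List<String>> groupLocals(final String source) {
        // Wrap the source string in an in-memory file object.
        JavaFileObject file = new SimpleJavaFileObject(
                URI.create("string:///A.java"), JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return source;
            }
        };
        JavacTask task = (JavacTask) ToolProvider.getSystemJavaCompiler()
                .getTask(null, null, null, null, null, Collections.singletonList(file));
        final CompilationUnitTree unit;
        try {
            unit = task.parse().iterator().next(); // parse only, no attribution
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        final SourcePositions positions = Trees.instance(task).getSourcePositions();
        final List<List<String>> groups = new ArrayList<>();

        new TreeScanner<Void, Void>() {
            @Override
            public Void visitBlock(BlockTree block, Void unused) {
                long previousTypeStart = -1;
                for (StatementTree statement : block.getStatements()) {
                    long typeStart = -1;
                    if (statement instanceof VariableTree) {
                        VariableTree variable = (VariableTree) statement;
                        Tree type = variable.getType();
                        if (type != null) {
                            typeStart = positions.getStartPosition(unit, type);
                        }
                        String name = variable.getName().toString();
                        if (typeStart >= 0 && typeStart == previousTypeStart) {
                            // Same type node position: part of the previous declaration.
                            groups.get(groups.size() - 1).add(name);
                        } else {
                            List<String> group = new ArrayList<>();
                            group.add(name);
                            groups.add(group);
                        }
                    }
                    // Any other statement resets the run of declarations.
                    previousTypeStart = typeStart;
                }
                return super.visitBlock(block, unused); // nested blocks come afterwards
            }
        }.scan(unit, null);
        return groups;
    }
}
```

For "class A { void f() { int x, y; int z; } }" this is expected to put x
and y in one group and z in another.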

The ModifiersTree has pretty much no relationship to the original source
code. Implicit non-annotation modifiers are added to the tree by the
parser. Repeated modifiers are ignored. There is no way to tell the
original order of the modifiers. For annotations, you know the original
order they came in, but you don't know how they are interspersed with the
modifiers.
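
To illustrate the ordering problem, here is a small sketch (the names
ModifierOrder and describeFirstField are hypothetical) that compares what
ModifiersTree#getFlags() reports with the text actually written in the
source. Reading the raw text back through SourcePositions is only one
possible workaround, and it assumes the modifiers node has valid start and
end positions, so it does not help with implicit modifiers.

```java
import com.sun.source.tree.ClassTree;
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.ModifiersTree;
import com.sun.source.tree.VariableTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.SourcePositions;
import com.sun.source.util.Trees;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Collections;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

// Hypothetical helper: contrasts tree flags with the modifiers as written.
public class ModifierOrder {

    /**
     * Assumes the first type in the source has a field as its first member.
     * Returns { flags reported by the tree, modifier text as written }.
     */
    static String[] describeFirstField(final String source) {
        JavaFileObject file = new SimpleJavaFileObject(
                URI.create("string:///A.java"), JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return source;
            }
        };
        JavacTask task = (JavacTask) ToolProvider.getSystemJavaCompiler()
                .getTask(null, null, null, null, null, Collections.singletonList(file));
        CompilationUnitTree unit;
        try {
            unit = task.parse().iterator().next(); // parse only
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        SourcePositions positions = Trees.instance(task).getSourcePositions();

        ClassTree type = (ClassTree) unit.getTypeDecls().get(0);
        VariableTree field = (VariableTree) type.getMembers().get(0);
        ModifiersTree modifiers = field.getModifiers();

        // The flag set carries no source order; the source text does.
        int start = (int) positions.getStartPosition(unit, modifiers);
        int end = (int) positions.getEndPosition(unit, modifiers);
        return new String[] {
            modifiers.getFlags().toString(),
            source.substring(start, end)
        };
    }
}
```

For a field written as "final static public int x = 0;", the flag set is
expected to come back in the order of the javax.lang.model.element.Modifier
enum rather than in source order, while the substring keeps the order as
written.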

Enum constants are desugared into variable declarations (with added
implicit modifiers). Enum constants are represented as a VariableTree with
the RHS being a NewClassTree, but you have to know which parts of the
NewClassTree to pick out to reconstruct the original source of the enum
constant. The only way to differentiate between an enum constant and an
actual variable declaration inside the enum body is to cast to JCTree,
which feels very hacky:
      JCVariableDecl v = (JCVariableDecl) node;
      return (v.mods.flags & ENUM) != 0;

Unary expressions involving numeric literals are sometimes folded into just
a literal, and sometimes they aren't. I haven't looked into this very
deeply, but "-1", for instance, will sometimes be a UnaryTree and sometimes
a LiteralTree. When it is a LiteralTree, the source position is wrong: it
does not include the "-". I guess this is to overcome the problem with
-2147483648.
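
The -2147483648 problem can be shown without the compiler at all: the
digits 2147483648 on their own do not fit in an int, so a parser that
treats "-" and the digits as separate nodes cannot range-check the literal
in isolation. A small sketch, with the hypothetical names MinIntLiteral and
fitsInInt:

```java
// Hypothetical helper: shows why the sign matters when range-checking
// a decimal int literal.
public class MinIntLiteral {

    /** Returns true if the digits, optionally negated, fit in an int. */
    static boolean fitsInInt(String decimalDigits, boolean negated) {
        try {
            Integer.parseInt(negated ? "-" + decimalDigits : decimalDigits);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Integer.MAX_VALUE is 2147483647, so the bare digits overflow...
        System.out.println(fitsInInt("2147483648", false));
        // ...but with the sign attached the value is Integer.MIN_VALUE.
        System.out.println(fitsInInt("2147483648", true));
    }
}
```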

Moderate pain points:
Class, enum, interface, and annotation type declarations are all
represented with ClassTree. Since the Kind is set for each of these, it
turns out not to be that big of a problem to disambiguate.

There is no way to tell whether a parameter's VariableTree is varargs.
This required casting to JCTree, which feels hacky:
      JCModifiers mods = (JCModifiers) node.getModifiers();
      return (mods.flags & VARARGS) != 0;
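
For comparison, a sketch of one way to avoid the cast when running
attribution is acceptable: after JavacTask#analyze(), the method's
ExecutableElement exposes isVarArgs(). This is probably not suitable for a
parse-only formatter, since it needs the code to resolve; the names
VarargsCheck and varargsMethods are hypothetical.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.MethodTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreePathScanner;
import com.sun.source.util.Trees;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.lang.model.element.Element;
import javax.lang.model.element.ExecutableElement;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

// Hypothetical helper: finds varargs methods through the element model.
public class VarargsCheck {

    /** Returns the names of all methods in the source that are declared with varargs. */
    static List<String> varargsMethods(final String source) {
        JavaFileObject file = new SimpleJavaFileObject(
                URI.create("string:///A.java"), JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return source;
            }
        };
        JavacTask task = (JavacTask) ToolProvider.getSystemJavaCompiler()
                .getTask(null, null, null, null, null, Collections.singletonList(file));
        final List<String> names = new ArrayList<>();
        try {
            Iterable<? extends CompilationUnitTree> units = task.parse();
            task.analyze(); // attribution is required before elements exist
            final Trees trees = Trees.instance(task);
            for (CompilationUnitTree unit : units) {
                new TreePathScanner<Void, Void>() {
                    @Override
                    public Void visitMethod(MethodTree method, Void unused) {
                        Element element = trees.getElement(getCurrentPath());
                        if (element instanceof ExecutableElement
                                && ((ExecutableElement) element).isVarArgs()) {
                            names.add(method.getName().toString());
                        }
                        return super.visitMethod(method, unused);
                    }
                }.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```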

Minor pain points (small concrete syntax things):
No difference between @Annotation() and @Annotation
Cannot tell if there is a trailing comma in array literal
Cannot tell if there is trailing comma or semicolon after enum constants
Cannot tell where array brackets are on variable declarations: int[] x vs
int x[], int[] foo() vs int foo()[]

Hope this helps!
Harry