From eric.mccorkle at oracle.com Fri Jan 9 19:26:07 2015 From: eric.mccorkle at oracle.com (eric.mccorkle at oracle.com) Date: Fri, 09 Jan 2015 19:26:07 +0000 Subject: hg: compiler-grammar/jls8: Initial changes integrating antlr parser into build system Message-ID: <201501091926.t09JQ8QV028720@aojmv0008> Changeset: 46d7c54f212c Author: emc Date: 2015-01-09 14:23 -0500 URL: http://hg.openjdk.java.net/compiler-grammar/jls8/rev/46d7c54f212c Initial changes integrating antlr parser into build system ! common/autoconf/configure.ac ! common/autoconf/generated-configure.sh ! common/autoconf/jdk-options.m4 ! common/autoconf/libraries.m4 ! common/autoconf/spec.gmk.in ! make/CompileJavaModules.gmk ! make/common/SetupJavaCompilers.gmk From eric.mccorkle at oracle.com Fri Jan 9 19:26:24 2015 From: eric.mccorkle at oracle.com (eric.mccorkle at oracle.com) Date: Fri, 09 Jan 2015 19:26:24 +0000 Subject: hg: compiler-grammar/jls8/langtools: Interim patch fixing some parser bugs and integrating antlr parser Message-ID: <201501091926.t09JQORa028781@aojmv0008> Changeset: b7e4d595cab2 Author: emc Date: 2015-01-09 14:24 -0500 URL: http://hg.openjdk.java.net/compiler-grammar/jls8/langtools/rev/b7e4d595cab2 Interim patch fixing some parser bugs and integrating antlr parser ! make/gensrc/Gensrc-jdk.compiler.gmk ! src/jdk.compiler/share/classes/com/sun/tools/javac/antlr/AntlrParserFactory.java ! src/jdk.compiler/share/classes/com/sun/tools/javac/antlr/Java.g4 ! src/jdk.compiler/share/classes/com/sun/tools/javac/antlr/JavaTranslator.java ! src/jdk.compiler/share/classes/com/sun/tools/javac/jvm/Gen.java From eric.mccorkle at oracle.com Tue Jan 20 17:38:56 2015 From: eric.mccorkle at oracle.com (eric.mccorkle at oracle.com) Date: Tue, 20 Jan 2015 17:38:56 +0000 Subject: hg: compiler-grammar/jls8/langtools: Used lazy evaluation technique to avoid exponential tree walks. Message-ID: <201501201738.t0KHcuE3004740@aojmv0008> Changeset: c4b51c1a84ba Author: emc Date: 2015-01-20 12:37 -0500 URL: http://hg.openjdk.java.net/compiler-grammar/jls8/langtools/rev/c4b51c1a84ba Used lazy evaluation technique to avoid exponential tree walks. ! src/jdk.compiler/share/classes/com/sun/tools/javac/antlr/JavaTranslator.java From eric.mccorkle at oracle.com Thu Jan 22 15:10:10 2015 From: eric.mccorkle at oracle.com (eric.mccorkle at oracle.com) Date: Thu, 22 Jan 2015 15:10:10 +0000 Subject: hg: compiler-grammar/jls8/langtools: Various bug fixes. Compiler can now finish 'make', but 'make all' hangs. Message-ID: <201501221510.t0MFAFKl021015@aojmv0008> Changeset: a4835d398c3c Author: emc Date: 2015-01-22 10:09 -0500 URL: http://hg.openjdk.java.net/compiler-grammar/jls8/langtools/rev/a4835d398c3c Various bug fixes. Compiler can now finish 'make', but 'make all' hangs. ! src/jdk.compiler/share/classes/com/sun/tools/javac/antlr/JavaTranslator.java From eric.mccorkle at oracle.com Thu Jan 22 16:20:50 2015 From: eric.mccorkle at oracle.com (eric.mccorkle at oracle.com) Date: Thu, 22 Jan 2015 16:20:50 +0000 Subject: hg: compiler-grammar/jls8: Increase stack and memory Message-ID: <201501221620.t0MGKoo6003379@aojmv0008> Changeset: 95d07755cad4 Author: emc Date: 2015-01-22 11:20 -0500 URL: http://hg.openjdk.java.net/compiler-grammar/jls8/rev/95d07755cad4 Increase stack and memory ! common/autoconf/boot-jdk.m4 ! common/autoconf/generated-configure.sh From eric.mccorkle at oracle.com Thu Jan 22 18:05:39 2015 From: eric.mccorkle at oracle.com (eric.mccorkle at oracle.com) Date: Thu, 22 Jan 2015 13:05:39 -0500 Subject: Parser generator actions Message-ID: <54C13BF3.4080903@oracle.com> The following is a brief comparison of three techniques for specifying the actions performed by a parser generator in response to the input. This is motivated by the question of how best to maintain the ANTLR parser going forward. == Background == Lexer and Parser generators have proven themselves to be a useful tool for compiler development. The first and perhaps best-known set of tools, lex and yacc, are still in use today, and similar tools have been written for new languages (ml-lex/ml-yacc, ANTLR, alex/happy, and others). These tools are useful for a number of reasons, chief among them being that they allow developers to work directly on the grammar as opposed to writing a parsing procedure. Another key part of parser generators deals with how the generated parser produces results for the rest of the compiler. The following sections will cover three techniques, each of which is available in ANTLRv4 == Inline Actions == Inline actions are what most users of yacc-style tools are used to seeing, as they are the /only/ way to perform actions in most such tools. Inline actions embed fragments of code into the grammar itself. The following is an example: sum: sum '+' prod { return $1 + $3; } | prod { return $1; } ; prod: atom '*' prod { return $1 * $3; } | atom { return $1; } ; atom: '(' sum ')' { return $2; } | literal { return Integer.parseInt($1); } ; The actions enclosed in the parentheses are performed whenever the parser reduces by the given rule, and the pseudo-variables $n refer to the nth element of the rule. This method has the advantage that it has an implicit pattern-matching. Each action knows exactly which rule it's responding to, how many elements there are and what their types are, and so on. For example, in the rules for prod, we have separate cases for a binary product and a single atom. The disadvantage of this method is that it prevents the grammar itself from being re-usable. A common technique when using parser generators this way is either to publish a "bare" grammar, which programmers copy and decorate with their own actions. == Parse Trees == ANTLRv4 introduces two new approaches aimed to eliminate the need to write inline parser rules in the grammar, the goal being that the grammar description can be made independent of the actions performed by parsers. The first of these techniques is to have the parser emit a parse tree, where every terminal and nonterminal is represented by a node. Developers then write a visitor, which walks the parse tree and converts it into an AST. This does have the advantage of separating the actions from the grammar, but it also has a number of disadvantages. First, parse trees are a rather large data structure, which is produced and then immediately converted into an AST. This leads to high memory use and slower parsing times. Second, the visitor code does not benefit from the pattern-matching behavior of the inline rules. In the example above, the visitor would have one visitProd method for /both/ productions of prod; it would then need to analyze the tree to determine which production had indeed led to the node. For grammars for real languages, this can lead to complicated and delicate code. It is also very easy to make a mistake that leads to exponential translation times. Consider the following excerpt from the java grammar: primaryNoNewArray : literal | primOrTypeName ('[' ']')* '.' 'class' | 'this' | typeName '.' 'this' | '(' expression ')' | primaryNoNewArray '.' 'new' typeArguments? annotation* Identifier typeArgumentsOrDiamond? '(' argumentList? ')' classBody? | 'new' typeArguments? annotation* Identifier ('.' annotation* Identifier)* typeArgumentsOrDiamond? '(' argumentList? ')' classBody? | expressionName '.' 'new' typeArguments? annotation* Identifier typeArgumentsOrDiamond? '(' argumentList? ')' classBody? | arrayCreationExpression '.' 'new' typeArguments? annotation* Identifier typeArgumentsOrDiamond? '(' argumentList? ')' classBody? | primaryNoNewArray '.' Identifier | arrayCreationExpression '.' Identifier | 'super' '.' Identifier | typeName '.' 'super' '.' Identifier | primaryNoNewArray '[' expression ']' | expressionName '[' expression ']' | primaryNoNewArray '.' typeArguments? Identifier '(' argumentList? ')' | methodName '(' argumentList? ')' | typeName '.' typeArguments? Identifier '(' argumentList? ')' | expressionName '.' typeArguments? Identifier '(' argumentList? ')' | arrayCreationExpression '.' typeArguments? Identifier '(' argumentList? ')' | 'super' '.' typeArguments? Identifier '(' argumentList? ')' | typeName '.' 'super' '.' typeArguments? Identifier '(' argumentList? ')' | primaryNoNewArray '::' typeArguments? Identifier | expressionName '::' typeArguments? Identifier | referenceType '::' typeArguments? Identifier | arrayCreationExpression '::' typeArguments? Identifier | 'super' '::' typeArguments? Identifier | typeName '.' 'super' '::' typeArguments? Identifier | classType '::' typeArguments? 'new' | arrayType '::' 'new' ; As you can see, this rule is quite complex, and many of its productions have similar prefixes and/or suffixes. For many of the cases, we need to descend further down the tree in order to figure out which rule was reduced. Moreover, this rule is recursive, as it deals with invocation, member access, and array indexing. Because it is recursive, programmers must take great care to only ever visit a given subtree once, or else the visitor's runtime becomes exponential in the depth of the tree. In practice, it is quite easy to slip up and make this mistake (the prototype java ANTLR parser had at least one such bug). Moreover, the code that visits this parse tree node becomes quite complex, and it is easy to make other kinds of errors. The source of all these problems is that the parse tree visitor method discards the implicit pattern-matching. It bunches up cases into a single node, which then must be separated back out again. == Rule Listeners == ANTLRv4 provides another mechanism for performing actions, which captures the benefits of inline actions, while avoiding polluting the grammar. The visitor approach allows code that would be expressed in inline actions to be moved out to a listener class, whose methods are called when corresponding rules are reduced. Returning to our previous example, we would have a grammar looking like this: sum: sum '+' prod # SumBin | prod # SumOne ; prod: atom '*' prod # ProdBin | atom # ProdOne ; atom: '(' sum ')' # AtomParens | literal # AtomLiteral ; We would then have the following methods in our listener: public Integer visitSumBin(SumBinContext ctx) { return ctx.sum() + ctx.prod(); } public Integer visitSumOne(SumOneContext ctx) { return ctx.prod(); } public Integer visitProdBin(ProdBinContext ctx) { return ctx.prod() + ctx.atom(); } public Integer visitProdOne(ProdOneContext ctx) { return ctx.atom(); } public Integer visitAtomParens(AtomParensContext ctx) { return ctx.sum(); } public Integer visitProdOne(AtomLiteralContext ctx) { return Integer.parseInt(ctx.literal()); } This preserves the two key advantages of the inline rules: the implicit pattern matching done by the parser, and the direct construction of the AST. In the more complex rule shown in the visitor section, we still know the exact production we are seeing, and we know the exact number and types of the subtrees. Therefore, we avoid the complex and error-prone code, as well as the risk of exponential-time visitors that we had with the parse tree. == Summary == Rule listeners arguably are the best option, because they provide all of the benefits of inline rules, while avoiding pollution of the grammar files. Parse trees, by constrast, are simpler for small cases, but become more costly and error-prone as the size and complexity of a grammar increases. From forax at univ-mlv.fr Thu Jan 22 20:03:49 2015 From: forax at univ-mlv.fr (Remi Forax) Date: Thu, 22 Jan 2015 21:03:49 +0100 Subject: Parser generator actions In-Reply-To: <54C13BF3.4080903@oracle.com> References: <54C13BF3.4080903@oracle.com> Message-ID: <54C157A5.8070800@univ-mlv.fr> It seems I'm getting older and grumpy. There is still a problem with the way ANTLR v4 defined the listener. ANTLR use a parameterized interface to represent the listener which is not a good idea because it means that the result type of all actions must be the same. What's missing is a way to define the type of all non terminals and all terminals so different parts of the grammar can have different return type (it will also remove the need of boxing which is one limitation of current Java generics*) Note that all of this is old tech, each method of the listener defined an attribute grammar [1] with only one synthesized attribute. cheers, R?mi * at least for now. [1] https://en.wikipedia.org/wiki/Attribute_grammar On 01/22/2015 07:05 PM, eric.mccorkle at oracle.com wrote: > The following is a brief comparison of three techniques for specifying > the actions performed by a parser generator in response to the input. > This is motivated by the question of how best to maintain the ANTLR > parser going forward. > > == Background == > > Lexer and Parser generators have proven themselves to be a useful tool > for compiler development. The first and perhaps best-known set of > tools, lex and yacc, are still in use today, and similar tools have been > written for new languages (ml-lex/ml-yacc, ANTLR, alex/happy, and others). > > These tools are useful for a number of reasons, chief among them being > that they allow developers to work directly on the grammar as opposed to > writing a parsing procedure. > > Another key part of parser generators deals with how the generated > parser produces results for the rest of the compiler. The following > sections will cover three techniques, each of which is available in ANTLRv4 > > == Inline Actions == > > Inline actions are what most users of yacc-style tools are used to > seeing, as they are the /only/ way to perform actions in most such tools. > > Inline actions embed fragments of code into the grammar itself. The > following is an example: > > sum: sum '+' prod { return $1 + $3; } > | prod { return $1; } > ; > > prod: atom '*' prod { return $1 * $3; } > | atom { return $1; } > ; > > atom: '(' sum ')' { return $2; } > | literal { return Integer.parseInt($1); } > ; > > The actions enclosed in the parentheses are performed whenever the > parser reduces by the given rule, and the pseudo-variables $n refer to > the nth element of the rule. > > This method has the advantage that it has an implicit pattern-matching. > Each action knows exactly which rule it's responding to, how many > elements there are and what their types are, and so on. For example, in > the rules for prod, we have separate cases for a binary product and a > single atom. > > The disadvantage of this method is that it prevents the grammar itself > from being re-usable. A common technique when using parser generators > this way is either to publish a "bare" grammar, which programmers copy > and decorate with their own actions. > > == Parse Trees == > > ANTLRv4 introduces two new approaches aimed to eliminate the need to > write inline parser rules in the grammar, the goal being that the > grammar description can be made independent of the actions performed by > parsers. > > The first of these techniques is to have the parser emit a parse tree, > where every terminal and nonterminal is represented by a node. > Developers then write a visitor, which walks the parse tree and converts > it into an AST. > > This does have the advantage of separating the actions from the grammar, > but it also has a number of disadvantages. > > First, parse trees are a rather large data structure, which is produced > and then immediately converted into an AST. This leads to high memory > use and slower parsing times. > > Second, the visitor code does not benefit from the pattern-matching > behavior of the inline rules. In the example above, the visitor would > have one visitProd method for /both/ productions of prod; it would then > need to analyze the tree to determine which production had indeed led to > the node. > > For grammars for real languages, this can lead to complicated and > delicate code. It is also very easy to make a mistake that leads to > exponential translation times. Consider the following excerpt from the > java grammar: > > primaryNoNewArray > : literal > | primOrTypeName ('[' ']')* '.' 'class' > | 'this' > | typeName '.' 'this' > | '(' expression ')' > | primaryNoNewArray '.' 'new' typeArguments? annotation* Identifier > typeArgumentsOrDiamond? '(' argumentList? ')' classBody? > | 'new' typeArguments? annotation* Identifier ('.' annotation* > Identifier)* typeArgumentsOrDiamond? '(' argumentList? ')' classBody? > | expressionName '.' 'new' typeArguments? annotation* Identifier > typeArgumentsOrDiamond? '(' argumentList? ')' classBody? > | arrayCreationExpression '.' 'new' typeArguments? annotation* > Identifier typeArgumentsOrDiamond? '(' argumentList? ')' classBody? > | primaryNoNewArray '.' Identifier > | arrayCreationExpression '.' Identifier > | 'super' '.' Identifier > | typeName '.' 'super' '.' Identifier > | primaryNoNewArray '[' expression ']' > | expressionName '[' expression ']' > | primaryNoNewArray '.' typeArguments? Identifier '(' argumentList? ')' > | methodName '(' argumentList? ')' > | typeName '.' typeArguments? Identifier '(' argumentList? ')' > | expressionName '.' typeArguments? Identifier '(' argumentList? > ')' > | arrayCreationExpression '.' typeArguments? Identifier '(' > argumentList? ')' > | 'super' '.' typeArguments? Identifier '(' argumentList? ')' > | typeName '.' 'super' '.' typeArguments? Identifier '(' > argumentList? ')' > | primaryNoNewArray '::' typeArguments? Identifier > | expressionName '::' typeArguments? Identifier > | referenceType '::' typeArguments? Identifier > | arrayCreationExpression '::' typeArguments? Identifier > | 'super' '::' typeArguments? Identifier > | typeName '.' 'super' '::' typeArguments? Identifier > | classType '::' typeArguments? 'new' > | arrayType '::' 'new' > ; > > As you can see, this rule is quite complex, and many of its productions > have similar prefixes and/or suffixes. For many of the cases, we need > to descend further down the tree in order to figure out which rule was > reduced. > > Moreover, this rule is recursive, as it deals with invocation, member > access, and array indexing. Because it is recursive, programmers must > take great care to only ever visit a given subtree once, or else the > visitor's runtime becomes exponential in the depth of the tree. In > practice, it is quite easy to slip up and make this mistake (the > prototype java ANTLR parser had at least one such bug). Moreover, the > code that visits this parse tree node becomes quite complex, and it is > easy to make other kinds of errors. > > The source of all these problems is that the parse tree visitor method > discards the implicit pattern-matching. It bunches up cases into a > single node, which then must be separated back out again. > > == Rule Listeners == > > ANTLRv4 provides another mechanism for performing actions, which > captures the benefits of inline actions, while avoiding polluting the > grammar. The visitor approach allows code that would be expressed in > inline actions to be moved out to a listener class, whose methods are > called when corresponding rules are reduced. > > Returning to our previous example, we would have a grammar looking like > this: > > sum: sum '+' prod # SumBin > | prod # SumOne > ; > > prod: atom '*' prod # ProdBin > | atom # ProdOne > ; > > atom: '(' sum ')' # AtomParens > | literal # AtomLiteral > ; > > We would then have the following methods in our listener: > > public Integer visitSumBin(SumBinContext ctx) { > return ctx.sum() + ctx.prod(); > } > > public Integer visitSumOne(SumOneContext ctx) { > return ctx.prod(); > } > > public Integer visitProdBin(ProdBinContext ctx) { > return ctx.prod() + ctx.atom(); > } > > public Integer visitProdOne(ProdOneContext ctx) { > return ctx.atom(); > } > > public Integer visitAtomParens(AtomParensContext ctx) { > return ctx.sum(); > } > > public Integer visitProdOne(AtomLiteralContext ctx) { > return Integer.parseInt(ctx.literal()); > } > > This preserves the two key advantages of the inline rules: the implicit > pattern matching done by the parser, and the direct construction of the AST. > > In the more complex rule shown in the visitor section, we still know the > exact production we are seeing, and we know the exact number and types > of the subtrees. Therefore, we avoid the complex and error-prone code, > as well as the risk of exponential-time visitors that we had with the > parse tree. > > == Summary == > > Rule listeners arguably are the best option, because they provide all of > the benefits of inline rules, while avoiding pollution of the grammar files. > > Parse trees, by constrast, are simpler for small cases, but become more > costly and error-prone as the size and complexity of a grammar increases.