Parser generator actions

Thu Jan 22 20:03:49 UTC 2015

It seems I'm getting older and grumpy.

There is still a problem with the way ANTLR v4 defined the listener.
ANTLR use a parameterized interface to represent the listener which is 
not a good idea
because it means that the result type of all actions must be the same.

What's missing is a way to define the type of all non terminals and all 
terminals
so different parts of the grammar can have different return type
(it will also remove the need of boxing which is one limitation of 
current Java generics*)

Note that all of this is old tech, each method of the listener defined 
an attribute grammar [1]
with only one synthesized attribute.

cheers,
Rémi

* at least for now.
[1] https://en.wikipedia.org/wiki/Attribute_grammar

On 01/22/2015 07:05 PM, eric.mccorkle at oracle.com wrote:
> The following is a brief comparison of three techniques for specifying
> the actions performed by a parser generator in response to the input.
> This is motivated by the question of how best to maintain the ANTLR
> parser going forward.
>
> == Background ==
>
> Lexer and Parser generators have proven themselves to be a useful tool
> for compiler development.  The first and perhaps best-known set of
> tools, lex and yacc, are still in use today, and similar tools have been
> written for new languages (ml-lex/ml-yacc, ANTLR, alex/happy, and others).
>
> These tools are useful for a number of reasons, chief among them being
> that they allow developers to work directly on the grammar as opposed to
> writing a parsing procedure.
>
> Another key part of parser generators deals with how the generated
> parser produces results for the rest of the compiler.  The following
> sections will cover three techniques, each of which is available in ANTLRv4
>
> == Inline Actions ==
>
> Inline actions are what most users of yacc-style tools are used to
> seeing, as they are the /only/ way to perform actions in most such tools.
>
> Inline actions embed fragments of code into the grammar itself.  The
> following is an example:
>
> sum: sum '+' prod { return $1 + $3; }
>     | prod         { return $1; }
>     ;
>
> prod: atom '*' prod { return $1 * $3; }
>      | atom          { return $1; }
>      ;
>
> atom: '(' sum ')' { return $2; }
>      | literal     { return Integer.parseInt($1); }
>      ;
>
> The actions enclosed in the parentheses are performed whenever the
> parser reduces by the given rule, and the pseudo-variables $n refer to
> the nth element of the rule.
>
> This method has the advantage that it has an implicit pattern-matching.
>   Each action knows exactly which rule it's responding to, how many
> elements there are and what their types are, and so on.  For example, in
> the rules for prod, we have separate cases for a binary product and a
> single atom.
>
> The disadvantage of this method is that it prevents the grammar itself
> from being re-usable.  A common technique when using parser generators
> this way is either to publish a "bare" grammar, which programmers copy
> and decorate with their own actions.
>
> == Parse Trees ==
>
> ANTLRv4 introduces two new approaches aimed to eliminate the need to
> write inline parser rules in the grammar, the goal being that the
> grammar description can be made independent of the actions performed by
> parsers.
>
> The first of these techniques is to have the parser emit a parse tree,
> where every terminal and nonterminal is represented by a node.
> Developers then write a visitor, which walks the parse tree and converts
> it into an AST.
>
> This does have the advantage of separating the actions from the grammar,
> but it also has a number of disadvantages.
>
> First, parse trees are a rather large data structure, which is produced
> and then immediately converted into an AST.  This leads to high memory
> use and slower parsing times.
>
> Second, the visitor code does not benefit from the pattern-matching
> behavior of the inline rules.  In the example above, the visitor would
> have one visitProd method for /both/ productions of prod; it would then
> need to analyze the tree to determine which production had indeed led to
> the node.
>
> For grammars for real languages, this can lead to complicated and
> delicate code.  It is also very easy to make a mistake that leads to
> exponential translation times.  Consider the following excerpt from the
> java grammar:
>
> primaryNoNewArray
> 	: literal
> 	| primOrTypeName ('[' ']')* '.' 'class'
> 	| 'this'
> 	| typeName '.' 'this'
> 	| '(' expression ')'
> 	| primaryNoNewArray '.' 'new' typeArguments? annotation* Identifier
> typeArgumentsOrDiamond? '(' argumentList? ')' classBody?
> 	| 'new' typeArguments? annotation* Identifier ('.' annotation*
> Identifier)* typeArgumentsOrDiamond? '(' argumentList? ')' classBody?
>          | expressionName '.' 'new' typeArguments? annotation* Identifier
> typeArgumentsOrDiamond? '(' argumentList? ')' classBody?
>          | arrayCreationExpression '.' 'new' typeArguments? annotation*
> Identifier typeArgumentsOrDiamond? '(' argumentList? ')' classBody?
> 	| primaryNoNewArray '.' Identifier
> 	| arrayCreationExpression '.' Identifier
>          | 'super' '.' Identifier
>          | typeName '.' 'super' '.' Identifier
> 	| primaryNoNewArray '[' expression ']'
> 	| expressionName '[' expression ']'
> 	| primaryNoNewArray '.' typeArguments? Identifier '(' argumentList? ')'
> 	| methodName '(' argumentList? ')'
>          | typeName '.' typeArguments? Identifier '(' argumentList? ')'
>          | expressionName '.' typeArguments? Identifier '(' argumentList?
> ')'
>          | arrayCreationExpression '.' typeArguments? Identifier '('
> argumentList? ')'
>          | 'super' '.' typeArguments? Identifier '(' argumentList? ')'
>          | typeName '.' 'super' '.' typeArguments? Identifier '('
> argumentList? ')'
> 	| primaryNoNewArray '::' typeArguments? Identifier
> 	| expressionName '::' typeArguments? Identifier
>          | referenceType '::' typeArguments? Identifier
>          | arrayCreationExpression '::' typeArguments? Identifier
>          | 'super' '::' typeArguments? Identifier
>          | typeName '.' 'super' '::' typeArguments? Identifier
>          | classType '::' typeArguments? 'new'
>          | arrayType '::' 'new'
> ;
>
> As you can see, this rule is quite complex, and many of its productions
> have similar prefixes and/or suffixes.  For many of the cases, we need
> to descend further down the tree in order to figure out which rule was
> reduced.
>
> Moreover, this rule is recursive, as it deals with invocation, member
> access, and array indexing.  Because it is recursive, programmers must
> take great care to only ever visit a given subtree once, or else the
> visitor's runtime becomes exponential in the depth of the tree.  In
> practice, it is quite easy to slip up and make this mistake (the
> prototype java ANTLR parser had at least one such bug).  Moreover, the
> code that visits this parse tree node becomes quite complex, and it is
> easy to make other kinds of errors.
>
> The source of all these problems is that the parse tree visitor method
> discards the implicit pattern-matching.  It bunches up cases into a
> single node, which then must be separated back out again.
>
> == Rule Listeners ==
>
> ANTLRv4 provides another mechanism for performing actions, which
> captures the benefits of inline actions, while avoiding polluting the
> grammar.  The visitor approach allows code that would be expressed in
> inline actions to be moved out to a listener class, whose methods are
> called when corresponding rules are reduced.
>
> Returning to our previous example, we would have a grammar looking like
> this:
>
> sum: sum '+' prod # SumBin
>     | prod         # SumOne
>     ;
>
> prod: atom '*' prod # ProdBin
>      | atom          # ProdOne
>      ;
>
> atom: '(' sum ')' # AtomParens
>      | literal     # AtomLiteral
>      ;
>
> We would then have the following methods in our listener:
>
> public Integer visitSumBin(SumBinContext ctx) {
>      return ctx.sum() + ctx.prod();
> }
>
> public Integer visitSumOne(SumOneContext ctx) {
>      return ctx.prod();
> }
>
> public Integer visitProdBin(ProdBinContext ctx) {
>      return ctx.prod() + ctx.atom();
> }
>
> public Integer visitProdOne(ProdOneContext ctx) {
>      return ctx.atom();
> }
>
> public Integer visitAtomParens(AtomParensContext ctx) {
>      return ctx.sum();
> }
>
> public Integer visitProdOne(AtomLiteralContext ctx) {
>      return Integer.parseInt(ctx.literal());
> }
>
> This preserves the two key advantages of the inline rules: the implicit
> pattern matching done by the parser, and the direct construction of the AST.
>
> In the more complex rule shown in the visitor section, we still know the
> exact production we are seeing, and we know the exact number and types
> of the subtrees.  Therefore, we avoid the complex and error-prone code,
> as well as the risk of exponential-time visitors that we had with the
> parse tree.
>
> == Summary ==
>
> Rule listeners arguably are the best option, because they provide all of
> the benefits of inline rules, while avoiding pollution of the grammar files.
>
> Parse trees, by constrast, are simpler for small cases, but become more
> costly and error-prone as the size and complexity of a grammar increases.