Parser generator actions

Thu Jan 22 18:05:39 UTC 2015

The following is a brief comparison of three techniques for specifying
the actions performed by a parser generator in response to the input.
This is motivated by the question of how best to maintain the ANTLR
parser going forward.

== Background ==

Lexer and Parser generators have proven themselves to be a useful tool
for compiler development.  The first and perhaps best-known set of
tools, lex and yacc, are still in use today, and similar tools have been
written for new languages (ml-lex/ml-yacc, ANTLR, alex/happy, and others).

These tools are useful for a number of reasons, chief among them being
that they allow developers to work directly on the grammar as opposed to
writing a parsing procedure.

Another key part of parser generators deals with how the generated
parser produces results for the rest of the compiler.  The following
sections will cover three techniques, each of which is available in ANTLRv4

== Inline Actions ==

Inline actions are what most users of yacc-style tools are used to
seeing, as they are the /only/ way to perform actions in most such tools.

Inline actions embed fragments of code into the grammar itself.  The
following is an example:

sum: sum '+' prod { return $1 + $3; }
   | prod         { return $1; }
   ;

prod: atom '*' prod { return $1 * $3; }
    | atom          { return $1; }
    ;

atom: '(' sum ')' { return $2; }
    | literal     { return Integer.parseInt($1); }
    ;

The actions enclosed in the parentheses are performed whenever the
parser reduces by the given rule, and the pseudo-variables $n refer to
the nth element of the rule.

This method has the advantage that it has an implicit pattern-matching.
 Each action knows exactly which rule it's responding to, how many
elements there are and what their types are, and so on.  For example, in
the rules for prod, we have separate cases for a binary product and a
single atom.

The disadvantage of this method is that it prevents the grammar itself
from being re-usable.  A common technique when using parser generators
this way is either to publish a "bare" grammar, which programmers copy
and decorate with their own actions.

== Parse Trees ==

ANTLRv4 introduces two new approaches aimed to eliminate the need to
write inline parser rules in the grammar, the goal being that the
grammar description can be made independent of the actions performed by
parsers.

The first of these techniques is to have the parser emit a parse tree,
where every terminal and nonterminal is represented by a node.
Developers then write a visitor, which walks the parse tree and converts
it into an AST.

This does have the advantage of separating the actions from the grammar,
but it also has a number of disadvantages.

First, parse trees are a rather large data structure, which is produced
and then immediately converted into an AST.  This leads to high memory
use and slower parsing times.

Second, the visitor code does not benefit from the pattern-matching
behavior of the inline rules.  In the example above, the visitor would
have one visitProd method for /both/ productions of prod; it would then
need to analyze the tree to determine which production had indeed led to
the node.

For grammars for real languages, this can lead to complicated and
delicate code.  It is also very easy to make a mistake that leads to
exponential translation times.  Consider the following excerpt from the
java grammar:

primaryNoNewArray
	: literal
	| primOrTypeName ('[' ']')* '.' 'class'
	| 'this'
	| typeName '.' 'this'
	| '(' expression ')'
	| primaryNoNewArray '.' 'new' typeArguments? annotation* Identifier
typeArgumentsOrDiamond? '(' argumentList? ')' classBody?
	| 'new' typeArguments? annotation* Identifier ('.' annotation*
Identifier)* typeArgumentsOrDiamond? '(' argumentList? ')' classBody?
        | expressionName '.' 'new' typeArguments? annotation* Identifier
typeArgumentsOrDiamond? '(' argumentList? ')' classBody?
        | arrayCreationExpression '.' 'new' typeArguments? annotation*
Identifier typeArgumentsOrDiamond? '(' argumentList? ')' classBody?
	| primaryNoNewArray '.' Identifier
	| arrayCreationExpression '.' Identifier
        | 'super' '.' Identifier
        | typeName '.' 'super' '.' Identifier
	| primaryNoNewArray '[' expression ']'
	| expressionName '[' expression ']'
	| primaryNoNewArray '.' typeArguments? Identifier '(' argumentList? ')'
	| methodName '(' argumentList? ')'
        | typeName '.' typeArguments? Identifier '(' argumentList? ')'
        | expressionName '.' typeArguments? Identifier '(' argumentList?
')'
        | arrayCreationExpression '.' typeArguments? Identifier '('
argumentList? ')'
        | 'super' '.' typeArguments? Identifier '(' argumentList? ')'
        | typeName '.' 'super' '.' typeArguments? Identifier '('
argumentList? ')'
	| primaryNoNewArray '::' typeArguments? Identifier
	| expressionName '::' typeArguments? Identifier
        | referenceType '::' typeArguments? Identifier
        | arrayCreationExpression '::' typeArguments? Identifier
        | 'super' '::' typeArguments? Identifier
        | typeName '.' 'super' '::' typeArguments? Identifier
        | classType '::' typeArguments? 'new'
        | arrayType '::' 'new'
;

As you can see, this rule is quite complex, and many of its productions
have similar prefixes and/or suffixes.  For many of the cases, we need
to descend further down the tree in order to figure out which rule was
reduced.

Moreover, this rule is recursive, as it deals with invocation, member
access, and array indexing.  Because it is recursive, programmers must
take great care to only ever visit a given subtree once, or else the
visitor's runtime becomes exponential in the depth of the tree.  In
practice, it is quite easy to slip up and make this mistake (the
prototype java ANTLR parser had at least one such bug).  Moreover, the
code that visits this parse tree node becomes quite complex, and it is
easy to make other kinds of errors.

The source of all these problems is that the parse tree visitor method
discards the implicit pattern-matching.  It bunches up cases into a
single node, which then must be separated back out again.

== Rule Listeners ==

ANTLRv4 provides another mechanism for performing actions, which
captures the benefits of inline actions, while avoiding polluting the
grammar.  The visitor approach allows code that would be expressed in
inline actions to be moved out to a listener class, whose methods are
called when corresponding rules are reduced.

Returning to our previous example, we would have a grammar looking like
this:

sum: sum '+' prod # SumBin
   | prod         # SumOne
   ;

prod: atom '*' prod # ProdBin
    | atom          # ProdOne
    ;

atom: '(' sum ')' # AtomParens
    | literal     # AtomLiteral
    ;

We would then have the following methods in our listener:

public Integer visitSumBin(SumBinContext ctx) {
    return ctx.sum() + ctx.prod();
}

public Integer visitSumOne(SumOneContext ctx) {
    return ctx.prod();
}

public Integer visitProdBin(ProdBinContext ctx) {
    return ctx.prod() + ctx.atom();
}

public Integer visitProdOne(ProdOneContext ctx) {
    return ctx.atom();
}

public Integer visitAtomParens(AtomParensContext ctx) {
    return ctx.sum();
}

public Integer visitProdOne(AtomLiteralContext ctx) {
    return Integer.parseInt(ctx.literal());
}

This preserves the two key advantages of the inline rules: the implicit
pattern matching done by the parser, and the direct construction of the AST.

In the more complex rule shown in the visitor section, we still know the
exact production we are seeing, and we know the exact number and types
of the subtrees.  Therefore, we avoid the complex and error-prone code,
as well as the risk of exponential-time visitors that we had with the
parse tree.

== Summary ==

Rule listeners arguably are the best option, because they provide all of
the benefits of inline rules, while avoiding pollution of the grammar files.

Parse trees, by constrast, are simpler for small cases, but become more
costly and error-prone as the size and complexity of a grammar increases.