From eric.mccorkle at oracle.com  Fri Jan  9 19:26:07 2015
From: eric.mccorkle at oracle.com (eric.mccorkle at oracle.com)
Date: Fri, 09 Jan 2015 19:26:07 +0000
Subject: hg: compiler-grammar/jls8: Initial changes integrating antlr parser
	into build system
Message-ID: <201501091926.t09JQ8QV028720@aojmv0008>

Changeset: 46d7c54f212c
Author:    emc
Date:      2015-01-09 14:23 -0500
URL:       http://hg.openjdk.java.net/compiler-grammar/jls8/rev/46d7c54f212c

Initial changes integrating antlr parser into build system

! common/autoconf/configure.ac
! common/autoconf/generated-configure.sh
! common/autoconf/jdk-options.m4
! common/autoconf/libraries.m4
! common/autoconf/spec.gmk.in
! make/CompileJavaModules.gmk
! make/common/SetupJavaCompilers.gmk


From eric.mccorkle at oracle.com  Fri Jan  9 19:26:24 2015
From: eric.mccorkle at oracle.com (eric.mccorkle at oracle.com)
Date: Fri, 09 Jan 2015 19:26:24 +0000
Subject: hg: compiler-grammar/jls8/langtools: Interim patch fixing some parser
	bugs and integrating antlr parser
Message-ID: <201501091926.t09JQORa028781@aojmv0008>

Changeset: b7e4d595cab2
Author:    emc
Date:      2015-01-09 14:24 -0500
URL:       http://hg.openjdk.java.net/compiler-grammar/jls8/langtools/rev/b7e4d595cab2

Interim patch fixing some parser bugs and integrating antlr parser

! make/gensrc/Gensrc-jdk.compiler.gmk
! src/jdk.compiler/share/classes/com/sun/tools/javac/antlr/AntlrParserFactory.java
! src/jdk.compiler/share/classes/com/sun/tools/javac/antlr/Java.g4
! src/jdk.compiler/share/classes/com/sun/tools/javac/antlr/JavaTranslator.java
! src/jdk.compiler/share/classes/com/sun/tools/javac/jvm/Gen.java


From eric.mccorkle at oracle.com  Tue Jan 20 17:38:56 2015
From: eric.mccorkle at oracle.com (eric.mccorkle at oracle.com)
Date: Tue, 20 Jan 2015 17:38:56 +0000
Subject: hg: compiler-grammar/jls8/langtools: Used lazy evaluation technique
	to avoid exponential tree walks.
Message-ID: <201501201738.t0KHcuE3004740@aojmv0008>

Changeset: c4b51c1a84ba
Author:    emc
Date:      2015-01-20 12:37 -0500
URL:       http://hg.openjdk.java.net/compiler-grammar/jls8/langtools/rev/c4b51c1a84ba

Used lazy evaluation technique to avoid exponential tree walks.

! src/jdk.compiler/share/classes/com/sun/tools/javac/antlr/JavaTranslator.java


From eric.mccorkle at oracle.com  Thu Jan 22 15:10:10 2015
From: eric.mccorkle at oracle.com (eric.mccorkle at oracle.com)
Date: Thu, 22 Jan 2015 15:10:10 +0000
Subject: hg: compiler-grammar/jls8/langtools: Various bug fixes. Compiler can
	now finish 'make', but 'make all' hangs.
Message-ID: <201501221510.t0MFAFKl021015@aojmv0008>

Changeset: a4835d398c3c
Author:    emc
Date:      2015-01-22 10:09 -0500
URL:       http://hg.openjdk.java.net/compiler-grammar/jls8/langtools/rev/a4835d398c3c

Various bug fixes.  Compiler can now finish 'make', but 'make all' hangs.

! src/jdk.compiler/share/classes/com/sun/tools/javac/antlr/JavaTranslator.java


From eric.mccorkle at oracle.com  Thu Jan 22 16:20:50 2015
From: eric.mccorkle at oracle.com (eric.mccorkle at oracle.com)
Date: Thu, 22 Jan 2015 16:20:50 +0000
Subject: hg: compiler-grammar/jls8: Increase stack and memory
Message-ID: <201501221620.t0MGKoo6003379@aojmv0008>

Changeset: 95d07755cad4
Author:    emc
Date:      2015-01-22 11:20 -0500
URL:       http://hg.openjdk.java.net/compiler-grammar/jls8/rev/95d07755cad4

Increase stack and memory

! common/autoconf/boot-jdk.m4
! common/autoconf/generated-configure.sh


From eric.mccorkle at oracle.com  Thu Jan 22 18:05:39 2015
From: eric.mccorkle at oracle.com (eric.mccorkle at oracle.com)
Date: Thu, 22 Jan 2015 13:05:39 -0500
Subject: Parser generator actions
Message-ID: <54C13BF3.4080903@oracle.com>

The following is a brief comparison of three techniques for specifying
the actions performed by a parser generator in response to the input.
This is motivated by the question of how best to maintain the ANTLR
parser going forward.

== Background ==

Lexer and parser generators have proven themselves to be useful tools
for compiler development.  The first and perhaps best-known set of
tools, lex and yacc, are still in use today, and similar tools have been
written for new languages (ml-lex/ml-yacc, ANTLR, alex/happy, and others).

These tools are useful for a number of reasons, chief among them being
that they allow developers to work directly on the grammar as opposed to
writing a parsing procedure.

Another key part of parser generators deals with how the generated
parser produces results for the rest of the compiler.  The following
sections will cover three techniques, each of which is available in ANTLRv4.

== Inline Actions ==

Inline actions are what most users of yacc-style tools are used to
seeing, as they are the /only/ way to perform actions in most such tools.

Inline actions embed fragments of code into the grammar itself.  The
following is an example:

sum: sum '+' prod { return $1 + $3; }
   | prod         { return $1; }
   ;

prod: atom '*' prod { return $1 * $3; }
    | atom          { return $1; }
    ;

atom: '(' sum ')' { return $2; }
    | literal     { return Integer.parseInt($1); }
    ;

The actions enclosed in the braces are performed whenever the
parser reduces by the given rule, and the pseudo-variables $n refer to
the nth element of the rule.

This method has the advantage of implicit pattern matching.  Each
action knows exactly which rule it is responding to, how many elements
there are, what their types are, and so on.  For example, in the rules
for prod, we have separate cases for a binary product and a single
atom.
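
To make the behavior of the actions concrete, here is a minimal
hand-written sketch of an evaluator that computes the same values as
the grammar above.  It is not the output of any parser generator; the
class name InlineActionSketch and its helpers are made up for
illustration, the left-recursive sum rule is written as a loop, and
literals are assumed to be unsigned decimal integers.

```java
// Hypothetical hand-written equivalent of the grammar and actions above.
final class InlineActionSketch {
    private final String src;
    private int pos = 0;

    private InlineActionSketch(String src) {
        this.src = src.replace(" ", "");
    }

    static int eval(String text) {
        InlineActionSketch p = new InlineActionSketch(text);
        int value = p.sum();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return value;
    }

    // sum: sum '+' prod | prod   (left recursion written as a loop)
    private int sum() {
        int value = prod();                 // { return $1; }
        while (peek() == '+') {
            pos++;
            value = value + prod();         // { return $1 + $3; }
        }
        return value;
    }

    // prod: atom '*' prod | atom
    private int prod() {
        int left = atom();
        if (peek() == '*') {
            pos++;
            return left * prod();           // { return $1 * $3; }
        }
        return left;                        // { return $1; }
    }

    // atom: '(' sum ')' | literal
    private int atom() {
        if (peek() == '(') {
            pos++;
            int inner = sum();
            expect(')');
            return inner;                   // { return $2; }
        }
        int start = pos;
        while (Character.isDigit(peek())) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected literal at " + pos);
        }
        // { return Integer.parseInt($1); }
        return Integer.parseInt(src.substring(start, pos));
    }

    private char peek() {
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalArgumentException("expected '" + c + "' at " + pos);
        }
        pos++;
    }
}
```

Each branch corresponds to exactly one production, which is the
pattern matching described above: InlineActionSketch.eval("1+2*3")
yields 7 and InlineActionSketch.eval("(1+2)*3") yields 9.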

The disadvantage of this method is that it prevents the grammar itself
from being re-usable.  A common workaround when using parser generators
this way is to publish a "bare" grammar, which programmers copy and
decorate with their own actions.

== Parse Trees ==

ANTLRv4 introduces two new approaches aimed at eliminating the need to
write inline actions in the grammar, the goal being that the
grammar description can be made independent of the actions performed by
parsers.

The first of these techniques is to have the parser emit a parse tree,
where every terminal and nonterminal is represented by a node.
Developers then write a visitor, which walks the parse tree and converts
it into an AST.

This does have the advantage of separating the actions from the grammar,
but it also has a number of disadvantages.

First, parse trees are a rather large data structure, which is produced
and then immediately converted into an AST.  This leads to high memory
use and slower parsing times.

Second, the visitor code does not benefit from the pattern-matching
behavior of the inline rules.  In the example above, the visitor would
have one visitProd method for /both/ productions of prod; it would then
need to analyze the tree to determine which production had indeed led to
the node.
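
A minimal sketch of what this looks like in practice, assuming a
generic parse-tree node that carries only a label and a list of
children.  The Node class and the visit* methods here are hypothetical
and are not ANTLR's generated classes; they only illustrate how one
method per rule has to inspect the node to recover the production.

```java
import java.util.Arrays;
import java.util.List;

final class ParseTreeSketch {
    // Hypothetical node: a rule name (or token text) plus its children.
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }
    }

    static int visit(Node n) {
        switch (n.label) {
            case "sum":  return visitSum(n);
            case "prod": return visitProd(n);
            case "atom": return visitAtom(n);
            default:
                throw new IllegalArgumentException("not a rule node: " + n.label);
        }
    }

    // One method serves both productions of prod, so it must look at
    // the shape of the node to find out which one was used.
    static int visitProd(Node n) {
        if (n.children.size() == 3) {            // atom '*' prod
            return visit(n.children.get(0)) * visit(n.children.get(2));
        }
        return visit(n.children.get(0));         // atom
    }

    static int visitSum(Node n) {
        if (n.children.size() == 3) {            // sum '+' prod
            return visit(n.children.get(0)) + visit(n.children.get(2));
        }
        return visit(n.children.get(0));         // prod
    }

    static int visitAtom(Node n) {
        if (n.children.size() == 3) {            // '(' sum ')'
            return visit(n.children.get(1));
        }
        return Integer.parseInt(n.children.get(0).label);  // literal
    }

    // Helpers for building small trees by hand.
    static Node tok(String text) { return new Node(text); }
    static Node lit(String digits) { return new Node("atom", tok(digits)); }
}
```

Counting children is enough for this toy grammar.  For a rule with
many similar productions, the same kind of test turns into a chain of
checks on child types and token text.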

For grammars of real languages, this can lead to complicated and
delicate code.  It is also very easy to make a mistake that leads to
exponential translation times.  Consider the following excerpt from the
Java grammar:

primaryNoNewArray
    : literal
    | primOrTypeName ('[' ']')* '.' 'class'
    | 'this'
    | typeName '.' 'this'
    | '(' expression ')'
    | primaryNoNewArray '.' 'new' typeArguments? annotation* Identifier
        typeArgumentsOrDiamond? '(' argumentList? ')' classBody?
    | 'new' typeArguments? annotation* Identifier
        ('.' annotation* Identifier)* typeArgumentsOrDiamond?
        '(' argumentList? ')' classBody?
    | expressionName '.' 'new' typeArguments? annotation* Identifier
        typeArgumentsOrDiamond? '(' argumentList? ')' classBody?
    | arrayCreationExpression '.' 'new' typeArguments? annotation*
        Identifier typeArgumentsOrDiamond? '(' argumentList? ')' classBody?
    | primaryNoNewArray '.' Identifier
    | arrayCreationExpression '.' Identifier
    | 'super' '.' Identifier
    | typeName '.' 'super' '.' Identifier
    | primaryNoNewArray '[' expression ']'
    | expressionName '[' expression ']'
    | primaryNoNewArray '.' typeArguments? Identifier '(' argumentList? ')'
    | methodName '(' argumentList? ')'
    | typeName '.' typeArguments? Identifier '(' argumentList? ')'
    | expressionName '.' typeArguments? Identifier '(' argumentList? ')'
    | arrayCreationExpression '.' typeArguments? Identifier
        '(' argumentList? ')'
    | 'super' '.' typeArguments? Identifier '(' argumentList? ')'
    | typeName '.' 'super' '.' typeArguments? Identifier
        '(' argumentList? ')'
    | primaryNoNewArray '::' typeArguments? Identifier
    | expressionName '::' typeArguments? Identifier
    | referenceType '::' typeArguments? Identifier
    | arrayCreationExpression '::' typeArguments? Identifier
    | 'super' '::' typeArguments? Identifier
    | typeName '.' 'super' '::' typeArguments? Identifier
    | classType '::' typeArguments? 'new'
    | arrayType '::' 'new'
    ;

As you can see, this rule is quite complex, and many of its productions
have similar prefixes and/or suffixes.  For many of the cases, we need
to descend further down the tree in order to figure out which rule was
reduced.

Moreover, this rule is recursive, as it deals with invocation, member
access, and array indexing.  Because it is recursive, programmers must
take great care to visit any given subtree only once, or else the
visitor's runtime becomes exponential in the depth of the tree.  In
practice, it is quite easy to slip up and make this mistake (the
prototype Java ANTLR parser had at least one such bug).  Moreover, the
code that visits this parse tree node becomes quite complex, and it is
easy to make other kinds of errors.
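
The blow-up is easy to reproduce in isolation.  The following sketch
is not the code from JavaTranslator.java; it uses a made-up Node class
that models a left-nested chain such as a.b.c.d and counts how often
nodes are translated.  The first method translates the child twice per
level (once to classify the production, once to build the result); the
second caches each subtree's result, which is one simple way to ensure
that a subtree is only translated once.

```java
import java.util.IdentityHashMap;
import java.util.Map;

final class RevisitSketch {
    // Hypothetical chain-shaped tree: each node has at most one child.
    static final class Node {
        final Node child;
        Node(Node child) { this.child = child; }
    }

    // Number of times a node has actually been translated.
    static long visits = 0;

    // Buggy pattern: two recursive calls on the same child per level,
    // so the work doubles with every level of nesting.
    static int translateTwice(Node n) {
        visits++;
        if (n.child == null) {
            return 1;
        }
        int probe = translateTwice(n.child);    // "which production is this?"
        int result = translateTwice(n.child);   // "now build the result"
        return Math.max(probe, result) + 1;
    }

    // Same structure, but each subtree's result is computed at most once.
    static int translateOnce(Node n, Map<Node, Integer> cache) {
        Integer cached = cache.get(n);
        if (cached != null) {
            return cached;
        }
        visits++;
        int result = 1;
        if (n.child != null) {
            int probe = translateOnce(n.child, cache);
            int again = translateOnce(n.child, cache);  // cache hit
            result = Math.max(probe, again) + 1;
        }
        cache.put(n, result);
        return result;
    }

    static int translateOnce(Node n) {
        return translateOnce(n, new IdentityHashMap<Node, Integer>());
    }

    // Builds a chain with the given number of nodes.
    static Node chain(int depth) {
        Node n = null;
        for (int i = 0; i < depth; i++) {
            n = new Node(n);
        }
        return n;
    }
}
```

For a chain of 10 nodes, translateTwice performs 1023 translations
(2^10 - 1), while translateOnce performs 10.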

The source of all these problems is that the parse tree visitor method
discards the implicit pattern-matching.  It bunches up cases into a
single node, which then must be separated back out again.

== Rule Listeners ==

ANTLRv4 provides another mechanism for performing actions, which
captures the benefits of inline actions while avoiding polluting the
grammar.  The listener approach allows code that would be expressed in
inline actions to be moved out to a listener class, whose methods are
called when the corresponding rules are reduced.

Returning to our previous example, we would have a grammar looking like
this:

sum: sum '+' prod # SumBin
   | prod         # SumOne
   ;

prod: atom '*' prod # ProdBin
    | atom          # ProdOne
    ;

atom: '(' sum ')' # AtomParens
    | literal     # AtomLiteral
    ;

We would then have the following methods in our listener:

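// visit(...) is assumed to dispatch to the method for the child's
// labelled alternative and return its value.

// sum '+' prod
public Integer visitSumBin(SumBinContext ctx) {
    return visit(ctx.sum()) + visit(ctx.prod());
}

// prod
public Integer visitSumOne(SumOneContext ctx) {
    return visit(ctx.prod());
}

// atom '*' prod
public Integer visitProdBin(ProdBinContext ctx) {
    return visit(ctx.atom()) * visit(ctx.prod());
}

// atom
public Integer visitProdOne(ProdOneContext ctx) {
    return visit(ctx.atom());
}

// '(' sum ')'
public Integer visitAtomParens(AtomParensContext ctx) {
    return visit(ctx.sum());
}

// literal
public Integer visitAtomLiteral(AtomLiteralContext ctx) {
    return Integer.parseInt(ctx.literal().getText());
}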

This preserves the two key advantages of inline actions: the implicit
pattern matching done by the parser, and the direct construction of the AST.

In the more complex rule shown in the Parse Trees section, we still know
the exact production we are seeing, and we know the exact number and
types of the subtrees.  Therefore, we avoid the complex and error-prone
code, as well as the risk of exponential-time visitors that we had with
the parse tree.
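
As a rough, self-contained sketch of the callback style (without the
ANTLR runtime), the following uses a hypothetical ExprListener
interface with one method per labelled alternative and a stack that
holds the values of subtrees that have already been reduced.  The
interface, the Evaluator class, and the demo driver are assumptions
made for illustration; the exit* naming only mirrors the convention of
ANTLR 4's generated listeners, which receive a context object instead
of the bare arguments used here.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class ListenerSketch {
    // Hypothetical callback interface: one method per labelled
    // alternative that has work to do.  SumOne, ProdOne and AtomParens
    // pass their single value through, so they need no callback here.
    interface ExprListener {
        void exitSumBin();
        void exitProdBin();
        void exitAtomLiteral(String text);
    }

    // Keeps the values of already-reduced subtrees on a stack.
    static final class Evaluator implements ExprListener {
        private final Deque<Integer> stack = new ArrayDeque<>();

        @Override
        public void exitSumBin() {
            int right = stack.pop();
            int left = stack.pop();
            stack.push(left + right);
        }

        @Override
        public void exitProdBin() {
            int right = stack.pop();
            int left = stack.pop();
            stack.push(left * right);
        }

        @Override
        public void exitAtomLiteral(String text) {
            stack.push(Integer.parseInt(text));
        }

        int result() {
            return stack.peek();
        }
    }

    // Stand-in for the parser: fires the callbacks in the order in
    // which the productions for "1 + 2 * 3" would be reduced.
    static int demo() {
        Evaluator e = new Evaluator();
        e.exitAtomLiteral("1");
        e.exitAtomLiteral("2");
        e.exitAtomLiteral("3");
        e.exitProdBin();    // 2 * 3
        e.exitSumBin();     // 1 + 6
        return e.result();
    }
}
```

No callback has to inspect the shape of a node: the choice of
production has already been made by the time the method is called.
ListenerSketch.demo() returns 7.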

== Summary ==

Rule listeners arguably are the best option, because they provide all of
the benefits of inline actions, while avoiding pollution of the grammar files.

Parse trees, by contrast, are simpler for small cases, but become more
costly and error-prone as the size and complexity of a grammar increase.

From forax at univ-mlv.fr  Thu Jan 22 20:03:49 2015
From: forax at univ-mlv.fr (Remi Forax)
Date: Thu, 22 Jan 2015 21:03:49 +0100
Subject: Parser generator actions
In-Reply-To: <54C13BF3.4080903@oracle.com>
References: <54C13BF3.4080903@oracle.com>
Message-ID: <54C157A5.8070800@univ-mlv.fr>

It seems I'm getting older and grumpy.

There is still a problem with the way ANTLR v4 defines the listener.
ANTLR uses a parameterized interface to represent the listener, which
is not a good idea because it means that the result type of all actions
must be the same.

What's missing is a way to define the type of each nonterminal and
terminal, so that different parts of the grammar can have different
return types.  (It would also remove the need for boxing, which is one
limitation of current Java generics*.)
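
A small sketch of the difference, with made-up interfaces that are not
ANTLR's generated ones.  The isZero action stands for a hypothetical
rule whose natural result is a boolean; it is not part of the grammar
discussed in this thread.

```java
final class ResultTypeSketch {
    // Shape criticised above: a single type parameter covers every rule.
    interface UniformActions<T> {
        T literal(String text);
        T sum(T left, T right);
        T isZero(T operand);
    }

    // Alternative: each rule declares its own result type.
    interface TypedActions {
        int literal(String text);
        int sum(int left, int right);
        boolean isZero(int operand);
    }

    // With one T, integer and boolean results need a common supertype
    // (Object here), which forces boxing and casts.
    static final class Uniform implements UniformActions<Object> {
        @Override
        public Object literal(String text) {
            return Integer.valueOf(text);
        }

        @Override
        public Object sum(Object left, Object right) {
            return ((Integer) left) + ((Integer) right);
        }

        @Override
        public Object isZero(Object operand) {
            return ((Integer) operand) == 0;
        }
    }

    // With per-rule types, primitives can be used directly.
    static final class Typed implements TypedActions {
        @Override
        public int literal(String text) {
            return Integer.parseInt(text);
        }

        @Override
        public int sum(int left, int right) {
            return left + right;
        }

        @Override
        public boolean isZero(int operand) {
            return operand == 0;
        }
    }
}
```

In the uniform version a mistake such as passing the result of isZero
to sum only shows up as a ClassCastException at run time; in the typed
version it is a compile-time error.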

Note that all of this is old tech: the methods of the listener define
an attribute grammar [1] with only one synthesized attribute.

cheers,
Rémi

* at least for now.
[1] https://en.wikipedia.org/wiki/Attribute_grammar

