javac lexer parser rewrite

Thu Feb 9 05:17:16 PST 2012

-------- Original Message --------
From: Rémi Forax<forax at univ-mlv.fr>
Apparently from: compiler-dev-bounces at openjdk.java.net
To: compiler-dev at openjdk.java.net
Subject: Re: javac lexer parser rewrite
Date: Wed, 08 Feb 2012 17:55:33 +0100

> On 02/08/2012 04:52 PM, leszekp at safe-mail.net wrote:
> > Both a hand-coded parser and a generated one have advantages and disadvantages.
> > Of course it is good to have plain Java code and to have the possibility to debug it.
> >
> > But as the level of complication rises, at some point a hand-written parser becomes unmanageable anyway.
> 
> Have you taken a look at the source code?
> 
> The parser is readable mostly because most of the methods correspond to
> an LR state and because the dotted productions of that state
> are available in the documentation of the methods.

It depends on what you define as readable. For the creator of the javac parser it is probably very readable :)
but for a developer trying to experiment with it, it represents a high barrier to entry.

What I am trying to say is that a yacc-like specification (rules + attached actions)
would be more readable and easier to modify.

> 
> > For example, the Pascal language was designed to be LL(1), and a hand-written recursive descent
> > parser for this language is probably quite understandable. But Java wasn't designed that way,
> > and its hand-written parser has a lot of quirks which make it complicated to understand.
> 
> Sorry, Java 1.0 was designed to be LALR(1); there is some discussion in
> the JLS 1 about that.
> 
> The main issue is rather that the grammar of the spec was not updated
> correctly after that.
> This had some painful drawbacks, for example the fact that some parts
> of the grammar (the one allocating a generic inner class, for example)
> were missing from the parser when JDK 1.5 was released.
> 
> Also, while I know that the grammar of Java 5 is LALR, I have no idea whether
> the one of the upcoming Java 8 is still LALR.
> 
I realize that Java was designed to be LALR(1), at least initially.
LL(1) grammars fit well with a hand-made recursive descent implementation.
LALR grammars are much harder to implement by hand; that's why they are usually processed by parser generators.
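
To illustrate why LL(1) maps so directly onto hand-written code, here is a minimal sketch of a
recursive descent evaluator for a toy arithmetic grammar. The grammar, the class name ExprParser
and the method names are my own illustration and have nothing to do with javac. Each nonterminal
becomes one method, and a single character of lookahead decides which alternative to take.

```java
// Hypothetical toy example, not javac code. Grammar (LL(1)):
//   expr   ::= term   { ('+' | '-') term }
//   term   ::= factor { ('*' | '/') factor }
//   factor ::= NUMBER | '(' expr ')'
final class ExprParser {
    private final String src;
    private int pos = 0;

    private ExprParser(String src) {
        this.src = src;
    }

    /** Parses and evaluates the whole input; rejects trailing garbage. */
    static int evaluate(String text) {
        ExprParser p = new ExprParser(text);
        int value = p.expr();
        if (p.peek() != '\0') {
            throw new IllegalArgumentException("unexpected '" + p.peek() + "' at " + p.pos);
        }
        return value;
    }

    // One symbol of lookahead: skip blanks, return the next char or '\0' at end of input.
    private char peek() {
        while (pos < src.length() && src.charAt(pos) == ' ') {
            pos++;
        }
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    private int expr() {
        int value = term();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            int rhs = term();
            value = (op == '+') ? value + rhs : value - rhs;
        }
        return value;
    }

    private int term() {
        int value = factor();
        while (peek() == '*' || peek() == '/') {
            char op = src.charAt(pos++);
            int rhs = factor();
            value = (op == '*') ? value * rhs : value / rhs;
        }
        return value;
    }

    private int factor() {
        char c = peek();
        if (c == '(') {
            pos++;                       // consume '('
            int value = expr();
            if (peek() != ')') {
                throw new IllegalArgumentException("expected ')' at " + pos);
            }
            pos++;                       // consume ')'
            return value;
        }
        if (Character.isDigit(c)) {
            int value = 0;
            while (pos < src.length() && Character.isDigit(src.charAt(pos))) {
                value = value * 10 + (src.charAt(pos++) - '0');
            }
            return value;
        }
        throw new IllegalArgumentException("unexpected '" + c + "' at " + pos);
    }

    public static void main(String[] args) {
        System.out.println(evaluate("1 + 2 * (3 - 1)")); // prints 5
    }
}
```

With an LALR grammar there is no such one-method-per-nonterminal correspondence, because the
parser has to track sets of dotted productions rather than a single rule being expanded.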

> >
> > I am experimenting with a JFlex-generated Java lexer. It is very fast, comparable
> > to the original javac Scanner, so it is promising.
> 
> The problem is more the parser than the lexer.
> 
> A good project would be to write a parser generator that takes the Java
> grammar and generates the same code as the existing parser, by the way.

I agree with you. The Java parser is far more complicated than the lexer,
so cleaning it up would be more beneficial than rewriting the lexer.
But I see the project differently.
A good project would be to write an LALR grammar from which a parser generator produces a Java parser, such that:
1 The grammar should be more readable than the existing Java parser code
2 The generated parser should produce the same output (abstract syntax tree) for the same program
3 The generated parser should not be significantly slower, and ideally faster
4 Ideally it should report the same set of errors.
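
Point 2 could be checked mechanically. Below is a rough sketch, under my own assumptions, of a
differential test: the existing javac parser is driven through the public javax.tools and
com.sun.source APIs, its tree is pretty-printed to text, and the text is compared with what a
candidate parser produces. The ParserDiff class, the AstDumper interface and the sample corpus are
hypothetical names invented for this sketch; a generated parser does not exist yet, so in main
javac is simply compared with itself as a stand-in.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.util.JavacTask;

import java.net.URI;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

final class ParserDiff {
    /** Hypothetical plug-in point: anything that turns source text into a textual AST dump. */
    interface AstDumper {
        String dump(String source) throws Exception;
    }

    /** Reference dumper: parse (no attribution) with the running JDK's javac and print the tree. */
    static String javacDump(final String source) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // null if no compiler is present
        if (compiler == null) {
            throw new IllegalStateException("no system Java compiler available");
        }
        // In-memory source file, so nothing has to be written to disk.
        JavaFileObject file = new SimpleJavaFileObject(
                URI.create("string:///Sample.java"), JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return source;
            }
        };
        // Collecting diagnostics keeps parse errors off stderr; they could be compared as well.
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<JavaFileObject>();
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, diagnostics, null, null, Collections.singletonList(file));
        StringBuilder out = new StringBuilder();
        for (CompilationUnitTree unit : task.parse()) {
            out.append(unit); // the tree's toString() pretty-prints it as source text
        }
        return out.toString();
    }

    /** True if both dumpers produce identical text for every program in the corpus. */
    static boolean sameAst(AstDumper reference, AstDumper candidate, List<String> corpus)
            throws Exception {
        for (String source : corpus) {
            if (!reference.dump(source).equals(candidate.dump(source))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        List<String> corpus = Arrays.asList("class A { int x = 1 + 2; }", "class B {}");
        // Stand-in comparison: replace the second argument with the generated parser's dumper.
        System.out.println(sameAst(ParserDiff::javacDump, ParserDiff::javacDump, corpus));
    }
}
```

Comparing pretty-printed text is only an approximation of tree equality (it ignores source
positions, for instance), and point 4 would need the collected diagnostics to be compared in the
same way.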

regards
Leszek