Compiling Java 9

Wed Nov 16 21:02:14 UTC 2016

On 11/15/2016 6:34 AM, Stephan Herrmann wrote:
> On 11/15/2016 12:20 AM, Alex Buckley wrote:
>> You have a lot of questions here, mainly not about the OpenJDK
>> implementation of JSR 379 but rather about JSR 379 itself. I'm
>> working on JLS text that will be presented alongside lang-vm.html in
>> the next few weeks, but here are some quick points:
>
> I'm looking forward to the JLS text very much ...

(Oops, I meant JSR 376 (Module System) rather than 379 (SE 9 Platform).)

>> - lang-vm.html does not say that restricted keywords are keywords only
>> inside a module declaration; it says they are keywords
>> _solely where they appear as terminals in ModuleDeclaration_.
>
> Thanks, so for one this settles, that the compilation unit is
> irrelevant for the questions at hand, only the syntactical structure
> of the content is relevant, right?

I get the feeling that by "compilation unit" you mean "file on disk", 
but I never mean that. A compilation unit is a sequence of tokens 
(technically, input elements per JLS 3.5) that matches the 
CompilationUnit production; I have no interest in whether the sequence 
comes from a file, and if it does, what that file's name is.

As to matching the sequence to one of the two alternatives for 
CompilationUnit:

> I'm still reluctant to believe the full consequences of what you seem
> to be saying. With ModuleDeclaration being a non-terminal in the
> grammar, is the definition of "restricted keywords" intended to mean:
>     Whether or not a word is a keyword or identifier is decided *after*
> parsing has completed?
>     Viz.: if a given text could be parsed as a ModuleDeclaration then
> exactly those words that have been used as keywords are keywords.
> It seems so, because before parsing we don't have any knowledge
> whether or not a given text *is* a ModuleDeclaration. Is it even
> justified to speak of "parsing" in this context? Syntax inference?

- If you lex 'package', then the sequence must parse as the first 
alternative.
- If you don't lex 'package', but rather lex 'import', then parsing is 
ambiguous until you've looked ahead to lex either 'open', 'module', or a 
keyword that can start TypeDeclaration. [Ignoring annotations for 
simplicity.]
- If you lexed 'open' or 'module', then the sequence must parse as the 
second alternative; if you lexed anything else, then the sequence must 
parse as the first alternative.
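
The three rules above can be sketched as a small classifier over an 
already-lexed token list. This is only an illustration, not how javac 
does it: the class name CompilationUnitKind, the Kind enum, and the use 
of plain strings as tokens are all assumptions made for the example, and 
annotations are ignored as in the rules.

```java
import java.util.List;

// Hypothetical sketch: decide which CompilationUnit alternative a token
// sequence must match, using only the lookahead described in the rules.
final class CompilationUnitKind {
    enum Kind { ORDINARY, MODULAR }

    static Kind classify(List<String> tokens) {
        int i = 0;
        // Rule 1: 'package' can only start the first (ordinary) alternative.
        if (i < tokens.size() && tokens.get(i).equals("package")) {
            return Kind.ORDINARY;
        }
        // Rule 2: import declarations are allowed in both alternatives,
        // so skip each 'import ... ;' and keep looking ahead.
        while (i < tokens.size() && tokens.get(i).equals("import")) {
            while (i < tokens.size() && !tokens.get(i).equals(";")) {
                i++;
            }
            i++; // step past the ';'
        }
        // Rule 3: 'open' or 'module' here selects the second (modular)
        // alternative; anything else selects the first.
        if (i < tokens.size()) {
            String t = tokens.get(i);
            if (t.equals("open") || t.equals("module")) {
                return Kind.MODULAR;
            }
        }
        return Kind.ORDINARY;
    }
}
```

For example, classify(List.of("module", "module", "{", "}")) selects the 
modular alternative on the first token alone, before the second 'module' 
is examined at all.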

> The current grammar can be parsed, e.g., with a hand-written scannerless
> parser,
> but can we assume any regularities about the grammar now and in the future?
> If not, should all tool implementers be prepared to replace the parser with
> full pattern matching with back tracking, as to try all possible
> combinations of
> interpreting keyword candidates as keywords or identifiers?

My interest is Java SE 9, where we face the problem that 'open', 
'module', 'requires', etc. have been parsed as identifiers for 20 years, 
so we can't turn them into traditional keywords.

We could do something like saying they're keywords only in a 
ModuleDeclaration, which restricts the parser trickery to just the 
'open' and 'module' terms at the start of a ModuleDeclaration. But then, 
you couldn't have 'open' (say) as part of a module name, and we think 
that's an unreasonable restriction. (What about module names that wish 
to begin with a digit, which seems reasonable yet is prohibited by the 
use of Identifier in ModuleName? Good question.) So, we say these terms 
are identifiers like previously, except where they can be keywords in 
ModuleDeclaration.
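
To make the compatibility concern concrete, here is a minimal sketch of 
ordinary code that uses some of these words as identifiers. The class 
and method names are hypothetical, chosen only for the example; the 
point is that code of this shape has always compiled, and would stop 
compiling if the words became traditional keywords.

```java
// Hypothetical example: outside a module declaration, these words are
// plain identifiers and can name local variables.
final class RestrictedWordsAsIdentifiers {
    static int sum() {
        int open = 1;
        int module = 2;
        int requires = 3;
        int exports = 4;
        return open + module + requires + exports;
    }
}
```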

> Thinking aloud about possible consequences, I wonder what happens in
> case of a syntax error. Strictly speaking a text that almost looks
> like a ModuleDeclaration but still cannot be fully accepted as such,
> contains no keywords at all (because we have no ModuleDeclaration),
> right? That's probably a fact which tools should carefully hide from
> users, to avoid producing utterly confusing error messages. Generally
> speaking, the only "correct" syntax error message seems to be: "the
> input text could not be matched as a ModuleDeclaration".

I agree.

> Picking on words, I still hold that my example contains restricted
> keywords as terminals
> in a ModuleDeclaration where they are apparently interpreted (by javac)
> as identifiers:
>
>      module module {            // second word is an identifier
>          requires requires;     // second word is an identifier
>          exports to to exports; // words #2 and #4 are identifiers
>          uses module;           // second word is an identifier
>          provides uses with to; // words #2 and #4 are identifiers
>      }
>
> So, am I possibly still barking up the wrong tree?
> Or should the definition really say
>     "where they appear as _keywords_ in ModuleDeclaration"

OK, I think I see what you're saying. On the first line, you're saying 
that the second 'module' word is parsed by javac as an identifier _and 
Identifier is a terminal symbol on the RHS of ModuleDeclaration_ so why 
does lang-vm say to parse it as a keyword ("they are keywords solely 
where they appear as terminals in ModuleDeclaration")?

Make a close reading of JLS 2.2. It's true that identifiers are terminal 
symbols of the _syntactic grammar_. However, I want to speak of 'open', 
'module', et al as terminal symbols of the _lexical grammar_. Basically 
I mean "a fixed width font presentation of a reserved lexeme spelled o p 
e n".

[Sidebar. The first sentence of JLS 2.4 says that terminal symbols of 
the _syntactic_ grammar are shown in fixed width font, but plainly, 
Identifier is never shown in fixed width font despite being deemed a 
terminal symbol of the syntactic grammar in 2.2. I've tried various ways 
of rewriting 2.4 -- "In the syntactic grammar, some terminal symbols are 
shown in fixed width font, namely keywords + separators + operators + 
certain literals, while other terminal symbols are shown in italic type, 
namely identifiers + certain other literals." -- but it doesn't feel 
very satisfying.]

FWIW I don't want to introduce a new kind of token in the syntactic 
grammar -- identifiers, keywords, ***modwords***, literals, separators, 
operators -- because this whole issue is particular to module 
declarations whose mass is not so large as to move the center of gravity 
of the language, with its five token kinds.

Alex