An alternative to "restricted keywords"

Tue May 9 14:56:11 UTC 2017

(1) I understand the need to keep new module-related keywords from
conflicting with existing code, where these words may be used
as identifiers. Moreover, it must be possible for a module declaration
to refer to packages or types that have such names.

However,

(2) The currently proposed "restricted keywords" are not appropriately
specified in the JLS.

(3) The currently proposed "restricted keywords" pose difficulties to
the implementation of all tools that need to parse a module declaration.

(4) A simple alternative to "restricted keywords" exists, which has not
received the attention it deserves.

Details:

(2) The current specification implicitly violates the assumption that
parsing can be performed on the basis of a token stream produced by
a scanner (aka lexer). From discussion on this list we learned that
the following examples are intended to be syntactically legal:
    module m { requires transitive transitive; }
    module m { requires transitive; }
(Please disregard heuristic solutions for the moment, while we are
  investigating whether "restricted keywords" is a well-defined
  concept in general.)
Of the three occurrences of "transitive", #1 is a keyword, the others
are identifiers. At the point when the parser has consumed "requires"
and now asks about classification of the word "transitive", the scanner
cannot possibly answer. It can only answer for sure
after the *parser* has accepted the full declaration. Put differently,
the parser must consume more tokens than have been classified by the
scanner. In other words, to faithfully parse arbitrary grammars using
a concept of "restricted keywords", scanners must provide speculative
answers, which may later need to be revised by backtracking or similar
exhaustive exploration of the space of possible interpretations.

The specification is totally silent about this fundamental change.

(3) "restricted keywords" pose three problems to tool implementations:

(3.a) Any known practical approach to implementing a parser with
"restricted keywords" has to rely on heuristics, which are based
on the exact set of rules defined in the grammar. Such heuristics
reduce the look-ahead that needs to be performed by the scanner,
in order to avoid the full exhaustive exploration mentioned above.
A set of such heuristics is extremely fragile and can easily break when
more rules are later added to the grammar. This means small future
language changes can easily break any chosen strategy.
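
To make the kind of heuristic meant in (3.a) concrete, here is a minimal
sketch in Java. The rule it encodes is only an assumption made for
illustration (it is not taken from the JLS or from any existing tool):
after "requires", the word "transitive" is treated as a module name if
the next token is ";" or ".", and as a modifier keyword otherwise. The
names RequiresLookahead and classifyTransitive are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical one-token look-ahead heuristic for "transitive" after "requires".
final class RequiresLookahead {
    enum Kind { KEYWORD, IDENTIFIER }

    // Assumed rule (illustration only): "transitive" is (part of) a module
    // name when the next token is ";" or ".", otherwise it is a modifier.
    static Kind classifyTransitive(List<String> tokens, int index) {
        if (!"transitive".equals(tokens.get(index))) {
            throw new IllegalArgumentException("not at 'transitive': " + tokens.get(index));
        }
        // At end of input there is nothing to look at; the guess below is arbitrary.
        String next = index + 1 < tokens.size() ? tokens.get(index + 1) : "";
        boolean partOfName = next.equals(";") || next.equals(".");
        return partOfName ? Kind.IDENTIFIER : Kind.KEYWORD;
    }

    public static void main(String[] args) {
        List<String> both = Arrays.asList("requires", "transitive", "transitive", ";");
        List<String> nameOnly = Arrays.asList("requires", "transitive", ";");
        System.out.println(classifyTransitive(both, 1));     // KEYWORD
        System.out.println(classifyTransitive(both, 2));     // IDENTIFIER
        System.out.println(classifyTransitive(nameOnly, 1)); // IDENTIFIER
    }
}
```

Note that the answer depends on a token the scanner has not yet
classified, and that the rule is tied to the current shape of the
"requires" production: any new rule that allows another token after
"transitive" would require revisiting it.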

(3.b) If parsing works for error-free input, this doesn't imply that
a parser will be able to give any useful answer for input with syntax
errors. As a worst-case example consider an arbitrary input sequence
consisting of just the two words "requires" and "transitive" in random
order and with no punctuation.
A parser will not be able to detect any structure in this sequence.
By comparison, normal keywords serve as a baseline, where parsing
typically can resume regardless of any leading garbage.
While this is not relevant for normal compilation, it is paramount
for assistive functions, which most of the time operate on incomplete
text that is likely to contain syntax errors.
Strictly speaking, any "module declaration" with syntax errors is
not a ModuleDeclaration, and thus none of the "restricted keywords"
can be interpreted as keywords (which per JLS can only happen inside
a ModuleDeclaration).
All this means that functionality like code completion is
systematically broken in a language using "restricted keywords".

(3.c) Other IDE functionality assumes that small fragments of the
input text can be scanned out of context. The classical example here
is syntax highlighting but there are more examples.
Any such functionality has to be re-implemented, replacing the
highly efficient local scanning with full parsing of the input text.
For functionality that is implicitly invoked per keystroke, or on
mouse hover etc, this difference in efficiency negatively affects
the overall user experience of an IDE.

(4) The following proposal avoids all difficulties described above:

* open, module, requires, transitive, exports, opens, to, uses,
    provides, and with are "module words", to which the following
    interpretation is applied:
    * within any ordinary compilation unit, a module word is a normal
      identifier.
    * within a modular compilation unit, all module words are
      (unconditional) keywords.
* We introduce three new auxiliary non-terminals:
      LegacyPackageName:
          LegacyIdentifier
          LegacyPackageName . LegacyIdentifier
      LegacyTypeName:
          LegacyIdentifier
          LegacyTypeName . LegacyIdentifier
      LegacyIdentifier:
          Identifier
          ^open
          ^module
          ...
          ^with
* We modify all productions in 7.7, replacing PackageName with
   LegacyPackageName and replacing TypeName with LegacyTypeName.
* After parsing, each of the words '^open', '^module' etc.
   is interpreted by removing the leading '^' (escape character).

Here, '^' is chosen as the escape character following the precedent
of Xtext. Plenty of other options for this purpose are possible, too.
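
As a rough illustration of how little a scanner needs under this
proposal, here is a minimal sketch in Java. The class and method names
(ModuleWordScanner, classify, identifierText) are hypothetical, and the
sketch deliberately ignores the ordinary Java keywords and any
validation of the word itself; it only shows the two rules above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of word classification under the "module words" proposal.
final class ModuleWordScanner {
    enum Kind { KEYWORD, IDENTIFIER }

    static final char ESCAPE = '^';

    static final Set<String> MODULE_WORDS = new HashSet<>(Arrays.asList(
            "open", "module", "requires", "transitive", "exports",
            "opens", "to", "uses", "provides", "with"));

    // A module word is a keyword in a modular compilation unit and a
    // normal identifier anywhere else. No look-ahead is involved.
    static Kind classify(String word, boolean modularUnit) {
        if (modularUnit && MODULE_WORDS.contains(word)) {
            return Kind.KEYWORD;
        }
        return Kind.IDENTIFIER;
    }

    // After parsing, an escaped module word such as "^open" stands for
    // the identifier "open"; every other word is left unchanged.
    static String identifierText(String word) {
        boolean escaped = word.length() > 1
                && word.charAt(0) == ESCAPE
                && MODULE_WORDS.contains(word.substring(1));
        return escaped ? word.substring(1) : word;
    }

    public static void main(String[] args) {
        System.out.println(classify("transitive", true));   // KEYWORD
        System.out.println(classify("transitive", false));  // IDENTIFIER
        System.out.println(classify("^transitive", true));  // IDENTIFIER
        System.out.println(identifierText("^transitive"));  // transitive
    }
}
```

With such a classification, in "requires transitive ^transitive;" the
first "transitive" is a keyword and "^transitive" is an identifier, and
each word can be classified in isolation, knowing only which kind of
compilation unit it came from.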

This proposal completely satisfies requirement (1), and avoids
all of the problems in (2) and (3). There's an obvious price to pay:
users will have to add the escape character when referring to code
that uses a module word as a package name or type name.

Not only is this a very low price compared to the benefits; one can
even argue that it also helps the human reader of a module declaration,
because it clearly marks which occurrences of a module word are indeed
identifiers.

An IDE can easily help in interactively adding escapes where necessary.

Finally, in this trade-off it is relevant to consider the expected
frequencies: legacy names (needing escape) will surely be the exception,
by orders of magnitude. So the small price to be paid will only
affect a comparatively small number of locations.

Stephan