An alternative to "restricted keywords"

Tue May 16 09:49:45 UTC 2017

Thanks, Remi, for taking this to the EG list.

Some collected responses:

Remi: "from the user point of view, '^' looks like a hack"

This is, of course, a subjective statement. I don't share this view,
and in years of experience with Xtext languages (where this concept
is used by default) I have never heard any user complain about it.

More importantly, I hold that such aesthetic considerations matter
far less than the question of whether we can explain - unambiguously
explain - the concept in a few simple sentences.
Explaining must be possible at two levels: in a rigorous specification
and in simple words for users of the language.

Remi: "a keyword which is activated if you are at a position in the
  grammar where it can be recognized".

I don't think 'being at a position in the grammar' is a good way of
explaining it. Parsing doesn't generally have one position in a grammar;
multiple productions can be active in the same parser state.
Also, speaking of a "loop" for modifiers seems to complicate matters
more than necessary.

Under these considerations I still see '^' as the clearest of all
solutions. Clear as a specification, simple to explain to users.

Peter spoke about module names vs. package names.

I think we agree that module names cannot use "module words",
whereas package names should be expected to contain them.

Remi: "you should use reverse DNS naming for package so no problem :)"

"to" is a "module word" and a TLD.
I think we should be very careful in judging that an existing conflict
is not a real problem. Better to clearly and rigorously avoid the
conflict in the first place.
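
To make the conflict concrete: a reverse-DNS package under the "to" TLD
starts with a module word. The following is only a sketch, assuming the
'^' escape discussed in this thread; the class and method names are
invented for illustration, and the word list is the one from the
original proposal quoted below.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of any JDK or compiler API.
final class ModuleWordEscaper {
    // "Module words" as listed in the original proposal.
    static final Set<String> MODULE_WORDS = Set.of(
            "open", "module", "requires", "transitive", "exports",
            "opens", "to", "uses", "provides", "with");

    // Prefixes each segment of a dotted name that is a module word with '^'.
    static String escape(String dottedName) {
        return Arrays.stream(dottedName.split("\\."))
                .map(s -> MODULE_WORDS.contains(s) ? "^" + s : s)
                .collect(Collectors.joining("."));
    }

    public static void main(String[] args) {
        System.out.println(escape("to.example.util"));   // ^to.example.util
        System.out.println(escape("com.example.util"));  // com.example.util
    }
}
```

Only the first segment of the "to" package needs the escape; names
without module words pass through unchanged.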

Some additional notes from my side:

In the escape-approach, it may be prudent to technically allow
escaping even words that are identifiers in Java 9, but could become
keywords in a future version. This ensures that modules which need
more escaping in Java 9+X can still be parsed in Java 9.
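
A minimal sketch of this lenient rule, assuming '^' as the escape
character; the class and method names are hypothetical. The point is
that unescaping does not consult the current list of module words, so
an escape written for a word that only becomes a keyword in Java 9+X
would still be accepted today.

```java
// Hypothetical helper illustrating lenient unescaping; not an existing API.
final class LenientUnescape {
    // Strips one leading '^' from a name segment, whatever word follows,
    // provided the remainder starts like a Java identifier.
    // Simplification: ordinary Java keywords are not filtered out here.
    static String unescape(String segment) {
        if (segment.length() > 1
                && segment.charAt(0) == '^'
                && Character.isJavaIdentifierStart(segment.charAt(1))) {
            return segment.substring(1);
        }
        return segment;
    }

    public static void main(String[] args) {
        System.out.println(unescape("^to"));          // to
        System.out.println(unescape("^futureword"));  // futureword
        System.out.println(unescape("plain"));        // plain
    }
}
```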

The focus so far has been on names of modules, packages and types.
A complete solution must also give an answer for annotations on modules.
Some possible solutions:
a. Assume that annotations for modules are designed with modules in mind
    and thus have to avoid any module words in their names.
b. Support escaping in annotations as well.
c. Refine the scope where "module words" are keywords, let it start only
    when the word "module" or the group "open module" has been consumed.
    This would make the words "module" and "open" special, as being
    switch words, where we switch from one language to another.
    (For this I previously coined the term "scoped keywords" [1])
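
Option (c) could be sketched roughly as follows. This is an illustration
under simplifying assumptions, not a proposal for an actual scanner: the
input is already split into tokens, and the class, enum and method names
are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of "scoped keywords"; names are illustrative only.
final class ScopedKeywordClassifier {
    enum Kind { SWITCH_WORD, KEYWORD, OTHER }

    // "Module words" as listed in the original proposal.
    static final Set<String> MODULE_WORDS = Set.of(
            "open", "module", "requires", "transitive", "exports",
            "opens", "to", "uses", "provides", "with");

    // Classifies tokens left to right. Module words become keywords only
    // after "module" or "open module" has been consumed.
    static List<Kind> classify(List<String> tokens) {
        List<Kind> kinds = new ArrayList<>();
        boolean inModuleScope = false;
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (inModuleScope) {
                kinds.add(MODULE_WORDS.contains(t) ? Kind.KEYWORD : Kind.OTHER);
            } else if (t.equals("module")) {
                kinds.add(Kind.SWITCH_WORD);
                inModuleScope = true;
            } else if (t.equals("open") && i + 1 < tokens.size()
                    && tokens.get(i + 1).equals("module")) {
                // The following "module" token flips the scope.
                kinds.add(Kind.SWITCH_WORD);
            } else {
                kinds.add(Kind.OTHER);
            }
        }
        return kinds;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("@", "with", ".", "Marker",
                "module", "m", "{", "exports", "p", "to", "q", ";", "}");
        System.out.println(classify(tokens));
    }
}
```

In this sketch the "with" inside the annotation name stays an ordinary
word, while "exports" and "to" inside the module body are keywords.
An annotation name containing "module" itself would still flip the scope
too early, which is why those two words remain special under option (c).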

I think we all agree that the conflicts we are solving here are rare
corner cases. Most names do not contain module words. Still, from a
conceptual and technical point of view the solution must be bulletproof.
But there's no need to be afraid of module declarations being spammed
with dozens of '^' characters. Realistically, this will not happen.

Stephan

[1] http://www.objectteams.org/def/1.3/sA.html#sA.0.1

On 12.05.2017 21:21, Remi Forax wrote:
> Hi Peter,
>
> On May 12, 2017 6:08:58 PM GMT+02:00, Peter Levart <peter.levart at gmail.com> wrote:
>> Hi Remi,
>>
>> On 05/12/2017 08:17 AM, Remi Forax wrote:
>>> [CC JPMS expert mailing list because, it's an important issue IMO]
>>>
>>> I've a counter proposition.
>>>
>>> I do not like your proposal because from the user point of view, '^'
>>> looks like a hack, it's not used anywhere else in the grammar.
>>> I agree that restricted keywords are not properly specified in JLS.
>>> Reading your mail, i've discovered that what i was calling restricted
>>> keywords is not what javac implements :(
>>> I agree that restricted keywords should be only enabled when parsing
>>> module-info.java
>>> I agree that doing error recovery on the way the grammar for
>>> module-info is currently implemented in javac leads to less than ideal
>>> error messages.
>>>
>>> In my opinion, both
>>>     module m { requires transitive transitive; }
>>>     module m { requires transitive; }
>>> should be rejected because what javac implements is something closer
>>> to the JavaScript ASI rules than restricted keywords as currently
>>> specified by Alex.
>>>
>>> For me, a restricted keyword is a keyword which is activated if you
>>> are at a position in the grammar where it can be recognized and, because
>>> it's a keyword, it takes precedence over an identifier.
>>> For example, for
>>>    module m {
>>> if the next token is 'requires', it should be recognized as a keyword
>>> because you can parse a directive 'requires ...' so there is a
>>> production that starts with the 'requires' keyword.
>>>
>>> so
>>>    module m { requires transitive; }
>>> should be rejected because transitive should be recognized as a
>>> keyword after requires and the compiler should report a missing module
>>> name.
>>>
>>> and
>>>    module m { requires transitive transitive; }
>>> should be rejected because the grammar that parses the modifiers is
>>> defined as "a loop" so from the grammar point of view it's like
>>>    module m { requires Modifier Modifier; }
>>> so the front end of the compiler should report a missing module
>>> name and a later phase should report that the same modifier
>>> 'transitive' appears twice.
>>>
>>> I believe that with this definition of 'restricted keyword', the compiler
>>> can recover from errors more easily and offer meaningful error messages,
>>> and the module-info part of the grammar is LR(1).
>>
>> This will make "requires", "uses", "provides", "with", "to", "static",
>> "transitive", "exports", etc. all illegal module names. Ok, no big
>> deal, because there are no module names yet (apart from JDK modules and
>> those are named differently). But...
>
> you should use reverse DNS naming for module name, so no problem.
>
>>
>> What about:
>>
>> module m { exports transitive; }
>>
>> Here 'transitive' is an existing package name for example. Who
>> guarantees that there are no packages out there with names matching
>> restricted keywords? Current restriction for modules is that they can
>> not have an unnamed package. Do we want to restrict package names a
>> module can export too?
>
> you should use reverse DNS naming for package so no problem :)
>
>>
>> Stephan's solution does not have this problem.
>>
>> Regards, Peter
>
> I think those issues are not real problems.
>
> Rémi
>
>>
>>>
>>> regards,
>>> Rémi
>>>
>>> ----- Original Message -----
>>>> From: "Stephan Herrmann" <stephan.herrmann at berlin.de>
>>>> To: jigsaw-dev at openjdk.java.net
>>>> Sent: Tuesday, 9 May 2017 16:56:11
>>>> Subject: An alternative to "restricted keywords"
>>>>
>>>> (1) I understand the need for avoiding that new module-related
>>>> keywords conflict with existing code, where these words may be used
>>>> as identifiers. Moreover, it must be possible for a module declaration
>>>> to refer to packages or types thusly named.
>>>>
>>>> However,
>>>>
>>>> (2) The currently proposed "restricted keywords" are not appropriately
>>>> specified in JLS.
>>>>
>>>> (3) The currently proposed "restricted keywords" pose difficulties to
>>>> the implementation of all tools that need to parse a module declaration.
>>>>
>>>> (4) A simple alternative to "restricted keywords" exists, which has not
>>>> received the attention it deserves.
>>>>
>>>> Details:
>>>>
>>>> (2) The current specification implicitly violates the assumption that
>>>> parsing can be performed on the basis of a token stream produced by
>>>> a scanner (aka lexer). From discussion on this list we learned that
>>>> the following examples are intended to be syntactically legal:
>>>>     module m { requires transitive transitive; }
>>>>     module m { requires transitive; }
>>>> (Please for the moment disregard heuristic solutions, while we are
>>>>   investigating whether generally "restricted keywords" is a
>>>>   well-defined concept, or not.)
>>>> Of the three occurrences of "transitive", #1 is a keyword, the others
>>>> are identifiers. At the point when the parser has consumed "requires"
>>>> and now asks about classification of the word "transitive", the scanner
>>>> cannot possibly answer this classification. It can only answer for sure
>>>> after the *parser* has accepted the full declaration. Put differently,
>>>> the parser must consume more tokens than have been classified by the
>>>> scanner. In other words, to faithfully parse arbitrary grammars using
>>>> a concept of "restricted keywords", scanners must provide speculative
>>>> answers, which may later need to be revised by backtracking or similar
>>>> exhaustive exploration of the space of possible interpretations.
>>>>
>>>> The specification is totally silent about this fundamental change.
>>>>
>>>>
>>>> (3) "restricted keywords" pose three problems to tool implementations:
>>>>
>>>> (3.a) Any known practical approach to implementing a parser with
>>>> "restricted keywords" requires heuristics, which are based
>>>> on the exact set of rules defined in the grammar. Such heuristics
>>>> reduce the look-ahead that needs to be performed by the scanner,
>>>> in order to avoid the full exhaustive exploration mentioned above.
>>>> A set of such heuristics is extremely fragile and can easily break when
>>>> later more rules are added to the grammar. This means small future
>>>> language changes can easily break any chosen strategy.
>>>>
>>>> (3.b) If parsing works for error-free input, this doesn't imply that
>>>> a parser will be able to give any useful answer for input with syntax
>>>> errors. As a worst-case example consider an arbitrary input sequence
>>>> consisting of just the two words "requires" and "transitive" in random
>>>> order and with no punctuation.
>>>> A parser will not be able to detect any structure in this sequence.
>>>> By comparison, normal keywords serve as a baseline, where parsing
>>>> typically can resume regardless of any leading garbage.
>>>> While this is not relevant for normal compilation, it is paramount
>>>> for assistive functions, which most of the time operate on incomplete
>>>> text, likely to contain even syntax errors.
>>>> Strictly speaking, any "module declaration" with syntax errors is
>>>> not a ModuleDeclaration, and thus none of the "restricted keywords"
>>>> can be interpreted as keywords (which per JLS can only happen inside
>>>> a ModuleDeclaration).
>>>> All this means that functionality like code completion is
>>>> systematically broken in a language using "restricted keywords".
>>>>
>>>> (3.c) Other IDE functionality assumes that small fragments of the
>>>> input text can be scanned out of context. The classical example here
>>>> is syntax highlighting but there are more examples.
>>>> Any such functionality has to be re-implemented, replacing the
>>>> highly efficient local scanning with full parsing of the input text.
>>>> For functionality that is implicitly invoked per keystroke, or on
>>>> mouse hover etc, this difference in efficiency negatively affects
>>>> the overall user experience of an IDE.
>>>>
>>>>
>>>> (4) The following proposal avoids all difficulties described above:
>>>>
>>>> * open, module, requires, transitive, exports, opens, to, uses,
>>>>     provides, and with are "module words", to which the following
>>>>     interpretation is applied:
>>>>     * within any ordinary compilation unit, a module word is a normal
>>>>       identifier.
>>>>     * within a modular compilation unit, all module words are
>>>>       (unconditional) keywords.
>>>> * We introduce three new auxiliary non-terminals:
>>>>       LegacyPackageName:
>>>>           LegacyIdentifier
>>>>           LegacyPackageName . LegacyIdentifier
>>>>       LegacyTypeName:
>>>>           LegacyIdentifier
>>>>           LegacyTypeName . LegacyIdentifier
>>>>       LegacyIdentifier:
>>>>           Identifier
>>>>           ^open
>>>>           ^module
>>>>           ...
>>>>           ^with
>>>> * We modify all productions in 7.7, replacing PackageName with
>>>>    LegacyPackageName and replacing TypeName with LegacyTypeName.
>>>> * After parsing, each of the words '^open', '^module' etc.
>>>>    is interpreted by removing the leading '^' (escape character).
>>>>
>>>> Here, '^' is chosen as the escape character following the precedent
>>>> of Xtext. Plenty of other options for this purpose are possible, too.
>>>>
>>>>
>>>>
>>>> This proposal completely satisfies the requirements (1), and avoids
>>>> all of the problems (2) and (3). There's an obvious price to pay:
>>>> users will have to add the escape character when referring to code
>>>> that uses a module word as a package name or type name.
>>>>
>>>> Not only is this a very low price compared to the benefits; one can
>>>> even argue that it also helps the human reader of a module declaration,
>>>> because it clearly marks which occurrences of a module word are indeed
>>>> identifiers.
>>>>
>>>> An IDE can easily help in interactively adding escapes where necessary.
>>>>
>>>> Finally, in this trade-off it is relevant to consider the expected
>>>> frequencies: legacy names (needing escape) will surely be the exception
>>>> - by orders of magnitude. So the small price to be paid will only
>>>> affect a comparatively small number of locations.
>>>>
>>>>
>>>> Stephan
>