Identifier Ignorable characters in keywords and literals

Wed Sep 23 00:10:11 UTC 2020

An ignorable Unicode escape such as `\u0001` is a legitimate character 
in a character literal, string literal, or text block, so javac accepts 
and translates it there. In contrast, it seems that javac accepts _and 
discards_ an ignorable Unicode escape:

1. in the body of a comment;
2. as a Java-letter-or-digit in an identifier (i.e., not as the first 
character of an identifier, but as any subsequent character);
3. in a position to the right of a non-ignorable character within a 
keyword (thus allowing for appearance at the end of a keyword, and for 
consecutive ignorable escapes: `class\u0001\u0001`);
4. in a position to the right of a non-ignorable character within a 
boolean literal or null literal.
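
As a quick check on case 2, the standard `java.lang.Character` predicates can be asked how they classify U+0001. This is only a sketch: the class name, constant, and helper methods are made up for illustration, and it shows the library's classification, not how javac itself is implemented.

```java
final class IgnorableEscapeCheck {
    // U+0001, the code point used in the examples in this thread. It is
    // written as a cast so this file contains no raw Unicode escapes.
    static final char IGNORABLE = (char) 0x0001;

    // True if the character may appear after the first character of an identifier.
    static boolean canContinueIdentifier(char c) {
        return Character.isJavaIdentifierPart(c);
    }

    // True if the character may appear as the first character of an identifier.
    static boolean canStartIdentifier(char c) {
        return Character.isJavaIdentifierStart(c);
    }

    public static void main(String[] args) {
        System.out.println(Character.isIdentifierIgnorable(IGNORABLE)); // true
        System.out.println(canContinueIdentifier(IGNORABLE));           // true
        System.out.println(canStartIdentifier(IGNORABLE));              // false
    }
}
```

The three results line up with the wording of case 2: the character is ignorable, counts as a Java-letter-or-digit, and cannot begin an identifier.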

Cases 1 and 2 are to spec. Cases 3 and 4 are new to the spec. There seems 
to be a connection between 2 and 3+4: javac appears to expect keywords to 
follow the same Java-letter-followed-by-Java-letters-or-digits format as 
identifiers.
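
To make that conjecture concrete, here is a toy model of a scanner that reads one identifier-shaped word, drops ignorable characters after the first character, and only then looks the word up in a table of reserved words. It is an assumption-laden sketch, not javac's scanner: the class, the methods, and the abbreviated table are all hypothetical.

```java
import java.util.Set;

final class IgnorableScannerSketch {
    // Hypothetical, abbreviated table: a few keywords plus the boolean and
    // null literals. This is not javac's actual data structure.
    static final Set<String> RESERVED =
            Set.of("class", "public", "static", "void", "true", "false", "null");

    // Reads one identifier-shaped word from the start of the input, discarding
    // ignorable characters that follow the first character. Returns "" if the
    // input does not begin with a Java letter.
    static String scanWord(String input) {
        if (input.isEmpty() || !Character.isJavaIdentifierStart(input.charAt(0))) {
            return "";
        }
        StringBuilder word = new StringBuilder().append(input.charAt(0));
        for (int i = 1; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!Character.isJavaIdentifierPart(c)) {
                break; // end of the word
            }
            if (!Character.isIdentifierIgnorable(c)) {
                word.append(c); // keep only non-ignorable characters
            }
        }
        return word.toString();
    }

    // The reserved-word lookup happens after ignorable characters are gone.
    static String classify(String input) {
        String word = scanWord(input);
        if (word.isEmpty()) {
            return "NOT_A_WORD";
        }
        return (RESERVED.contains(word) ? "RESERVED:" : "IDENTIFIER:") + word;
    }

    public static void main(String[] args) {
        char c1 = (char) 0x0001; // ignorable code point U+0001
        System.out.println(classify("cl" + c1 + "ass"));   // RESERVED:class
        System.out.println(classify("class" + c1 + c1));   // RESERVED:class
        System.out.println(classify("nu" + c1 + "ll"));    // RESERVED:null
        System.out.println(classify(c1 + "class"));        // NOT_A_WORD
    }
}
```

Under this model, cases 3 and 4 fall out of case 2 for free, including the restriction that the ignorable escape must sit to the right of a non-ignorable character.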

Alex

On 9/22/2020 4:07 PM, Pravin Jain wrote:
> Thanks for the clarifications.
> But let me point out that the Identifier Ignorable characters are
> ignored not only in keywords but also in the three literals "true",
> "false" and "null".
> 
> Thanks and Regards,
> Pravin
> 
> On Tue, Sep 22, 2020 at 11:11 PM Alex Buckley <alex.buckley at oracle.com> wrote:
>>
>> // Adding Dan explicitly
>>
>> On 9/21/2020 10:39 PM, Pravin Jain wrote:
>>> The following code compiles and executes successfully.
>>>
>>> public cl\u0001ass Identifier\u0002Ignorable {
>>>       public sta\u0003tic vo\u0004id ma\u0005in(String[] args) {
>>>           System.out.println("Hello world");
>>>       }
>>> }
>>>
>>> The JLS mentions that Identifier-Ignorable characters are allowed
>>> in an Identifier, but it does not mention their use in a keyword or
>>> a literal. From the specification, one does not gather that these
>>> characters will be ignored when used inside a keyword or a literal.
>>> Is this an error in the compiler, or has the JLS missed clarifying
>>> this point?
>>
>> It would be legitimate for JLS 3.3 to acknowledge that some `\uxxxx`
>> Unicode escapes represent UTF-16 code units which denote "ignorable"
>> code points; such UTF-16 code units are _not_ included in the sequence
>> of Unicode input characters resulting from this translation step.
>>
>> Dan, is it possible to make this small clarification in the JLS ch.3
>> update for contextual keywords?
>>
>> The text in 3.8 -- "Two identifiers are the same only if, after ignoring
>> characters that are ignorable, the identifiers have the same Unicode
>> character for each letter or digit." -- would be slightly redundant in
>> calling out ignorable characters, but it should not be changed because
>> it states a clear, easy-to-understand rule for Java programmers looking
>> to go beyond ASCII in their identifiers.
>>
>> Alex