JDK-8254073, unicode escape preprocessing, and \u005C

Tue Jul 6 23:26:40 UTC 2021

(I slightly reordered the table in JDK-8269290 in support of the 
following exposition.)

Since time immemorial, X = "\u005C\\u005D" has printed `\]` on the 
grounds that the six opening characters \ u 0 0 5 C form a backslash for 
the purpose of counting how many backslash characters contiguously 
precede the backslash in the final six characters \ u 0 0 5 D. (Two, 
making the backslash in \ u 0 0 5 D eligible to begin a Unicode escape.)

Given Y = "\u005C\u005C\u005D", it's consistent for the six opening 
characters \ u 0 0 5 C to again form a backslash for the purpose of 
counting how many backslash characters contiguously precede the 
backslash in the middle six characters \ u 0 0 5 C. Thus, 
"\u005C\u005C..." is treated the same as "\\u005C...".

I acknowledge this is an incompatible change, but consider the 
alternative. If the six opening characters *didn't* contribute a 
backslash to the count for \ u 0 0 5 C in the Y case, then the same six 
opening characters wouldn't contribute a backslash to the count for \ u 
0 0 5 D in the X case. (In this alternative universe, people take the 
rule "The character produced by a Unicode escape does not participate in 
further Unicode escapes." literally.) Thus, in the X case, there would 
only be one backslash, denoted by the ASCII character, preceding the 
final six characters \ u 0 0 5 D  ==>  the \ in \ u 0 0 5 D would not be 
eligible  ==>  X would lex as \ \ \ u 0 0 5 D and print as `\\u005D` 
which is plain wrong.
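
To make the two counting rules concrete, here is a minimal Java sketch of the first lexical translation step. It is not javac's implementation: the class name EscapeSketch, the method translate, and the flag countTranslated are all made up for illustration. With countTranslated set to true, a backslash produced by a Unicode escape counts toward the contiguous-backslash total (the consistency-first reading); with false, only ASCII backslashes count (the alternative universe above). Raw inputs are assembled from pieces so that the sketch's own source contains no Unicode escapes.

```java
class EscapeSketch {
    /**
     * Translates Unicode escapes in raw input (hypothetical helper, not javac code).
     * countTranslated decides whether a backslash produced by an escape counts
     * toward the contiguous-backslash total seen by later backslashes.
     */
    static String translate(String raw, boolean countTranslated) {
        StringBuilder out = new StringBuilder();
        int run = 0; // number of backslashes contiguously preceding position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c != '\\') {
                out.append(c);
                run = 0;
                i++;
                continue;
            }
            // Skip over one or more 'u' characters following this backslash.
            int j = i + 1;
            while (j < raw.length() && raw.charAt(j) == 'u') {
                j++;
            }
            boolean eligible = run % 2 == 0;
            if (eligible && j > i + 1) {
                // Eligible backslash followed by u: four hex digits are required.
                int value = 0;
                for (int k = j; k < j + 4; k++) {
                    int d = k < raw.length() ? Character.digit(raw.charAt(k), 16) : -1;
                    if (d < 0) {
                        throw new IllegalArgumentException("malformed Unicode escape at index " + i);
                    }
                    value = value * 16 + d;
                }
                char decoded = (char) value;
                out.append(decoded);
                // The decoded character never starts a further escape; the only
                // question is whether a decoded backslash extends the run.
                run = (decoded == '\\' && countTranslated) ? run + 1 : 0;
                i = j + 4;
            } else {
                // Not eligible, or not followed by u: the backslash passes through.
                out.append('\\');
                run++;
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bs = "\\"; // a single backslash character
        String x = bs + "u005C" + bs + bs + "u005D";        // the X case
        String y = bs + "u005C" + bs + "u005C" + bs + "u005D"; // the Y case
        System.out.println("X, translated backslash counts: " + translate(x, true));
        System.out.println("X, only ASCII backslash counts: " + translate(x, false));
        System.out.println("Y, translated backslash counts: " + translate(y, true));
        System.out.println("Y, only ASCII backslash counts: " + translate(y, false));
    }
}
```

Under these assumptions, X translates to \ \ ] when the translated backslash counts and to \ \ \ u 0 0 5 D when it does not, while Y translates to \ \ u 0 0 5 C ] when it counts.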

(Maybe there is some application of the "longest possible translation" 
rule from 3.2 that lets the same six opening characters become a 
backslash-that-counts in X but not become a backslash-that-counts in Y. 
However, I do not know how to describe that application.)

Here's another test case for the CSR. JDK 15 does this:

jshell> System.out.println("\\u005D");
\u005D

jshell> System.out.println("\u005C\u005D");
|  Error:
|  illegal escape character
|  System.out.println("\u005C\u005D");
|                                 ^

With my consistency-first approach, the Z = "\u005C\u005D" case is 
legal, which seems far more reasonable than illegal. The six opening 
characters \ u 0 0 5 C form a backslash for the purpose of counting how 
many backslash characters contiguously precede the backslash in the 
final six characters \ u 0 0 5 D. (One, meaning the backslash in \ u 0 0 
5 D is not eligible to begin a Unicode escape.) The result would be \ \ 
u 0 0 5 D which would print as `\u005D`.
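
The second level of the escape story, the interpretation of escape sequences inside a string literal (JLS 3.10.7), can be sketched in the same spirit. The class LiteralSketch and its unescape method are hypothetical and handle only a handful of escapes; the input is assumed to be the body of a string literal after Unicode escape translation has already happened.

```java
class LiteralSketch {
    /**
     * Interprets a few string-literal escape sequences (hypothetical helper).
     * Any other character after a backslash is rejected as an illegal escape.
     */
    static String unescape(String body) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i + 1 >= body.length()) {
                throw new IllegalArgumentException("dangling backslash");
            }
            char next = body.charAt(++i);
            switch (next) {
                case '\\': out.append('\\'); break;
                case 'n':  out.append('\n'); break;
                case 't':  out.append('\t'); break;
                case '"':  out.append('"');  break;
                case '\'': out.append('\''); break;
                default:
                    throw new IllegalArgumentException("illegal escape character: \\" + next);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bs = "\\"; // a single backslash character
        // Z after translation under the consistency-first rule: two backslashes, then u005D.
        System.out.println(unescape(bs + bs + "u005D"));
        // X after translation under the same rule: two backslashes, then ].
        System.out.println(unescape(bs + bs + "]"));
        // Three backslashes, then u005D: the third backslash starts an illegal escape.
        try {
            unescape(bs + bs + bs + "u005D");
        } catch (IllegalArgumentException e) {
            System.out.println("error: " + e.getMessage());
        }
    }
}
```

With these inputs the first call yields the six characters \ u 0 0 5 D, the second yields \ ], and the third is rejected because \ followed by u is not a legal escape sequence at this level.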

Net net, I favor the correct fix -- and lots more test cases in the JCK.

Alex

On 7/2/2021 5:47 AM, Jim Laskey wrote:
> Just so it doesn't look like I went rogue with the bug fix 
> (https://bugs.openjdk.java.net/browse/JDK-8269290), I would like a 
> consensus ruling on which bug fix I should use:
> 
> correct fix:
> 
> interpretAsPerJLS();
> 
> 
> faithful fix:
> 
> if (sourceLevel <= 15)
>      interpretOldWay();
> else
>      interpretAsPerJLS();
> 
> status quo fix:
> 
> interpretOldWay();
> 
> I'm assuming correct fix, but others may have different assumptions.
> 
> Cheers,
> 
> -- Jim
> 
>> On Jun 25, 2021, at 4:04 PM, Alex Buckley <alex.buckley at oracle.com> wrote:
>>
>> I filed https://bugs.openjdk.java.net/browse/JDK-8269406 with some 
>> additional discussion about what the result of the first lexical 
>> translation step is really meant to be.
>>
>> Please take a look if you are familiar with the three-step translation 
>> described in JLS 3.2, and care about how the input stream is processed.
>>
>> Alex
>>
>> On 6/22/2021 10:38 AM, Alex Buckley wrote:
>>> I am minded to extend the final note in JLS 3.3 to help people 
>>> understand the multi-level escape story in play when they experiment 
>>> with Unicode escapes. Perhaps it will also improve some javac error 
>>> messages or test cases. Let me know what you think of this:
>>> -----
>>> For example, the input stream \u005cu005a results in the six 
>>> characters \ u 0 0 5 a, because 005c is the Unicode value for \. It 
>>> does not result in the character Z, which is Unicode character 005a, 
>>> because the \ that resulted from the \u005c is not interpreted as the 
>>> start of a further Unicode escape.
>>> Note that \u005cu005a cannot be written in a string literal to denote 
>>> the six characters \ u 0 0 5 a. This is because the first two 
>>> characters resulting from translation, \ and u, are interpreted in a 
>>> string literal as an illegal escape sequence (3.10.7).
>>> Fortunately, the rule about contiguous \ characters helps programmers 
>>> to craft input streams that denote Unicode escapes in a string 
>>> literal. Denoting the six characters \ u 0 0 5 a in a string literal 
>>> simply requires another \ to be written adjacent to the existing \, 
>>> such as in "Z is \\u005a". This works because the second \ in the 
>>> input stream \\u005a is not eligible, so the first \ and second \ are 
>>> preserved as raw input characters; they are subsequently interpreted 
>>> in a string literal as the escape sequence for a backslash, resulting 
>>> in the desired six characters \ u 0 0 5 a. Without the rule, the 
>>> input stream \\u005a would be translated as the raw input character \ 
>>> followed by the Unicode escape \u005a (Z), but \Z is an illegal 
>>> escape sequence in a string literal.
>>> The rule also allows programmers to craft input streams that denote 
>>> escape sequences in a string literal. For example, the input stream 
>>> \\\u006e results in the three characters \ \ n because the third \ is 
>>> eligible and thus \u006e is translated to n, while the first \ and 
>>> second \ are preserved as raw input characters. The three characters 
>>> \ \ n are subsequently interpreted in a string literal as \ n which 
>>> denotes the escape sequence for a linefeed. (The input stream 
>>> \\\u006e may also be written as \u005c\u005c\u006e.)
>>> -----
>>> Alex
>>> On 6/21/2021 4:41 PM, Alex Buckley wrote:
>>>> There's no question that the first six raw input characters \ u 0 0 
>>>> 5 c are identified as a Unicode escape \u005c and translated to a 
>>>> backslash.
>>>>
>>>> The question is whether that backslash is then treated as:
>>>>
>>>> 1. a raw input character \ that is followed by seven more raw input 
>>>> characters \ \ u 0 0 5 d   For these *eight* raw input characters, 
>>>> there are three raw input character \'s in a row. Due to 
>>>> contiguous-\ counting, the third raw input character \ is eligible 
>>>> to begin a Unicode escape; the first and second pass through and you 
>>>> get \ \ ] which further translates within a string literal as \]
>>>>
>>>> or
>>>>
>>>> 2. something which is independent of the subsequent seven raw input 
>>>> characters \ \ u 0 0 5 d   For those *seven* subsequent raw input 
>>>> characters, there are two raw input character \'s in a row. Due to 
>>>> contiguous-\ counting, the second raw input character \ is not 
>>>> eligible to begin a Unicode escape, so all seven raw input 
>>>> characters pass through. You get (including the first "independent" 
>>>> backslash) \ \ \ u 0 0 5 d
>>>>
>>>>
>>>> The contiguous-\ counting is due to the fact that \\ is the escape 
>>>> sequence for backslash in a string literal, so we don't want too 
>>>> many raw \ input characters to "disappear" into Unicode escapes.
>>>>
>>>>
>>>> The JDK 15 behavior was #1. That looks correct to me. \ u 0 0 5 c 
>>>> becomes a raw input character \ that cannot serve as the opening 
>>>> backslash for an *immediate* Unicode escape (the classic JLS 3.3 
>>>> scenario of \u005cu005a) but that can serve as a raw input character 
>>>> for the purpose of skipping over \\ pairs (the purpose of 
>>>> contiguous-\ counting) in order for a *later* Unicode escape to be 
>>>> recognized (\u005d).
>>>>
>>>>> Does "how many other \ characters contiguously precede it" refer to
>>>>> preceding raw input characters, or does it refer to preceding
>>>>> characters after unicode escape processing is performed on them?
>>>>
>>>> Where JLS 3.3 says "translating the ASCII characters \u followed by 
>>>> four hexadecimal digits to the UTF-16 code unit (§3.1) for the 
>>>> indicated hexadecimal value", it really means "translating the ASCII 
>>>> characters \u followed by four hexadecimal digits to *a raw input 
>>>> character which denotes* the UTF-16 code unit (§3.1) for the 
>>>> indicated hexadecimal value".
>>>>
>>>> Thus, the later clause "for each raw input character that is a 
>>>> backslash \, input processing must consider how many other [raw 
>>>> input] \ characters contiguously precede it" can be seen more easily 
>>>> to include characters that result from Unicode escape processing.
>>>>
>>>> Alex
>>>>
>>>> On 6/21/2021 2:56 PM, Jim Laskey wrote:
>>>>> "\u005C" should have been treated as a backslash. Will check into it.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> — Jim
>>>>>
>>>>>> On Jun 21, 2021, at 6:28 PM, Liam Miller-Cushon <cushon at google.com> wrote:
>>>>>>
>>>>>> 
>>>>>> class T {
>>>>>>    public static void main(String[] args) {
>>>>>>      System.err.println("\u005C\\u005D");
>>>>>>    }
>>>>>> }
>>>>>>
>>>>>> Before JDK-8254073, this prints `\]`.
>>>>>>
>>>>>> After JDK-8254073, unicode escape processing results in 
>>>>>> `\\\u005D`, which results in an 'invalid escape' error for `\u`. 
>>>>>> Was that deliberate?
>>>>>>
>>>>>> JLS 3.3 says
>>>>>>
>>>>>>> for each raw input character that is a backslash \, input 
>>>>>>> processing must consider how many other \ characters contiguously 
>>>>>>> precede it, separating it from a non-\ character or the start of 
>>>>>>> the input stream. If this number is even, then the \ is eligible 
>>>>>>> to begin a Unicode escape; if the number is odd, then the \ is 
>>>>>>> not eligible to begin a Unicode escape.
>>>>>>
>>>>>> The difference is in whether `\u005C` (the unicode escape for `\`) 
>>>>>> counts as one of the `\` preceding a valid unicode escape.
>>>>>>
>>>>>> Does "how many other \ characters contiguously precede it" refer 
>>>>>> to preceding raw input characters, or does it refer to preceding 
>>>>>> characters after unicode escape processing is performed on them?
>>>>>>
>>>>>> JLS 3.3 also mentions that a "character produced by a Unicode 
>>>>>> escape does not participate in further Unicode escapes", but I'm 
>>>>>> not sure if that applies here, since in the pre-JDK-8254073 
>>>>>> interpretation the unicode-escaped backslash isn't really 
>>>>>> 'participating' in the second unicode escape.
>>>>>>
>>>>>> Thanks,
>>>>>> Liam
>