JDK-8254073, unicode escape preprocessing, and \u005C

Fri Jun 25 19:04:34 UTC 2021

I filed https://bugs.openjdk.java.net/browse/JDK-8269406 with some 
additional discussion about what the result of the first lexical 
translation step is really meant to be.

Please take a look if you are familiar with the three-step translation 
described in JLS 3.2, and care about how the input stream is processed.

Alex

On 6/22/2021 10:38 AM, Alex Buckley wrote:
> I am minded to extend the final note in JLS 3.3 to help people 
> understand the multi-level escape story in play when they experiment 
> with Unicode escapes. Perhaps it will also improve some javac error 
> messages or test cases. Let me know what you think of this:
> 
> -----
> For example, the input stream \u005cu005a results in the six characters 
> \ u 0 0 5 a, because 005c is the Unicode value for \. It does not result 
> in the character Z, which is Unicode character 005a, because the \ that 
> resulted from the \u005c is not interpreted as the start of a further 
> Unicode escape.
> 
> Note that \u005cu005a cannot be written in a string literal to denote 
> the six characters \ u 0 0 5 a. This is because the first two characters 
> resulting from translation, \ and u, are interpreted in a string literal 
> as an illegal escape sequence (3.10.7).
> 
> Fortunately, the rule about contiguous \ characters helps programmers to 
> craft input streams that denote Unicode escapes in a string literal. 
> Denoting the six characters \ u 0 0 5 a in a string literal simply 
> requires another \ to be written adjacent to the existing \, such as in 
> "Z is \\u005a". This works because the second \ in the input stream 
> \\u005a is not eligible, so the first \ and second \ are preserved as 
> raw input characters; they are subsequently interpreted in a string 
> literal as the escape sequence for a backslash, resulting in the desired 
> six characters \ u 0 0 5 a. Without the rule, the input stream \\u005a 
> would be translated as the raw input character \ followed by the Unicode 
> escape \u005a (Z), but \Z is an illegal escape sequence in a string 
> literal.
> 
> The rule also allows programmers to craft input streams that denote 
> escape sequences in a string literal. For example, the input stream 
> \\\u006e results in the three characters \ \ n because the third \ is 
> eligible and thus \u006e is translated to n, while the first \ and 
> second \ are preserved as raw input characters. The three characters \ \ 
> n are subsequently interpreted in a string literal as \ n which denotes 
> the escape sequence for a linefeed. (The input stream \\\u006e may also 
> be written as \u005c\u005c\u006e.)
> -----
> 
> Alex
> 
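The two string-literal outcomes described in the note above can be checked against a real compiler. What follows is a minimal sketch, not something posted in the thread; the class name `EscapeNoteDemo` and its method names are hypothetical. It deliberately avoids a Unicode-escaped backslash in the source, because how that case is counted is the open question in this discussion.

```java
// Hypothetical demo class; only the two string literals come from the note.
final class EscapeNoteDemo {

    // Two adjacent backslashes: the second one is not eligible to begin a
    // Unicode escape, so both survive as raw input and the string literal
    // reads them as the escape sequence for a single backslash.
    static String literalEscape() {
        return "Z is \\u005a";
    }

    // Three adjacent backslashes: the third one is eligible, so the escape
    // is translated to the letter n before the string literal is interpreted.
    static String escapedN() {
        return "\\\u006e";
    }

    public static void main(String[] args) {
        System.out.println(literalEscape());
        System.out.println(escapedN().length());
        System.out.println(escapedN().equals("\\n"));
    }
}
```

Compiled as written, this should print `Z is \u005a`, then `2`, then `true`: the second literal holds a backslash followed by the letter n, not a linefeed character.
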
> On 6/21/2021 4:41 PM, Alex Buckley wrote:
>> There's no question that the first six raw input characters \ u 0 0 5 
>> c are identified as a Unicode escape \u005c and translated to a 
>> backslash.
>>
>> The question is whether that backslash is then treated as:
>>
>> 1. a raw input character \ that is followed by seven more raw input 
>> characters \ \ u 0 0 5 d. For these *eight* raw input characters, 
>> there are three raw input character \'s in a row. Due to contiguous-\ 
>> counting, the third raw input character \ is eligible to begin a 
>> Unicode escape; the first and second pass through and you get \ \ ] 
>> which further translates within a string literal as \]
>>
>> or
>>
>> 2. something which is independent of the subsequent seven raw input 
>> characters \ \ u 0 0 5 d. For those *seven* subsequent raw input 
>> characters, there are two raw input character \'s in a row. Due to 
>> contiguous-\ counting, the second raw input character \ is not 
>> eligible to begin a Unicode escape, so all seven raw input characters 
>> pass through. You get (including the first "independent" backslash) \ 
>> \ \ u 0 0 5 d
>>
>>
>> The contiguous-\ counting is due to the fact that \\ is the escape 
>> sequence for backslash in a string literal, so we don't want too many 
>> raw input \ characters to "disappear" into Unicode escapes.
>>
>>
>> The JDK 15 behavior was #1. That looks correct to me. \ u 0 0 5 c 
>> becomes a raw input character \ that cannot serve as the opening 
>> backslash for an *immediate* Unicode escape (the classic JLS 3.3 
>> scenario of \u005cu005a) but that can serve as a raw input character 
>> for the purpose of skipping over \\ pairs (the purpose of contiguous-\ 
>> counting) in order for a *later* Unicode escape to be recognized 
>> (\u005d).
>>
>>> Does "how many other \ characters contiguously precede it" refer to
>>> preceding raw input characters, or does it refer to preceding
>>> characters after unicode escape processing is performed on them?
>>
>> Where JLS 3.3 says "translating the ASCII characters \u followed by 
>> four hexadecimal digits to the UTF-16 code unit (§3.1) for the 
>> indicated hexadecimal value", it really means "translating the ASCII 
>> characters \u followed by four hexadecimal digits to *a raw input 
>> character which denotes* the UTF-16 code unit (§3.1) for the indicated 
>> hexadecimal value".
>>
>> Thus, the later clause "for each raw input character that is a 
>> backslash \, input processing must consider how many other [raw input] 
>> \ characters contiguously precede it" can be seen more easily to 
>> include characters that result from Unicode escape processing.
>>
>> Alex
>>
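The two readings numbered 1 and 2 above can be compared side by side with a small simulation of the first translation step. This is only a sketch under stated assumptions, not javac's implementation: the class `UnicodeEscapeSketch`, the method `translate`, and the `countDecoded` flag are all hypothetical names. As a simplification, an eligible backslash that is followed by `u` but not by four hexadecimal digits is passed through unchanged, where a real compiler would report an error.

```java
// Hypothetical sketch of Unicode escape translation (JLS 3.3).
final class UnicodeEscapeSketch {

    // countDecoded == true  models reading 1: a backslash produced by a
    //                       Unicode escape counts as a preceding backslash.
    // countDecoded == false models reading 2: that backslash is independent
    //                       of the characters that follow it.
    static String translate(String raw, boolean countDecoded) {
        StringBuilder out = new StringBuilder();
        int run = 0; // contiguous backslashes immediately before position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '\\' && run % 2 == 0) {
                // Eligible backslash: look for one or more u characters
                // followed by exactly four hexadecimal digits.
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') {
                    j++;
                }
                if (j > i + 1 && hasFourHexDigits(raw, j)) {
                    char decoded = (char) Integer.parseInt(raw.substring(j, j + 4), 16);
                    out.append(decoded);
                    // The decoded character is never rescanned as the start
                    // of another escape; the flag only affects the counting.
                    run = (decoded == '\\' && countDecoded) ? run + 1 : 0;
                    i = j + 4;
                    continue;
                }
            }
            out.append(c);
            run = (c == '\\') ? run + 1 : 0;
            i++;
        }
        return out.toString();
    }

    static boolean hasFourHexDigits(String s, int from) {
        if (from + 4 > s.length()) {
            return false;
        }
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The raw input from this thread, written with doubled backslashes
        // so that this source file itself contains no Unicode escapes.
        String raw = "\\u005C\\\\u005D";
        System.out.println(translate(raw, true));
        System.out.println(translate(raw, false));
    }
}
```

Under these assumptions the first call yields the three characters `\ \ ]` and the second yields eight characters, three backslashes followed by `u005D`, matching outcomes 1 and 2 above.
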
>> On 6/21/2021 2:56 PM, Jim Laskey wrote:
>>> "\u005C" should have been treated as a backslash. Will check into it.
>>>
>>> Cheers,
>>>
>>> — Jim
>>>
>>>> On Jun 21, 2021, at 6:28 PM, Liam Miller-Cushon <cushon at google.com> 
>>>> wrote:
>>>>
>>>> 
>>>> class T {
>>>>    public static void main(String[] args) {
>>>>      System.err.println("\u005C\\u005D");
>>>>    }
>>>> }
>>>>
>>>> Before JDK-8254073, this prints `\]`.
>>>>
>>>> After JDK-8254073, unicode escape processing results in `\\\u005D`, 
>>>> which results in an 'invalid escape' error for `\u`. Was that 
>>>> deliberate?
>>>>
>>>> JLS 3.3 says
>>>>
>>>>> for each raw input character that is a backslash \, input 
>>>>> processing must consider how many other \ characters contiguously 
>>>>> precede it, separating it from a non-\ character or the start of 
>>>>> the input stream. If this number is even, then the \ is eligible to 
>>>>> begin a Unicode escape; if the number is odd, then the \ is not 
>>>>> eligible to begin a Unicode escape.
>>>>
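The parity rule quoted above can be written down directly. The helper below is a hypothetical sketch (`BackslashEligibility` and `eligibleBackslashes` are illustrative names) that counts only backslashes literally present in the input. That is just one of the two possible readings of the rule, which is the point of the question that follows.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the even/odd counting rule.
final class BackslashEligibility {

    // Returns the indices of the backslashes that are eligible to begin a
    // Unicode escape, i.e. those preceded by an even number of contiguous
    // backslashes. Only characters literally present in raw are counted.
    static List<Integer> eligibleBackslashes(String raw) {
        List<Integer> eligible = new ArrayList<>();
        int preceding = 0; // length of the backslash run ending just before i
        for (int i = 0; i < raw.length(); i++) {
            if (raw.charAt(i) == '\\') {
                if (preceding % 2 == 0) {
                    eligible.add(i);
                }
                preceding++;
            } else {
                preceding = 0; // any other character ends the run
            }
        }
        return eligible;
    }
}
```

For the thirteen-character input in the example program, this literal counting marks the backslashes at indices 0 and 6 as eligible and the one at index 7, directly before `u005D`, as not eligible.
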
>>>> The difference is in whether `\u005C` (the unicode escape for `\`) 
>>>> counts as one of the `\` preceding a valid unicode escape.
>>>>
>>>> Does "how many other \ characters contiguously precede it" refer to 
>>>> preceding raw input characters, or does it refer to preceding 
>>>> characters after unicode escape processing is performed on them?
>>>>
>>>> JLS 3.3 also mentions that a "character produced by a Unicode escape 
>>>> does not participate in further Unicode escapes", but I'm not sure 
>>>> if that applies here, since in the pre-JDK-8254073 interpretation 
>>>> the unicode-escaped backslash isn't really 'participating' in the 
>>>> second unicode escape.
>>>>
>>>> Thanks,
>>>> Liam