JDK-8254073, unicode escape preprocessing, and \u005C

Tue Jun 22 17:38:08 UTC 2021

I am minded to extend the final note in JLS 3.3 to help people 
understand the multi-level escape story in play when they experiment 
with Unicode escapes. Perhaps it will also improve some javac error 
messages or test cases. Let me know what you think of this:

-----
For example, the input stream \u005cu005a results in the six characters 
\ u 0 0 5 a, because 005c is the Unicode value for \. It does not result 
in the character Z, which is Unicode character 005a, because the \ that 
resulted from the \u005c is not interpreted as the start of a further 
Unicode escape.

Note that \u005cu005a cannot be written in a string literal to denote 
the six characters \ u 0 0 5 a. This is because the first two characters 
resulting from translation, \ and u, are interpreted in a string literal 
as an illegal escape sequence (3.10.7).

Fortunately, the rule about contiguous \ characters helps programmers to 
craft input streams that denote Unicode escapes in a string literal. 
Denoting the six characters \ u 0 0 5 a in a string literal simply 
requires another \ to be written adjacent to the existing \, such as in 
"Z is \\u005a". This works because the second \ in the input stream 
\\u005a is not eligible to begin a Unicode escape, so the first \ and 
second \ are preserved as raw input characters; they are subsequently 
interpreted in a string literal as the escape sequence for a backslash, 
resulting in the desired six characters \ u 0 0 5 a. Without the rule, 
the input stream \\u005a would be translated as the raw input character 
\ followed by the Unicode escape \u005a (Z), but \Z is an illegal escape 
sequence in a string literal.

The rule also allows programmers to craft input streams that denote 
escape sequences in a string literal. For example, the input stream 
\\\u006e results in the three characters \ \ n because the third \ is 
eligible and thus \u006e is translated to n, while the first \ and 
second \ are preserved as raw input characters. The three characters \ \ 
n are subsequently interpreted in a string literal as \ n which denotes 
the escape sequence for a linefeed. (The input stream \\\u006e may also 
be written as \u005c\u005c\u006e.)
-----
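
To make the examples concrete, here is a rough sketch of the translation 
step as a small Java program. It is only a model, not javac's 
implementation: the names UnicodeEscapeSketch, translate and escapeEnd 
are made up for illustration, malformed escapes are passed through 
rather than rejected, and string literal escape sequences are not 
interpreted. It assumes that a \ produced by a Unicode escape counts 
toward the contiguous run of \ characters but does not stop a Unicode 
escape from beginning immediately after it, which is what the 
\u005c\u005c\u006e example seems to require.

```java
class UnicodeEscapeSketch {

    /** Translates Unicode escapes in raw input under the assumptions stated above. */
    static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        boolean unpairedBackslash = false; // previous char is a backslash still waiting for a partner
        boolean fromEscape = false;        // previous char was produced by a Unicode escape
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c != '\\') {
                out.append(c);
                unpairedBackslash = false;
                fromEscape = false;
                i++;
                continue;
            }
            // Assumption: a backslash may begin an escape unless it closes a pair
            // that was opened by a raw (untranslated) backslash.
            boolean eligible = !unpairedBackslash || fromEscape;
            int end = eligible ? escapeEnd(raw, i) : -1;
            if (end < 0) {
                // Preserved as a raw backslash; it either opens or closes a pair.
                out.append('\\');
                unpairedBackslash = !unpairedBackslash;
                fromEscape = false;
                i++;
            } else {
                // The last four characters of the escape are its hex digits.
                char decoded = (char) Integer.parseInt(raw.substring(end - 4, end), 16);
                out.append(decoded);
                unpairedBackslash = decoded == '\\' && !unpairedBackslash;
                fromEscape = true;
                i = end;
            }
        }
        return out.toString();
    }

    /** Returns the index just past a well-formed escape starting at i, or -1 if there is none. */
    static int escapeEnd(String raw, int i) {
        int j = i + 1;
        while (j < raw.length() && raw.charAt(j) == 'u') j++; // one or more 'u' markers
        if (j == i + 1 || j + 4 > raw.length()) return -1;
        for (int k = j; k < j + 4; k++) {
            if (Character.digit(raw.charAt(k), 16) < 0) return -1;
        }
        return j + 4;
    }

    public static void main(String[] args) {
        // Each backslash is doubled in these literals so that javac leaves
        // them alone; the strings hold the raw inputs discussed in the thread.
        String[] inputs = {
            "\\u005cu005a", "\\\\u005a", "\\\\\\u006e",
            "\\u005c\\u005c\\u006e", "\\u005C\\\\u005D"
        };
        for (String in : inputs) {
            System.out.println(in + " -> " + translate(in));
        }
    }
}
```

Under those assumptions the five inputs come out as \u005a, \\u005a, 
\\n, \\n and \\] respectively, i.e. the last one matches behavior #1 
from my earlier mail.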

Alex

On 6/21/2021 4:41 PM, Alex Buckley wrote:
> There's no question that the first six raw input characters \ u 0 0 5 c 
> are identified as a Unicode escape \u005c and translated to a backslash.
> 
> The question is whether that backslash is then treated as:
> 
> 1. a raw input character \ that is followed by seven more raw input 
> characters \ \ u 0 0 5 d. For these *eight* raw input characters, there 
> are three raw input character \'s in a row. Due to contiguous-\ 
> counting, the third raw input character \ is eligible to begin a Unicode 
> escape; the first and second pass through and you get \ \ ] which 
> further translates within a string literal as \]
> 
> or
> 
> 2. something which is independent of the subsequent seven raw input 
> characters \ \ u 0 0 5 d. For those *seven* subsequent raw input 
> characters, there are two raw input character \'s in a row. Due to 
> contiguous-\ counting, the second raw input character \ is not eligible 
> to begin a Unicode escape, so all seven raw input characters pass 
> through. You get (including the first "independent" backslash) \ \ \ u 0 
> 0 5 d
> 
> 
> The contiguous-\ counting is due to the fact that \\ is the escape 
> sequence for backslash in a string literal, so we don't want too many 
> raw input \ characters to "disappear" into Unicode escapes.
> 
> 
> The JDK 15 behavior was #1. That looks correct to me. \ u 0 0 5 c 
> becomes a raw input character \ that cannot serve as the opening 
> backslash for an *immediate* Unicode escape (the classic JLS 3.3 
> scenario of \u005cu005a) but that can serve as a raw input character for 
> the purpose of skipping over \\ pairs (the purpose of contiguous-\ 
> counting) in order for a *later* Unicode escape to be recognized (\u005d).
> 
>> Does "how many other \ characters contiguously precede it" refer to
>> preceding raw input characters, or does it refer to preceding
>> characters after unicode escape processing is performed on them?
> 
> Where JLS 3.3 says "translating the ASCII characters \u followed by four 
> hexadecimal digits to the UTF-16 code unit (§3.1) for the indicated 
> hexadecimal value", it really means "translating the ASCII characters \u 
> followed by four hexadecimal digits to *a raw input character which 
> denotes* the UTF-16 code unit (§3.1) for the indicated hexadecimal value".
> 
> Thus, the later clause "for each raw input character that is a backslash 
> \, input processing must consider how many other [raw input] \ 
> characters contiguously precede it" can be seen more easily to include 
> characters that result from Unicode escape processing.
> 
> Alex
> 
> On 6/21/2021 2:56 PM, Jim Laskey wrote:
>> "\u005C" should have been treated as a backslash. Will check into it.
>>
>> Cheers,
>>
>> — Jim
>>
>>> On Jun 21, 2021, at 6:28 PM, Liam Miller-Cushon <cushon at google.com> 
>>> wrote:
>>>
>>> 
>>> class T {
>>>    public static void main(String[] args) {
>>>      System.err.println("\u005C\\u005D");
>>>    }
>>> }
>>>
>>> Before JDK-8254073, this prints `\]`.
>>>
>>> After JDK-8254073, unicode escape processing results in `\\\u005D`, 
>>> which results in an 'invalid escape' error for `\u`. Was that 
>>> deliberate?
>>>
>>> JLS 3.3 says
>>>
>>>> for each raw input character that is a backslash \, input processing 
>>>> must consider how many other \ characters contiguously precede it, 
>>>> separating it from a non-\ character or the start of the input 
>>>> stream. If this number is even, then the \ is eligible to begin a 
>>>> Unicode escape; if the number is odd, then the \ is not eligible to 
>>>> begin a Unicode escape.
>>>
>>> The difference is in whether `\u005C` (the unicode escape for `\`) 
>>> counts as one of the `\` preceding a valid unicode escape.
>>>
>>> Does "how many other \ characters contiguously precede it" refer to 
>>> preceding raw input characters, or does it refer to preceding 
>>> characters after unicode escape processing is performed on them?
>>>
>>> JLS 3.3 also mentions that a "character produced by a Unicode escape 
>>> does not participate in further Unicode escapes", but I'm not sure if 
>>> that applies here, since in the pre-JDK-8254073 interpretation the 
>>> unicode-escaped backslash isn't really 'participating' in the second 
>>> unicode escape.
>>>
>>> Thanks,
>>> Liam