JDK-8254073, unicode escape preprocessing, and \u005C

Mon Jun 21 23:41:41 UTC 2021

There's no question that the first six raw input characters \ u 0 0 5 c 
are identified as a Unicode escape \u005c and translated to a backslash.

The question is whether that backslash is then treated as:

1. a raw input character \ that is followed by seven more raw input 
characters \ \ u 0 0 5 d. For these *eight* raw input characters, there 
are three raw input character \'s in a row. Due to contiguous-\ 
counting, the third raw input character \ is eligible to begin a Unicode 
escape; the first and second pass through and you get \ \ ] which 
further translates within a string literal as \]

or

2. something which is independent of the subsequent seven raw input 
characters \ \ u 0 0 5 d. For those *seven* subsequent raw input 
characters, there are two raw input character \'s in a row. Due to 
contiguous-\ counting, the second raw input character \ is not eligible 
to begin a Unicode escape, so all seven raw input characters pass 
through. You get (including the first "independent" backslash) \ \ \ u 0 
0 5 d

The contiguous-\ counting is due to the fact that \\ is the escape 
sequence for backslash in a string literal, so we don't want too many 
raw input \ characters to "disappear" into Unicode escapes.

The JDK 15 behavior was #1. That looks correct to me. \ u 0 0 5 c 
becomes a raw input character \ that cannot serve as the opening 
backslash for an *immediate* Unicode escape (the classic JLS 3.3 
scenario of \u005cu005a) but that can serve as a raw input character for 
the purpose of skipping over \\ pairs (the purpose of contiguous-\ 
counting) in order for a *later* Unicode escape to be recognized (\u005d).
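
To make the two readings concrete, here is a minimal sketch of Unicode 
escape translation with a switch for whether a backslash produced by an 
escape counts toward the contiguous-\ run. It is not javac's code: the 
class name, the translate method, and the countEscaped flag are all 
made up for illustration, and malformed escapes are passed through 
rather than rejected.

```java
// Sketch only: models the two readings of JLS 3.3 discussed above.
// It is not javac's implementation; all names are hypothetical.
class UnicodeEscapeSketch {

    // countEscaped = true  : reading #1, a backslash produced by an escape
    //                        counts toward the contiguous-backslash run.
    // countEscaped = false : reading #2, such a backslash is independent
    //                        and the run restarts after it.
    static String translate(String raw, boolean countEscaped) {
        StringBuilder out = new StringBuilder();
        int run = 0; // contiguous backslashes immediately before position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c != '\\') {
                out.append(c);
                run = 0;
                i++;
                continue;
            }
            if (run % 2 == 0) {
                // Eligible backslash: look for one or more u, then four hex digits.
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') {
                    j++;
                }
                if (j > i + 1 && hasFourHexDigits(raw, j)) {
                    char decoded = (char) Integer.parseInt(raw.substring(j, j + 4), 16);
                    out.append(decoded);
                    // The decoded character is never rescanned, so it cannot
                    // open an immediate escape under either reading.
                    run = (decoded == '\\' && countEscaped) ? run + 1 : 0;
                    i = j + 4;
                    continue;
                }
                // Simplification: a real compiler rejects an eligible backslash
                // followed by u but not by four hex digits; here it falls through.
            }
            out.append('\\');
            run++;
            i++;
        }
        return out.toString();
    }

    static boolean hasFourHexDigits(String s, int from) {
        if (from + 4 > s.length()) {
            return false;
        }
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Raw input between the quotes of the string literal in the report,
        // written with every backslash doubled so this file itself contains
        // no Unicode escapes.
        String raw = "\\u005C\\\\u005D";
        System.out.println("raw input: " + raw);
        System.out.println("reading 1: " + translate(raw, true));
        System.out.println("reading 2: " + translate(raw, false));
    }
}
```

Under these assumptions, reading 1 yields \ \ ] and reading 2 yields 
\ \ \ u 0 0 5 d for the input above, while the JLS 3.3 input 
\ u 0 0 5 c u 0 0 5 a yields \ u 0 0 5 a under either reading, since the 
decoded backslash is never rescanned.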

> Does "how many other \ characters contiguously precede it" refer to
> preceding raw input characters, or does it refer to preceding
> characters after unicode escape processing is performed on them?

Where JLS 3.3 says "translating the ASCII characters \u followed by four 
hexadecimal digits to the UTF-16 code unit (§3.1) for the indicated 
hexadecimal value", it really means "translating the ASCII characters \u 
followed by four hexadecimal digits to *a raw input character which 
denotes* the UTF-16 code unit (§3.1) for the indicated hexadecimal value".

Thus, the later clause "for each raw input character that is a backslash 
\, input processing must consider how many other [raw input] \ 
characters contiguously precede it" can be seen more easily to include 
characters that result from Unicode escape processing.

Alex

On 6/21/2021 2:56 PM, Jim Laskey wrote:
> "\u005C" should have been treated as a backslash. Will check into it.
> 
> Cheers,
> 
> — Jim
> 
>> On Jun 21, 2021, at 6:28 PM, Liam Miller-Cushon <cushon at google.com> wrote:
>>
>> 
>> class T {
>>    public static void main(String[] args) {
>>      System.err.println("\u005C\\u005D");
>>    }
>> }
>>
>> Before JDK-8254073, this prints `\]`.
>>
>> After JDK-8254073, unicode escape processing results in `\\\u005D`, which results in an 'invalid escape' error for `\u`. Was that deliberate?
>>
>> JLS 3.3 says
>>
>>> for each raw input character that is a backslash \, input processing must consider how many other \ characters contiguously precede it, separating it from a non-\ character or the start of the input stream. If this number is even, then the \ is eligible to begin a Unicode escape; if the number is odd, then the \ is not eligible to begin a Unicode escape.
>>
>> The difference is in whether `\u005C` (the unicode escape for `\`) counts as one of the `\` preceding a valid unicode escape.
>>
>> Does "how many other \ characters contiguously precede it" refer to preceding raw input characters, or does it refer to preceding characters after unicode escape processing is performed on them?
>>
>> JLS 3.3 also mentions that a "character produced by a Unicode escape does not participate in further Unicode escapes", but I'm not sure if that applies here, since in the pre-JDK-8254073 interpretation the unicode-escaped backslash isn't really 'participating' in the second unicode escape.
>>
>> Thanks,
>> Liam