JDK-8254073, unicode escape preprocessing, and \u005C
John Rose
john.r.rose at oracle.com
Fri Jun 25 19:29:44 UTC 2021
I added a comment. I think it’s useful (in a non-normative way)
for people to see some representative puzzlers. This will, of course,
further motivate a warning from javac.
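
As a minimal sketch of two such puzzlers, both taken from the draft note quoted below, the following Java class compiles on its own. The class and method names are hypothetical and exist only for illustration. Each method returns a string literal whose contents depend on the contiguous-backslash rule in JLS 3.3.

```java
final class EscapePuzzlers {
    // Two raw backslashes, then u005a. The second backslash is preceded by
    // an odd number of backslashes, so no Unicode escape is recognized and
    // the literal ends with a backslash followed by the text u005a.
    static String escapedUnicodeText() {
        return "Z is \\u005a";
    }

    // Three raw backslashes, then u006e. The third backslash is preceded by
    // an even number of backslashes, so the escape is translated to n.
    // The literal is then an escaped backslash followed by n.
    static String escapedLinefeedText() {
        return "\\\u006e";
    }

    public static void main(String[] args) {
        System.out.println(escapedUnicodeText());
        System.out.println(escapedLinefeedText());
        // Two characters: a backslash and the letter n, not a linefeed.
        System.out.println(escapedLinefeedText().length());
    }
}
```

Neither literal places a Unicode-escaped backslash before another escape, so neither depends on the question raised in the original report.
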
> On Jun 25, 2021, at 12:04 PM, Alex Buckley <alex.buckley at oracle.com> wrote:
>
> I filed https://bugs.openjdk.java.net/browse/JDK-8269406 with some additional discussion about what the result of the first lexical translation step is really meant to be.
>
> Please take a look if you are familiar with the three-step translation described in JLS 3.2, and care about how the input stream is processed.
>
> Alex
>
> On 6/22/2021 10:38 AM, Alex Buckley wrote:
>> I am minded to extend the final note in JLS 3.3 to help people understand the multi-level escape story in play when they experiment with Unicode escapes. Perhaps it will also improve some javac error messages or test cases. Let me know what you think of this:
>> -----
>> For example, the input stream \u005cu005a results in the six characters \ u 0 0 5 a, because 005c is the Unicode value for \. It does not result in the character Z, which is Unicode character 005a, because the \ that resulted from the \u005c is not interpreted as the start of a further Unicode escape.
>> Note that \u005cu005a cannot be written in a string literal to denote the six characters \ u 0 0 5 a. This is because the first two characters resulting from translation, \ and u, are interpreted in a string literal as an illegal escape sequence (3.10.7).
>> Fortunately, the rule about contiguous \ characters helps programmers to craft input streams that denote Unicode escapes in a string literal. Denoting the six characters \ u 0 0 5 a in a string literal simply requires another \ to be written adjacent to the existing \, such as in "Z is \\u005a". This works because the second \ in the input stream \\u005a is not eligible, so the first \ and second \ are preserved as raw input characters; they are subsequently interpreted in a string literal as the escape sequence for a backslash, resulting in the desired six characters \ u 0 0 5 a. Without the rule, the input stream \\u005a would be translated as the raw input character \ followed by the Unicode escape \u005a (Z), but \Z is an illegal escape sequence in a string literal.
>> The rule also allows programmers to craft input streams that denote escape sequences in a string literal. For example, the input stream \\\u006e results in the three characters \ \ n because the third \ is eligible and thus \u006e is translated to n, while the first \ and second \ are preserved as raw input characters. The three characters \ \ n are subsequently interpreted in a string literal as \ n which denotes the escape sequence for a linefeed. (The input stream \\\u006e may also be written as \u005c\u005c\u006e.)
>> -----
>> Alex
>> On 6/21/2021 4:41 PM, Alex Buckley wrote:
>>> There's no question that the first six raw input characters \ u 0 0 5 c are identified as a Unicode escape \u005c and translated to a backslash.
>>>
>>> The question is whether that backslash is then treated as:
>>>
>>> 1. a raw input character \ that is followed by seven more raw input characters \ \ u 0 0 5 d. For these *eight* raw input characters, there are three raw input character \'s in a row. Due to contiguous-\ counting, the third raw input character \ is eligible to begin a Unicode escape; the first and second pass through and you get \ \ ], which further translates within a string literal as \]
>>>
>>> or
>>>
>>> 2. something which is independent of the subsequent seven raw input characters \ \ u 0 0 5 d. For those *seven* subsequent raw input characters, there are two raw input character \'s in a row. Due to contiguous-\ counting, the second raw input character \ is not eligible to begin a Unicode escape, so all seven raw input characters pass through. You get (including the first "independent" backslash) \ \ \ u 0 0 5 d
>>>
>>>
>>> The contiguous-\ counting is due to the fact that \\ is the escape sequence for backslash in a string literal, so we don't want too many raw \ input characters to "disappear" into Unicode escapes.
>>>
>>>
>>> The JDK 15 behavior was #1. That looks correct to me. \ u 0 0 5 c becomes a raw input character \ that cannot serve as the opening backslash for an *immediate* Unicode escape (the classic JLS 3.3 scenario of \u005cu005a) but that can serve as a raw input character for the purpose of skipping over \\ pairs (the purpose of contiguous-\ counting) in order for a *later* Unicode escape to be recognized (\u005d).
>>>
>>>> Does "how many other \ characters contiguously precede it" refer to
>>>> preceding raw input characters, or does it refer to preceding
>>>> characters after unicode escape processing is performed on them?
>>>
>>> Where JLS 3.3 says "translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) for the indicated hexadecimal value", it really means "translating the ASCII characters \u followed by four hexadecimal digits to *a raw input character which denotes* the UTF-16 code unit (§3.1) for the indicated hexadecimal value".
>>>
>>> Thus, the later clause "for each raw input character that is a backslash \, input processing must consider how many other [raw input] \ characters contiguously precede it" can be seen more easily to include characters that result from Unicode escape processing.
>>>
>>> Alex
>>>
>>> On 6/21/2021 2:56 PM, Jim Laskey wrote:
>>>> "\u005C" should have been treated as a backslash. Will check into it.
>>>>
>>>> Cheers,
>>>>
>>>> — Jim
>>>>
>>>>
>>>>
>>>>> On Jun 21, 2021, at 6:28 PM, Liam Miller-Cushon <cushon at google.com> wrote:
>>>>>
>>>>>
>>>>> class T {
>>>>>   public static void main(String[] args) {
>>>>>     System.err.println("\u005C\\u005D");
>>>>>   }
>>>>> }
>>>>>
>>>>> Before JDK-8254073, this prints `\]`.
>>>>>
>>>>> After JDK-8254073, unicode escape processing results in `\\\u005D`, which results in an 'invalid escape' error for `\u`. Was that deliberate?
>>>>>
>>>>> JLS 3.3 says
>>>>>
>>>>>> for each raw input character that is a backslash \, input processing must consider how many other \ characters contiguously precede it, separating it from a non-\ character or the start of the input stream. If this number is even, then the \ is eligible to begin a Unicode escape; if the number is odd, then the \ is not eligible to begin a Unicode escape.
>>>>>
>>>>> The difference is in whether `\u005C` (the unicode escape for `\`) counts as one of the `\` preceding a valid unicode escape.
>>>>>
>>>>> Does "how many other \ characters contiguously precede it" refer to preceding raw input characters, or does it refer to preceding characters after unicode escape processing is performed on them?
>>>>>
>>>>> JLS 3.3 also mentions that a "character produced by a Unicode escape does not participate in further Unicode escapes", but I'm not sure if that applies here, since in the pre-JDK-8254073 interpretation the unicode-escaped backslash isn't really 'participating' in the second unicode escape.
>>>>>
>>>>> Thanks,
>>>>> Liam
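
The two readings debated in the thread can be sketched as a small simulation of the first lexical translation step. This is an illustration under stated assumptions, not javac's implementation: the class name, the `translate` method, and the `countTranslated` flag are all hypothetical, and a malformed escape is passed through unchanged here, whereas a real compiler would reject it. With `countTranslated` set to true, a backslash produced by a Unicode escape counts toward the contiguous-backslash run (reading 1); with false, it does not (reading 2).

```java
final class UnicodeEscapeSketch {
    // Built from a char literal so that no comment or string in this file
    // contains a backslash directly followed by the letter u.
    private static final char BS = '\\';

    // Returns the index just past the escape starting at i, or -1 if the
    // characters at i are not: backslash, one or more u, four hex digits.
    private static int escapeEnd(String raw, int i) {
        int j = i + 1;
        if (j >= raw.length() || raw.charAt(j) != 'u') return -1;
        while (j < raw.length() && raw.charAt(j) == 'u') j++;
        if (j + 4 > raw.length()) return -1;
        for (int k = j; k < j + 4; k++) {
            if (Character.digit(raw.charAt(k), 16) < 0) return -1;
        }
        return j + 4;
    }

    static String translate(String raw, boolean countTranslated) {
        StringBuilder out = new StringBuilder();
        int run = 0; // contiguous backslashes immediately before position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            // Only a backslash preceded by an even run is eligible.
            int end = (c == BS && run % 2 == 0) ? escapeEnd(raw, i) : -1;
            if (end >= 0) {
                c = (char) Integer.parseInt(raw.substring(end - 4, end), 16);
                i = end;
                // The translated character is never rescanned as the start
                // of an escape; the flag decides whether it extends the run.
                run = (c == BS && countTranslated) ? run + 1 : 0;
            } else {
                i++;
                run = (c == BS) ? run + 1 : 0;
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Raw input from the original report: the escape for a backslash,
        // then two raw backslashes, then u005D.
        String raw = BS + "u005C" + BS + BS + "u005D";
        // Reading 1: two backslashes followed by a closing bracket.
        System.out.println(translate(raw, true));
        // Reading 2: three backslashes followed by the text u005D.
        System.out.println(translate(raw, false));
    }
}
```

Under reading 1 the result is `\\]`, which a string literal then interprets as `\]`, the output reported before JDK-8254073. Under reading 2 the result is `\\\u005D`, where the third backslash and the `u` form the illegal escape sequence reported after the change.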