JDK-8254073, unicode escape preprocessing, and \u005C

Thu Jul 15 17:43:50 UTC 2021

The fallout from the discussion here and via the CSR (https://bugs.openjdk.java.net/browse/JDK-8269290) is that we have two choices (noting that today is RD2):

1) Proceed with the proposed bug fix and strengthen the existing interpretation in the JLS.

2) Withdraw the CSR, fix the bug to replicate the behaviour seen prior to JDK 16 and rework the JLS to reflect that behaviour.

At this point, Alex and I feel the correct choice is 2). This choice has the least risk and is likely the least disruptive.

If we see no objections here, we will move forward early next week.

— Jim

On Jul 6, 2021, at 8:26 PM, Alex Buckley <alex.buckley at oracle.com> wrote:

(I slightly reordered the table in JDK-8269290 in support of the following exposition.)

Since time immemorial, X = "\u005C\\u005D" has printed `\]` on the grounds that the six opening characters \ u 0 0 5 C form a backslash for the purpose of counting how many backslash characters contiguously precede the backslash in the final six characters \ u 0 0 5 D. (Two, making the backslash in \ u 0 0 5 D eligible to begin a Unicode escape.)

Given Y = "\u005C\u005C\u005D", it's consistent for the six opening characters \ u 0 0 5 C to again form a backslash for the purpose of counting how many backslash characters contiguously precede the backslash in the middle six characters \ u 0 0 5 C. Thus, "\u005C\u005C..." is treated the same as "\\u005C...".

I acknowledge this is an incompatible change, but consider the alternative. If the six opening characters *didn't* contribute a backslash to the count for \ u 0 0 5 C in the Y case, then the same six opening characters wouldn't contribute a backslash to the count for \ u 0 0 5 D in the X case. (In this alternative universe, people take the rule "The character produced by a Unicode escape does not participate in further Unicode escapes." literally.) Thus, in the X case, there would only be one backslash, denoted by the ASCII character, preceding the final six characters \ u 0 0 5 D  ==>  the \ in \ u 0 0 5 D would not be eligible  ==>  X would lex as \ \ \ u 0 0 5 D and print as `\\u005D` which is plain wrong.

(Maybe there is some application of the "longest possible translation" rule from 3.2 that lets the same six opening characters become a backslash-that-counts in X but not become a backslash-that-counts in Y. However, I do not know how to describe that application.)

Here's another test case for the CSR. JDK 15 does this:

jshell> System.out.println("\\u005D");
\u005D

jshell> System.out.println("\u005C\u005D");
|  Error:
|  illegal escape character
|  System.out.println("\u005C\u005D");
|                                 ^

With my consistency-first approach, the Z = "\u005C\u005D" case is legal, which seems far more reasonable than illegal. The six opening characters \ u 0 0 5 C form a backslash for the purpose of counting how many backslash characters contiguously precede the backslash in the final six characters \ u 0 0 5 D. (One, meaning the backslash in \ u 0 0 5 D is not eligible to begin a Unicode escape.) The result would be \ \ u 0 0 5 D which would print as `\u005D`.
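To make the "would print as" step concrete, here is a minimal Java sketch of the later string-literal escape interpretation (3.10.7), applied to characters that have already been through Unicode escape translation. The class name LiteralEscapeSketch and the method interpret are hypothetical, and only a small subset of escape sequences is handled; this is not how javac is implemented.

```java
public class LiteralEscapeSketch {
    // Hypothetical helper: interprets a small subset of string-literal escape
    // sequences in characters that remain after Unicode escape translation.
    static String interpret(String chars) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i + 1 >= chars.length()) {
                throw new IllegalArgumentException("dangling backslash");
            }
            char next = chars.charAt(++i);
            switch (next) {
                case '\\': out.append('\\'); break;
                case 'n':  out.append('\n'); break;
                case 't':  out.append('\t'); break;
                case '"':  out.append('"');  break;
                default:
                    throw new IllegalArgumentException("illegal escape character: " + next);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // X after translation: two backslashes, then a closing bracket.
        System.out.println(interpret("\\\\]"));
        // Z after translation under the consistency-first reading:
        // two backslashes, then the letter u and four hex digits.
        System.out.println(interpret("\\\\u005D"));
        // Z if both escapes were translated: one backslash, then a closing bracket.
        try {
            interpret("\\]");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions the first call yields \] and the second yields the six characters \ u 0 0 5 D, while the third is rejected as an illegal escape, which is in line with the jshell error shown above.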

Net net, I favor the correct fix -- and lots more test cases in the JCK.

Alex

On 7/2/2021 5:47 AM, Jim Laskey wrote:
Just so it doesn't look like I went rogue with the bug fix (https://bugs.openjdk.java.net/browse/JDK-8269290), I would like a consensus ruling on which bug fix I should use:

correct fix:
    interpretAsPerJLS();

faithful fix:
    if (sourceLevel <= 15)
        interpretOldWay();
    else
        interpretAsPerJLS();

status quo fix:
    interpretOldWay();

I'm assuming correct fix, but others may have different assumptions.

Cheers,

-- Jim

On Jun 25, 2021, at 4:04 PM, Alex Buckley <alex.buckley at oracle.com> wrote:

I filed https://bugs.openjdk.java.net/browse/JDK-8269406 with some additional discussion about what the result of the first lexical translation step is really meant to be.

Please take a look if you are familiar with the three-step translation described in JLS 3.2, and care about how the input stream is processed.

Alex

On 6/22/2021 10:38 AM, Alex Buckley wrote:
I am minded to extend the final note in JLS 3.3 to help people understand the multi-level escape story in play when they experiment with Unicode escapes. Perhaps it will also improve some javac error messages or test cases. Let me know what you think of this:
-----
For example, the input stream \u005cu005a results in the six characters \ u 0 0 5 a, because 005c is the Unicode value for \. It does not result in the character Z, which is Unicode character 005a, because the \ that resulted from the \u005c is not interpreted as the start of a further Unicode escape.
Note that \u005cu005a cannot be written in a string literal to denote the six characters \ u 0 0 5 a. This is because the first two characters resulting from translation, \ and u, are interpreted in a string literal as an illegal escape sequence (3.10.7).
Fortunately, the rule about contiguous \ characters helps programmers to craft input streams that denote Unicode escapes in a string literal. Denoting the six characters \ u 0 0 5 a in a string literal simply requires another \ to be written adjacent to the existing \, such as in "Z is \\u005a". This works because the second \ in the input stream \\u005a is not eligible, so the first \ and second \ are preserved as raw input characters; they are subsequently interpreted in a string literal as the escape sequence for a backslash, resulting in the desired six characters \ u 0 0 5 a. Without the rule, the input stream \\u005a would be translated as the raw input character \ followed by the Unicode escape \u005a (Z), but \Z is an illegal escape sequence in a string literal.
The rule also allows programmers to craft input streams that denote escape sequences in a string literal. For example, the input stream \\\u006e results in the three characters \ \ n because the third \ is eligible and thus \u006e is translated to n, while the first \ and second \ are preserved as raw input characters. The three characters \ \ n are subsequently interpreted in a string literal as \ n which denotes the escape sequence for a linefeed. (The input stream \\\u006e may also be written as \u005c\u005c\u006e.)
-----
Alex
On 6/21/2021 4:41 PM, Alex Buckley wrote:
There's no question that the first six raw input characters \ u 0 0 5 c are identified as a Unicode escape \u005c and translated to a backslash.

The question is whether that backslash is then treated as:

1. a raw input character \ that is followed by seven more raw input characters \ \ u 0 0 5 d. For these *eight* raw input characters, there are three raw input character \'s in a row. Due to contiguous-\ counting, the third raw input character \ is eligible to begin a Unicode escape; the first and second pass through and you get \ \ ] which further translates within a string literal as \]

or

2. something which is independent of the subsequent seven raw input characters \ \ u 0 0 5 d. For those *seven* subsequent raw input characters, there are two raw input character \'s in a row. Due to contiguous-\ counting, the second raw input character \ is not eligible to begin a Unicode escape, so all seven raw input characters pass through. You get (including the first "independent" backslash) \ \ \ u 0 0 5 d

The contiguous-\ counting is due to the fact that \\ is the escape sequence for backslash in a string literal, so we don't want too many raw \ input characters to "disappear" into Unicode escapes.

The JDK 15 behavior was #1. That looks correct to me. \ u 0 0 5 c becomes a raw input character \ that cannot serve as the opening backslash for an *immediate* Unicode escape (the classic JLS 3.3 scenario of \u005cu005a) but that can serve as a raw input character for the purpose of skipping over \\ pairs (the purpose of contiguous-\ counting) in order for a *later* Unicode escape to be recognized (\u005d).
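The two readings can be compared side by side with a small Java sketch. Everything here is hypothetical (the class TranslationSketch, the method translate, and the flag countTranslated are invented for illustration), and it simplifies the real rules: an eligible backslash that is followed by u but not by four hex digits is simply passed through rather than reported as an error. It is not the javac implementation.

```java
public class TranslationSketch {
    // True when raw[i] is a backslash followed by one or more 'u' characters
    // and then four hexadecimal digits.
    static boolean looksLikeEscape(String raw, int i) {
        int j = i + 1;
        if (j >= raw.length() || raw.charAt(j) != 'u') return false;
        while (j < raw.length() && raw.charAt(j) == 'u') j++;
        if (j + 4 > raw.length()) return false;
        for (int k = j; k < j + 4; k++) {
            if (Character.digit(raw.charAt(k), 16) < 0) return false;
        }
        return true;
    }

    // countTranslated == true  : reading 1, a backslash produced by an escape
    //                            counts toward the contiguous-backslash tally.
    // countTranslated == false : reading 2, such a backslash is independent
    //                            of the characters that follow it.
    static String translate(String raw, boolean countTranslated) {
        StringBuilder out = new StringBuilder();
        int run = 0; // contiguous backslashes seen immediately before position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '\\' && run % 2 == 0 && looksLikeEscape(raw, i)) {
                int j = i + 1;
                while (raw.charAt(j) == 'u') j++;
                char decoded = (char) Integer.parseInt(raw.substring(j, j + 4), 16);
                out.append(decoded);
                run = (decoded == '\\' && countTranslated) ? run + 1 : 0;
                i = j + 4;
            } else {
                out.append(c);
                run = (c == '\\') ? run + 1 : 0;
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The raw input from the original report; backslashes are doubled here
        // so that this file's own escape processing leaves them alone.
        String raw = "\\u005C\\\\u005D";
        System.out.println(translate(raw, true));  // reading 1
        System.out.println(translate(raw, false)); // reading 2
    }
}
```

Under these assumptions the first call yields the three characters \ \ ] and the second yields \ \ \ u 0 0 5 D, matching the two outcomes listed above.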

Does "how many other \ characters contiguously precede it" refer to
preceding raw input characters, or does it refer to preceding
characters after unicode escape processing is performed on them?

Where JLS 3.3 says "translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit (§3.1) for the indicated hexadecimal value", it really means "translating the ASCII characters \u followed by four hexadecimal digits to *a raw input character which denotes* the UTF-16 code unit (§3.1) for the indicated hexadecimal value".

Thus, the later clause "for each raw input character that is a backslash \, input processing must consider how many other [raw input] \ characters contiguously precede it" can be seen more easily to include characters that result from Unicode escape processing.

Alex

On 6/21/2021 2:56 PM, Jim Laskey wrote:
"\u005C" should have been treated as a backslash. Will check into it.

Cheers,

— Jim

On Jun 21, 2021, at 6:28 PM, Liam Miller-Cushon <cushon at google.com> wrote:

class T {
   public static void main(String[] args) {
     System.err.println("\u005C\\u005D");
   }
}

Before JDK-8254073, this prints `\]`.

After JDK-8254073, unicode escape processing results in `\\\u005D`, which results in an 'invalid escape' error for `\u`. Was that deliberate?

JLS 3.3 says

for each raw input character that is a backslash \, input processing must consider how many other \ characters contiguously precede it, separating it from a non-\ character or the start of the input stream. If this number is even, then the \ is eligible to begin a Unicode escape; if the number is odd, then the \ is not eligible to begin a Unicode escape.
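As a rough illustration of the quoted rule, the following Java sketch counts the \ characters that contiguously precede a position and derives eligibility from the parity of that count. The class and method names are made up for this example, and it looks only at raw input characters, which is one of the two possible readings of the rule.

```java
public class EligibilitySketch {
    // Hypothetical helper: counts the backslashes contiguously preceding raw[index].
    static int precedingBackslashes(String raw, int index) {
        int count = 0;
        for (int i = index - 1; i >= 0 && raw.charAt(i) == '\\'; i--) {
            count++;
        }
        return count;
    }

    // A backslash is eligible to begin a Unicode escape when that count is even.
    static boolean isEligible(String raw, int index) {
        return raw.charAt(index) == '\\' && precedingBackslashes(raw, index) % 2 == 0;
    }

    public static void main(String[] args) {
        // Raw input: two backslashes, then the letter u and four hex digits.
        String raw = "\\\\u005D";
        System.out.println(isEligible(raw, 0)); // true: no backslash precedes it
        System.out.println(isEligible(raw, 1)); // false: one backslash precedes it
    }
}
```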

The difference is in whether `\u005C` (the unicode escape for `\`) counts as one of the `\` preceding a valid unicode escape.

Does "how many other \ characters contiguously precede it" refer to preceding raw input characters, or does it refer to preceding characters after unicode escape processing is performed on them?

JLS 3.3 also mentions that a "character produced by a Unicode escape does not participate in further Unicode escapes", but I'm not sure if that applies here, since in the pre-JDK-8254073 interpretation the unicode-escaped backslash isn't really 'participating' in the second unicode escape.

Thanks,
Liam
