JDK-8254073, unicode escape preprocessing, and \u005C

Thu Jul 22 20:25:59 UTC 2021

Many thanks for this analysis Liam.

We analyzed a corpus that we use internally for javac experiments. 
Enough programs assume that \u005C\u005C means \\ that we are loath to
make it mean something else in 17, no matter how much the idea of 
"something else" is appealing because it would streamline the JLS or 
simplify the javac implementation.

The case of \u005C\\u005D is more straightforward because it translated 
to \\] both in javac 15 and (with some generosity of intent) in the JLS.

Accordingly, we're picking option 2 -- we'll fix the JLS to fully 
explain the javac 15 behavior (including why \u005C\u005C means \\ and 
not \\u005C), and fix javac 17 to treat \u005C\\u005D like javac 15 did.

Alex

On 7/15/2021 8:12 PM, Liam Miller-Cushon wrote:
> In case it's helpful, here's some more context on what I've seen of the 
> compatibility impact.
> 
> I originally noticed the change in a single project that contained three 
> examples of \u005C\\u005D. That code is in an obsolete version of the 
> 'stanford-parser' library, and I think that example may have been 
> incorrect for either interpretation of the escapes. That example can be 
> seen here:
> 
> $ wget http://nlp.stanford.edu/software/stanford-parser-2011-06-23.tgz 
> $ tar xzvf stanford-parser-2011-06-23.tgz
> $ grep 'u005C' \
>     ./stanford-parser-2011-06-23/src/edu/stanford/nlp/international/arabic/pipeline/DefaultLexicalMapper.java \
>     ./stanford-parser-2011-06-23/src/edu/stanford/nlp/parser/lexparser/FrenchUnknownWordModel.java
> 
> I evaluated the fix in option (1) on my employer's codebase and didn't 
> see any regressions. I also realized that another tool we use that 
> processes Java source and implements its own unicode escape processing 
> has been implementing the same approach as the proposed fix all along. 
> So the difference doesn't affect a lot of code that I have visibility 
> into, for whatever that's worth.
> 
> I did some searching for occurrences of \u005C\u005C, which is an 
> example that would be interpreted differently under the new rules, and 
> found three of those:
> 
> * javadoc in java.lang.String, where I think the intent was \\: 
> https://github.com/openjdk/jdk/blob/e35005d5ce383ddd108096a3079b17cb0bcf76f1/src/java.base/share/classes/java/lang/String.java#L3916 
> 
> * a test case in sun.net.idn: 
> https://github.com/openjdk/jdk/blob/e35005d5ce383ddd108096a3079b17cb0bcf76f1/test/jdk/sun/net/idn/TestStringPrep.java#L131 
> 
> * a test utility where the intent was \\
> 
> So based on a sample size of 3, there are more examples of \u005C\u005C 
> where the intent was \\ than \\u005C. But either way, examples of this 
> seem to be extremely rare.
> 
> Personally I think the compatibility impact from (1) will be minimal and 
> so I'd lean towards the option that's most intuitive to specify and
> implement, but I defer to your judgement of the compatibility impact and 
> if you proceed with (2) that's fine with me.
> 
> Liam
> 
> On Thu, Jul 15, 2021 at 10:44 AM Jim Laskey <james.laskey at oracle.com> wrote:
> 
>     The fallout from the discussion here and via the CSR
>     (https://bugs.openjdk.java.net/browse/JDK-8269290) is that we have
>     two choices (noting that today is RD2):
> 
>     1) Proceed with the proposed bug fix and strengthen the existing
>     interpretation in the JLS.
> 
>     2) Withdraw the CSR, fix the bug to replicate the behaviour seen
>     prior to JDK 16 and rework the JLS to reflect that behaviour.
> 
>     At this point, Alex and I feel the correct choice is 2). This choice
>     has the least risk and is likely the least disruptive.
> 
>     If we see no objections here, we will move forward early next week.
> 
>     — Jim
> 
> 
> 
>>     On Jul 6, 2021, at 8:26 PM, Alex Buckley <alex.buckley at oracle.com> wrote:
>>
>>     (I slightly reordered the table in JDK-8269290 in support of the
>>     following exposition.)
>>
>>     Since time immemorial, X = "\u005C\\u005D" has printed `\]` on the
>>     grounds that the six opening characters \ u 0 0 5 C form a
>>     backslash for the purpose of counting how many backslash
>>     characters contiguously precede the backslash in the final six
>>     characters \ u 0 0 5 D. (Two, making the backslash in \ u 0 0 5 D
>>     eligible to begin a Unicode escape.)
>>
>>     Given Y = "\u005C\u005C\u005D", it's consistent for the six
>>     opening characters \ u 0 0 5 C to again form a backslash for the
>>     purpose of counting how many backslash characters contiguously
>>     precede the backslash in the middle six characters \ u 0 0 5 C.
>>     Thus, "\u005C\u005C..." is treated the same as "\\u005C...".
>>
>>     I acknowledge this is an incompatible change, but consider the
>>     alternative. If the six opening characters *didn't* contribute a
>>     backslash to the count for \ u 0 0 5 C in the Y case, then the
>>     same six opening characters wouldn't contribute a backslash to the
>>     count for \ u 0 0 5 D in the X case. (In this alternative
>>     universe, people take the rule "The character produced by a
>>     Unicode escape does not participate in further Unicode escapes."
>>     literally.) Thus, in the X case, there would only be one
>>     backslash, denoted by the ASCII character, preceding the final six
>>     characters \ u 0 0 5 D  ==>  the \ in \ u 0 0 5 D would not be
>>     eligible  ==>  X would lex as \ \ \ u 0 0 5 D and print as
>>     `\\u005D` which is plain wrong.
>>
>>     (Maybe there is some application of the "longest possible
>>     translation" rule from 3.2 that lets the same six opening
>>     characters become a backslash-that-counts in X but not become a
>>     backslash-that-counts in Y. However, I do not know how to describe
>>     that application.)
>>
>>
>>     Here's another test case for the CSR. JDK 15 does this:
>>
>>     jshell> System.out.println("\\u005D");
>>     \u005D
>>
>>     jshell> System.out.println("\u005C\u005D");
>>     |  Error:
>>     |  illegal escape character
>>     |  System.out.println("\u005C\u005D");
>>     |                                 ^
>>
>>     With my consistency-first approach, the Z = "\u005C\u005D" case is
>>     legal, which seems far more reasonable than illegal. The six
>>     opening characters \ u 0 0 5 C form a backslash for the purpose of
>>     counting how many backslash characters contiguously precede the
>>     backslash in the final six characters \ u 0 0 5 D. (One, meaning
>>     the backslash in \ u 0 0 5 D is not eligible to begin a Unicode
>>     escape.) The result would be \ \ u 0 0 5 D which would print as
>>     `\u005D`.
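The second stage, where the already-translated characters are read as string-literal escape sequences, can be approximated with `String.translateEscapes()` (Java 15 and later), which does not itself process Unicode escapes. This is only a sketch: it assumes the first stage produced \ ] under the JDK 15 behaviour (inferred from the error shown above, not stated in the thread) and \ \ u 0 0 5 D under the consistency-first rule. `LiteralStage` and `interpret` are hypothetical names.

```java
final class LiteralStage {
    /**
     * Interprets string-literal escape sequences in text that has already
     * been through Unicode-escape translation. Returns null when the text
     * contains an escape sequence that a string literal would reject.
     */
    static String interpret(String afterUnicodeStep) {
        try {
            return afterUnicodeStep.translateEscapes();
        } catch (IllegalArgumentException e) {
            return null; // stands in for the "illegal escape character" error
        }
    }

    public static void main(String[] args) {
        // Consistency-first result for Z: two backslashes, then u 0 0 5 D.
        System.out.println(interpret("\\\\u005D"));
        // Assumed JDK 15 result for Z: one backslash, then a closing bracket.
        System.out.println(interpret("\\]"));
    }
}
```

Under these assumptions the first call yields the six characters \ u 0 0 5 D and the second yields null, mirroring the legal and illegal outcomes described for Z.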
>>
>>
>>     Net net, I favor the correct fix -- and lots more test cases in
>>     the JCK.
>>
>>     Alex
>>
>>     On 7/2/2021 5:47 AM, Jim Laskey wrote:
>>>     Just so it doesn't look like I went rogue with the bug fix
>>>     (https://bugs.openjdk.java.net/browse/JDK-8269290), I would
>>>     like a consensus ruling on which bug fix I should use:
>>>
>>>     correct fix:
>>>         interpretAsPerJLS();
>>>
>>>     faithful fix:
>>>         if (sourceLevel <= 15)
>>>             interpretOldWay();
>>>         else
>>>             interpretAsPerJLS();
>>>
>>>     status quo fix:
>>>         interpretOldWay();
>>>
>>>     I'm assuming correct fix, but others may have different assumptions.
>>>
>>>     Cheers,
>>>     -- Jim
>>>>     On Jun 25, 2021, at 4:04 PM, Alex Buckley <alex.buckley at oracle.com> wrote:
>>>>
>>>>     I filed https://bugs.openjdk.java.net/browse/JDK-8269406 with some
>>>>     additional discussion about what the result of the first lexical
>>>>     translation step is really meant to be.
>>>>
>>>>     Please take a look if you are familiar with the three-step
>>>>     translation described in JLS 3.2, and care about how the input
>>>>     stream is processed.
>>>>
>>>>     Alex
>>>>
>>>>     On 6/22/2021 10:38 AM, Alex Buckley wrote:
>>>>>     I am minded to extend the final note in JLS 3.3 to help people
>>>>>     understand the multi-level escape story in play when they
>>>>>     experiment with Unicode escapes. Perhaps it will also improve
>>>>>     some javac error messages or test cases. Let me know what you
>>>>>     think of this:
>>>>>     -----
>>>>>     For example, the input stream \u005cu005a results in the six
>>>>>     characters \ u 0 0 5 a, because 005c is the Unicode value for
>>>>>     \. It does not result in the character Z, which is Unicode
>>>>>     character 005a, because the \ that resulted from the \u005c is
>>>>>     not interpreted as the start of a further Unicode escape.
>>>>>
>>>>>     Note that \u005cu005a cannot be written in a string literal to
>>>>>     denote the six characters \ u 0 0 5 a. This is because the
>>>>>     first two characters resulting from translation, \ and u, are
>>>>>     interpreted in a string literal as an illegal escape sequence
>>>>>     (3.10.7).
>>>>>
>>>>>     Fortunately, the rule about contiguous \ characters helps
>>>>>     programmers to craft input streams that denote Unicode escapes
>>>>>     in a string literal. Denoting the six characters \ u 0 0 5 a in
>>>>>     a string literal simply requires another \ to be written
>>>>>     adjacent to the existing \, such as in "Z is \\u005a". This
>>>>>     works because the second \ in the input stream \\u005a is not
>>>>>     eligible, so the first \ and second \ are preserved as raw
>>>>>     input characters; they are subsequently interpreted in a string
>>>>>     literal as the escape sequence for a backslash, resulting in
>>>>>     the desired six characters \ u 0 0 5 a. Without the rule, the
>>>>>     input stream \\u005a would be translated as the raw input
>>>>>     character \ followed by the Unicode escape \u005a (Z), but \Z
>>>>>     is an illegal escape sequence in a string literal.
>>>>>
>>>>>     The rule also allows programmers to craft input streams that
>>>>>     denote escape sequences in a string literal. For example, the
>>>>>     input stream \\\u006e results in the three characters \ \ n
>>>>>     because the third \ is eligible and thus \u006e is translated
>>>>>     to n, while the first \ and second \ are preserved as raw input
>>>>>     characters. The three characters \ \ n are subsequently
>>>>>     interpreted in a string literal as the two characters \ n,
>>>>>     which spell out the escape sequence for a linefeed. (The input
>>>>>     stream \\\u006e may also be written as \u005c\u005c\u006e.)
>>>>>     -----
>>>>>     Alex
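The two string-literal cases in the proposed note can be checked directly against a compiler. The sketch below assumes a compiler that applies the contiguous-backslash rule as described in the note; the class and field names are made up for the example, and the comments spell escapes out in words so that they are not themselves translated.

```java
final class LiteralEscapeDemo {
    // Source text: two backslashes, then u 0 0 5 a. The second backslash is
    // preceded by one backslash (odd), so it is not eligible and no Unicode
    // escape is translated; the pair is then the escape for one backslash.
    static final String SIX_CHARS = "\\u005a";

    // Source text: three backslashes, then u 0 0 6 e. The third backslash is
    // preceded by two (even), so it is eligible and the escape becomes n;
    // the first two backslashes are then the escape for one backslash.
    static final String BACKSLASH_N = "\\\u006e";

    public static void main(String[] args) {
        System.out.println("Z is " + SIX_CHARS);
        System.out.println(BACKSLASH_N.length());
    }
}
```

With such a compiler, `SIX_CHARS` holds the six characters \ u 0 0 5 a and `BACKSLASH_N` holds the two characters \ n rather than an actual linefeed.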
>>>>>     On 6/21/2021 4:41 PM, Alex Buckley wrote:
>>>>>>     There's no question that the first six raw input characters \
>>>>>>     u 0 0 5 c are identified as a Unicode escape \u005c and
>>>>>>     translated to a backslash.
>>>>>>
>>>>>>     The question is whether that backslash is then treated as:
>>>>>>
>>>>>>     1. a raw input character \ that is followed by seven more raw
>>>>>>     input characters \ \ u 0 0 5 d. For these *eight* raw input
>>>>>>     characters, there are three raw input character \'s in a row.
>>>>>>     Due to contiguous-\ counting, the third raw input character \
>>>>>>     is eligible to begin a Unicode escape; the first and second
>>>>>>     pass through and you get \ \ ] which further translates within
>>>>>>     a string literal as \]
>>>>>>
>>>>>>     or
>>>>>>
>>>>>>     2. something which is independent of the subsequent seven raw
>>>>>>     input characters \ \ u 0 0 5 d. For those *seven* subsequent
>>>>>>     raw input characters, there are two raw input character \'s in
>>>>>>     a row. Due to contiguous-\ counting, the second raw input
>>>>>>     character \ is not eligible to begin a Unicode escape, so all
>>>>>>     seven raw input characters pass through. You get (including
>>>>>>     the first "independent" backslash) \ \ \ u 0 0 5 d
>>>>>>
>>>>>>
>>>>>>     The contiguous-\ counting is due to the fact that \\ is the
>>>>>>     escape sequence for backslash in a string literal, so we don't
>>>>>>     want too many raw \ input characters to "disappear" into
>>>>>>     Unicode escapes.
>>>>>>
>>>>>>
>>>>>>     The JDK 15 behavior was #1. That looks correct to me. \ u 0 0
>>>>>>     5 c becomes a raw input character \ that cannot serve as the
>>>>>>     opening backslash for an *immediate* Unicode escape (the
>>>>>>     classic JLS 3.3 scenario of \u005cu005a) but that can serve as
>>>>>>     a raw input character for the purpose of skipping over \\
>>>>>>     pairs (the purpose of contiguous-\ counting) in order for a
>>>>>>     *later* Unicode escape to be recognized (\u005d).
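The two readings can be compared with a small simulator. This is only a sketch under simplifying assumptions: the names `EscapeReadings`, `escapeEnd`, `translate`, and `translatedBackslashCounts` are invented here, malformed escapes are passed through instead of being reported as errors, and none of it is taken from javac.

```java
final class EscapeReadings {
    /** Index just past a well-formed escape starting at i, or -1 if none starts there. */
    static int escapeEnd(String s, int i) {
        int j = i + 1;
        while (j < s.length() && s.charAt(j) == 'u') j++;     // one or more u
        if (j == i + 1 || j + 4 > s.length()) return -1;
        for (int k = j; k < j + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) return -1;
        }
        return j + 4;
    }

    /**
     * Translates Unicode escapes in raw input. When translatedBackslashCounts
     * is true, a backslash produced by an escape extends the run of contiguous
     * backslashes (reading 1); when false, it resets the run (reading 2).
     */
    static String translate(String raw, boolean translatedBackslashCounts) {
        StringBuilder out = new StringBuilder();
        int run = 0; // backslashes contiguously preceding position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c != '\\') { out.append(c); run = 0; i++; continue; }
            // Only a backslash preceded by an even run is eligible.
            int end = (run % 2 == 0) ? escapeEnd(raw, i) : -1;
            if (end < 0) { out.append('\\'); run++; i++; continue; }
            char t = (char) Integer.parseInt(raw.substring(end - 4, end), 16);
            out.append(t);
            run = (t == '\\' && translatedBackslashCounts) ? run + 1 : 0;
            i = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Raw text: escape for backslash, two backslashes, then u 0 0 5 D.
        String raw = "\\u005C\\\\u005D";
        System.out.println(translate(raw, true));   // reading 1
        System.out.println(translate(raw, false));  // reading 2
    }
}
```

For that input, the flag set to true gives \ \ ] and the flag set to false gives \ \ \ u 0 0 5 D, matching the two outcomes listed above.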
>>>>>>
>>>>>>>     Does "how many other \ characters contiguously precede it"
>>>>>>>     refer to
>>>>>>>     preceding raw input characters, or does it refer to preceding
>>>>>>>     characters after unicode escape processing is performed on them?
>>>>>>
>>>>>>     Where JLS 3.3 says "translating the ASCII characters \u
>>>>>>     followed by four hexadecimal digits to the UTF-16 code unit
>>>>>>     (§3.1) for the indicated hexadecimal value", it really means
>>>>>>     "translating the ASCII characters \u followed by four
>>>>>>     hexadecimal digits to *a raw input character which denotes*
>>>>>>     the UTF-16 code unit (§3.1) for the indicated hexadecimal value".
>>>>>>
>>>>>>     Thus, the later clause "for each raw input character that is a
>>>>>>     backslash \, input processing must consider how many other
>>>>>>     [raw input] \ characters contiguously precede it" can be seen
>>>>>>     more easily to include characters that result from Unicode
>>>>>>     escape processing.
>>>>>>
>>>>>>     Alex
>>>>>>
>>>>>>     On 6/21/2021 2:56 PM, Jim Laskey wrote:
>>>>>>>     "\u005C" should have been treated as a backslash. Will check
>>>>>>>     into it.
>>>>>>>
>>>>>>>     Cheers,
>>>>>>>
>>>>>>>     — Jim
>>>>>>>
>>>>>>>>     On Jun 21, 2021, at 6:28 PM, Liam Miller-Cushon <cushon at google.com> wrote:
>>>>>>>>
>>>>>>>>     
>>>>>>>>     class T {
>>>>>>>>        public static void main(String[] args) {
>>>>>>>>          System.err.println("\u005C\\u005D");
>>>>>>>>        }
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     Before JDK-8254073, this prints `\]`.
>>>>>>>>
>>>>>>>>     After JDK-8254073, unicode escape processing results in
>>>>>>>>     `\\\u005D`, which results in an 'invalid escape' error for
>>>>>>>>     `\u`. Was that deliberate?
>>>>>>>>
>>>>>>>>     JLS 3.3 says
>>>>>>>>
>>>>>>>>>     for each raw input character that is a backslash \, input
>>>>>>>>>     processing must consider how many other \ characters
>>>>>>>>>     contiguously precede it, separating it from a non-\
>>>>>>>>>     character or the start of the input stream. If this number
>>>>>>>>>     is even, then the \ is eligible to begin a Unicode escape;
>>>>>>>>>     if the number is odd, then the \ is not eligible to begin a
>>>>>>>>>     Unicode escape.
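As a rough illustration of the counting rule quoted above, the sketch below counts contiguous preceding backslashes over the raw characters only, without translating anything. The class and method names are made up for this example, and it is not meant to reflect how javac is implemented.

```java
import java.util.ArrayList;
import java.util.List;

final class BackslashEligibility {
    /**
     * For each backslash in the raw input, in order, reports whether an even
     * number of backslashes contiguously precede it (true means eligible to
     * begin a Unicode escape under the quoted rule).
     */
    static List<Boolean> eligibility(String raw) {
        List<Boolean> result = new ArrayList<>();
        int run = 0; // length of the backslash run ending just before index i
        for (int i = 0; i < raw.length(); i++) {
            if (raw.charAt(i) == '\\') {
                result.add(run % 2 == 0);
                run++;
            } else {
                run = 0; // any other character breaks the run
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Raw text of the literal in the test program: escape for backslash,
        // two backslashes, then u 0 0 5 D.
        System.out.println(eligibility("\\u005C\\\\u005D"));
    }
}
```

On the raw characters alone, the three backslashes of that literal come out as eligible, eligible, and not eligible, which corresponds to the post-JDK-8254073 result.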
>>>>>>>>
>>>>>>>>     The difference is in whether `\u005C` (the unicode escape
>>>>>>>>     for `\`) counts as one of the `\` preceding a valid unicode
>>>>>>>>     escape.
>>>>>>>>
>>>>>>>>     Does "how many other \ characters contiguously precede it"
>>>>>>>>     refer to preceding raw input characters, or does it refer to
>>>>>>>>     preceding characters after unicode escape processing is
>>>>>>>>     performed on them?
>>>>>>>>
>>>>>>>>     JLS 3.3 also mentions that a "character produced by a
>>>>>>>>     Unicode escape does not participate in further Unicode
>>>>>>>>     escapes", but I'm not sure if that applies here, since in
>>>>>>>>     the pre-JDK-8254073 interpretation the unicode-escaped
>>>>>>>>     backslash isn't really 'participating' in the second unicode
>>>>>>>>     escape.
>>>>>>>>
>>>>>>>>     Thanks,
>>>>>>>>     Liam
>