JDK-8254073, unicode escape preprocessing, and \u005C
Liam Miller-Cushon
cushon at google.com
Fri Jul 16 03:12:07 UTC 2021
In case it's helpful, here's some more context on what I've seen of the
compatibility impact.
I originally noticed the change in a single project that contained three
examples of \u005C\\u005D. That code is in an obsolete version of the
'stanford-parser' library, and I think that example may have been incorrect
for either interpretation of the escapes. That example can be seen here:
$ wget http://nlp.stanford.edu/software/stanford-parser-2011-06-23.tgz
$ tar xzvf stanford-parser-2011-06-23.tgz
$ grep 'u005C' \
    ./stanford-parser-2011-06-23/src/edu/stanford/nlp/international/arabic/pipeline/DefaultLexicalMapper.java \
    ./stanford-parser-2011-06-23/src/edu/stanford/nlp/parser/lexparser/FrenchUnknownWordModel.java
I evaluated the fix in option (1) on my employer's codebase and didn't see
any regressions. I also realized that another tool we use that processes
Java source and implements its own unicode escape processing has been
implementing the same approach as the proposed fix all along. So the
difference doesn't affect a lot of code that I have visibility into, for
whatever that's worth.
I did some searching for occurrences of \u005C\u005C, which is an example
that would be interpreted differently under the new rules, and found three
of those:
* javadoc in java.lang.String, where I think the intent was \\:
https://github.com/openjdk/jdk/blob/e35005d5ce383ddd108096a3079b17cb0bcf76f1/src/java.base/share/classes/java/lang/String.java#L3916
* a test case in sun.net.idn:
https://github.com/openjdk/jdk/blob/e35005d5ce383ddd108096a3079b17cb0bcf76f1/test/jdk/sun/net/idn/TestStringPrep.java#L131
* a test utility where the intent was \\
So based on a sample size of 3, there are more examples of \u005C\u005C
where the intent was \\ than \\u005C. But either way, examples of this seem
to be extremely rare.
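
A search like this can be approximated with a short Java sketch. The class
and method names are hypothetical, and the pattern is an assumption about
what should count as a match (two back-to-back Unicode escapes for U+005C);
it is not necessarily how the search above was done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AdjacentBackslashEscapes {
    // Two back-to-back Unicode escapes for U+005C (backslash). One or more
    // 'u' characters may follow each backslash, and the hex digit C may be
    // written in either case. Assumption: a plain text match is enough, so a
    // hit whose first backslash is itself escaped is still reported.
    private static final Pattern PAIR =
            Pattern.compile("(?:\\\\u+005[cC]){2}");

    /** Returns the start offset of every match in the given source text. */
    static List<Integer> find(String source) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = PAIR.matcher(source);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String bs = "\\"; // a single backslash, assembled at run time
        String line = "String s = \"" + bs + "u005C" + bs + "u005C\";";
        System.out.println(find(line));
    }
}
```

Each hit still needs to be read by hand to decide what the author intended.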
Personally I think the compatibility impact from (1) will be minimal, so
I'd lean towards the option that's most intuitive to specify and
implement, but I defer to your judgement of the compatibility impact, and
if you proceed with (2) that's fine with me.
Liam
On Thu, Jul 15, 2021 at 10:44 AM Jim Laskey <james.laskey at oracle.com> wrote:
> The fallout from the discussion here and via the CSR (
> https://bugs.openjdk.java.net/browse/JDK-8269290) is that we have two
> choices (noting that today is RD2):
>
> 1) Proceed with the proposed bug fix and strengthen the existing
> interpretation in the JLS.
>
> 2) Withdraw the CSR, fix the bug to replicate the behaviour seen prior to
> JDK 16 and rework the JLS to reflect that behaviour.
>
> At this point, Alex and I feel the correct choice is 2). This choice has
> the least risk and is likely the least disruptive.
>
> If we see no objections here, we will move forward early next week.
>
> — Jim
>
>
>
> On Jul 6, 2021, at 8:26 PM, Alex Buckley <alex.buckley at oracle.com> wrote:
>
> (I slightly reordered the table in JDK-8269290 in support of the following
> exposition.)
>
> Since time immemorial, X = "\u005C\\u005D" has printed `\]` on the grounds
> that the six opening characters \ u 0 0 5 C form a backslash for the
> purpose of counting how many backslash characters contiguously precede the
> backslash in the final six characters \ u 0 0 5 D. (Two, making the
> backslash in \ u 0 0 5 D eligible to begin a Unicode escape.)
>
> Given Y = "\u005C\u005C\u005D", it's consistent for the six opening
> characters \ u 0 0 5 C to again form a backslash for the purpose of
> counting how many backslash characters contiguously precede the backslash
> in the middle six characters \ u 0 0 5 C. Thus, "\u005C\u005C..." is
> treated the same as "\\u005C...".
>
> I acknowledge this is an incompatible change, but consider the
> alternative. If the six opening characters *didn't* contribute a backslash
> to the count for \ u 0 0 5 C in the Y case, then the same six opening
> characters wouldn't contribute a backslash to the count for \ u 0 0 5 D in
> the X case. (In this alternative universe, people take the rule "The
> character produced by a Unicode escape does not participate in further
> Unicode escapes." literally.) Thus, in the X case, there would only be one
> backslash, denoted by the ASCII character, preceding the final six
> characters \ u 0 0 5 D ==> the \ in \ u 0 0 5 D would not be eligible
> ==> X would lex as \ \ \ u 0 0 5 D and print as `\\u005D` which is plain
> wrong.
>
> (Maybe there is some application of the "longest possible translation"
> rule from 3.2 that lets the same six opening characters become a
> backslash-that-counts in X but not become a backslash-that-counts in Y.
> However, I do not know how to describe that application.)
>
>
> Here's another test case for the CSR. JDK 15 does this:
>
> jshell> System.out.println("\\u005D");
> \u005D
>
> jshell> System.out.println("\u005C\u005D");
> | Error:
> | illegal escape character
> | System.out.println("\u005C\u005D");
> | ^
>
> With my consistency-first approach, the Z = "\u005C\u005D" case is legal,
> which seems far more reasonable than illegal. The six opening characters \
> u 0 0 5 C form a backslash for the purpose of counting how many backslash
> characters contiguously precede the backslash in the final six characters \
> u 0 0 5 D. (One, meaning the backslash in \ u 0 0 5 D is not eligible to
> begin a Unicode escape.) The result would be \ \ u 0 0 5 D which would
> print as `\u005D`.
>
>
> Net net, I favor the correct fix -- and lots more test cases in the JCK.
>
> Alex
>
> On 7/2/2021 5:47 AM, Jim Laskey wrote:
>
> Just so it doesn't look like I went rogue with the bug fix (
> https://bugs.openjdk.java.net/browse/JDK-8269290), I would like a
> consensus ruling on which bug fix I should use:
> correct fix:
>     interpretAsPerJLS();
> faithful fix:
>     if (sourceLevel <= 15)
>         interpretOldWay();
>     else
>         interpretAsPerJLS();
> status quo fix:
>     interpretOldWay();
> I'm assuming correct fix, but others may have different assumptions.
> Cheers,
> -- Jim
>
> On Jun 25, 2021, at 4:04 PM, Alex Buckley <alex.buckley at oracle.com> wrote:
>
> I filed https://bugs.openjdk.java.net/browse/JDK-8269406 with some
> additional discussion about what the result of the first lexical
> translation step is really meant to be.
>
> Please take a look if you are familiar with the three-step translation
> described in JLS 3.2, and care about how the input stream is processed.
>
> Alex
>
> On 6/22/2021 10:38 AM, Alex Buckley wrote:
>
> I am minded to extend the final note in JLS 3.3 to help people understand
> the multi-level escape story in play when they experiment with Unicode
> escapes. Perhaps it will also improve some javac error messages or test
> cases. Let me know what you think of this:
> -----
> For example, the input stream \u005cu005a results in the six characters \
> u 0 0 5 a, because 005c is the Unicode value for \. It does not result in
> the character Z, which is Unicode character 005a, because the \ that
> resulted from the \u005c is not interpreted as the start of a further
> Unicode escape.
> Note that \u005cu005a cannot be written in a string literal to denote the
> six characters \ u 0 0 5 a. This is because the first two characters
> resulting from translation, \ and u, are interpreted in a string literal as
> an illegal escape sequence (3.10.7).
> Fortunately, the rule about contiguous \ characters helps programmers to
> craft input streams that denote Unicode escapes in a string literal.
> Denoting the six characters \ u 0 0 5 a in a string literal simply requires
> another \ to be written adjacent to the existing \, such as in "Z is
> \\u005a". This works because the second \ in the input stream \\u005a is
> not eligible, so the first \ and second \ are preserved as raw input
> characters; they are subsequently interpreted in a string literal as the
> escape sequence for a backslash, resulting in the desired six characters \
> u 0 0 5 a. Without the rule, the input stream \\u005a would be translated
> as the raw input character \ followed by the Unicode escape \u005a (Z), but
> \Z is an illegal escape sequence in a string literal.
> The rule also allows programmers to craft input streams that denote escape
> sequences in a string literal. For example, the input stream \\\u006e
> results in the three characters \ \ n because the third \ is eligible and
> thus \u006e is translated to n, while the first \ and second \ are
> preserved as raw input characters. The three characters \ \ n are
> subsequently interpreted in a string literal as \ n which denotes the
> escape sequence for a linefeed. (The input stream \\\u006e may also be
> written as \u005c\u005c\u006e.)
> -----
> Alex
> On 6/21/2021 4:41 PM, Alex Buckley wrote:
>
> There's no question that the first six raw input characters \ u 0 0 5 c
> are identified as a Unicode escape \u005c and translated to a backslash.
>
> The question is whether that backslash is then treated as:
>
> 1. a raw input character \ that is followed by seven more raw input
> characters \ \ u 0 0 5 d. For these *eight* raw input characters, there
> are three raw input character \'s in a row. Due to contiguous-\ counting,
> the third raw input character \ is eligible to begin a Unicode escape; the
> first and second pass through and you get \ \ ] which further translates
> within a string literal as \]
>
> or
>
> 2. something which is independent of the subsequent seven raw input
> characters \ \ u 0 0 5 d. For those *seven* subsequent raw input
> characters, there are two raw input character \'s in a row. Due to
> contiguous-\ counting, the second raw input character \ is not eligible to
> begin a Unicode escape, so all seven raw input characters pass through. You
> get (including the first "independent" backslash) \ \ \ u 0 0 5 d
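
The two readings in the list above can be compared with a small Java
sketch. This is an illustration under stated assumptions, not javac's
implementation: the class UnicodeEscapeSketch, the method translate, and
the flag countTranslated are hypothetical names, and malformed escapes are
passed through rather than reported as errors. The inputs are assembled at
run time so that the Java compiler does not translate them itself.

```java
class UnicodeEscapeSketch {

    /**
     * Translates Unicode escapes in raw input text.
     * countTranslated = true:  a backslash produced by an escape counts
     *                          toward the contiguous-backslash tally (reading 1).
     * countTranslated = false: it does not count (reading 2).
     */
    static String translate(String raw, boolean countTranslated) {
        StringBuilder out = new StringBuilder();
        int run = 0; // backslashes contiguously preceding position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c != '\\') {
                out.append(c);
                run = 0;
                i++;
                continue;
            }
            // Skip over one or more 'u' characters after the backslash.
            int j = i + 1;
            while (j < raw.length() && raw.charAt(j) == 'u') {
                j++;
            }
            boolean eligible = run % 2 == 0; // even count: may begin an escape
            if (eligible && j > i + 1 && isHex4(raw, j)) {
                char decoded = (char) Integer.parseInt(raw.substring(j, j + 4), 16);
                out.append(decoded);
                run = (decoded == '\\' && countTranslated) ? run + 1 : 0;
                i = j + 4;
            } else {
                out.append(c); // an ordinary backslash passes through
                run++;
                i++;
            }
        }
        return out.toString();
    }

    private static boolean isHex4(String s, int from) {
        if (from + 4 > s.length()) {
            return false;
        }
        for (int k = from; k < from + 4; k++) {
            if (Character.digit(s.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String bs = "\\"; // one backslash
        String x = bs + "u005C" + bs + bs + "u005D";
        String y = bs + "u005C" + bs + "u005C" + bs + "u005D";
        String z = bs + "u005C" + bs + "u005D";
        for (String raw : new String[] {x, y, z}) {
            System.out.println(raw + "  counted: " + translate(raw, true)
                    + "  not counted: " + translate(raw, false));
        }
    }
}
```

For the first input, counting the translated backslash yields two
backslashes followed by ], while not counting it yields three backslashes
followed by u005D, which are the two outcomes described in the list above.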
>
>
> The contiguous-\ counting is due to the fact that \\ is the escape
> sequence for backslash in a string literal, so we don't want too many raw
> input \ characters to "disappear" into Unicode escapes.
>
>
> The JDK 15 behavior was #1. That looks correct to me. \ u 0 0 5 c becomes
> a raw input character \ that cannot serve as the opening backslash for an
> *immediate* Unicode escape (the classic JLS 3.3 scenario of \u005cu005a)
> but that can serve as a raw input character for the purpose of skipping
> over \\ pairs (the purpose of contiguous-\ counting) in order for a *later*
> Unicode escape to be recognized (\u005d).
>
> Does "how many other \ characters contiguously precede it" refer to
> preceding raw input characters, or does it refer to preceding
> characters after unicode escape processing is performed on them?
>
>
> Where JLS 3.3 says "translating the ASCII characters \u followed by four
> hexadecimal digits to the UTF-16 code unit (§3.1) for the indicated
> hexadecimal value", it really means "translating the ASCII characters \u
> followed by four hexadecimal digits to *a raw input character which
> denotes* the UTF-16 code unit (§3.1) for the indicated hexadecimal value".
>
> Thus, the later clause "for each raw input character that is a backslash
> \, input processing must consider how many other [raw input] \ characters
> contiguously precede it" can be seen more easily to include characters that
> result from Unicode escape processing.
>
> Alex
>
> On 6/21/2021 2:56 PM, Jim Laskey wrote:
>
> "\u005C" should have been treated as a backslash. Will check into it.
>
> Cheers,
>
> — Jim
>
>
>
> On Jun 21, 2021, at 6:28 PM, Liam Miller-Cushon <cushon at google.com> wrote:
>
>
> class T {
>   public static void main(String[] args) {
>     System.err.println("\u005C\\u005D");
>   }
> }
>
> Before JDK-8254073, this prints `\]`.
>
> After JDK-8254073, unicode escape processing results in `\\\u005D`, which
> results in an 'invalid escape' error for `\u`. Was that deliberate?
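
The 'invalid escape' step can be illustrated separately from Unicode escape
translation. The sketch below is a hypothetical, simplified stand-in for
string-literal escape processing (JLS 3.10.7), not javac's code: the class
StringEscapeSketch and the method unescape are made-up names, and octal
escapes are deliberately left out. It takes text that has already been
through Unicode escape translation.

```java
class StringEscapeSketch {

    /**
     * Interprets a subset of string-literal escape sequences.
     * Throws IllegalArgumentException for anything outside that subset,
     * including a backslash followed by the letter u.
     */
    static String unescape(String body) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i + 1 >= body.length()) {
                throw new IllegalArgumentException("dangling backslash at " + i);
            }
            char next = body.charAt(++i);
            switch (next) {
                case 'b':  out.append('\b'); break;
                case 't':  out.append('\t'); break;
                case 'n':  out.append('\n'); break;
                case 'f':  out.append('\f'); break;
                case 'r':  out.append('\r'); break;
                case '"':  out.append('"');  break;
                case '\'': out.append('\''); break;
                case '\\': out.append('\\'); break;
                default:
                    throw new IllegalArgumentException(
                            "illegal escape character '" + next + "' at " + i);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String bs = "\\"; // one backslash
        // Two backslashes then ]: the pair collapses to a single backslash.
        System.out.println(unescape(bs + bs + "]"));
        try {
            // Three backslashes then u005D: the third backslash meets a 'u'.
            unescape(bs + bs + bs + "u005D");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The first call corresponds to the output before JDK-8254073 and the second
to the output after it, as described above.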
>
> JLS 3.3 says
>
> for each raw input character that is a backslash \, input processing must
> consider how many other \ characters contiguously precede it, separating it
> from a non-\ character or the start of the input stream. If this number is
> even, then the \ is eligible to begin a Unicode escape; if the number is
> odd, then the \ is not eligible to begin a Unicode escape.
>
>
> The difference is in whether `\u005C` (the unicode escape for `\`) counts
> as one of the `\` preceding a valid unicode escape.
>
> Does "how many other \ characters contiguously precede it" refer to
> preceding raw input characters, or does it refer to preceding characters
> after unicode escape processing is performed on them?
>
> JLS 3.3 also mentions that a "character produced by a Unicode escape does
> not participate in further Unicode escapes", but I'm not sure if that
> applies here, since in the pre-JDK-8254073 interpretation the
> unicode-escaped backslash isn't really 'participating' in the second
> unicode escape.
>
> Thanks,
> Liam
>
>
>