[jdk17] RFR: JDK-8269150 UnicodeReader not translating \u005c\\u005d to \\] [v8]

Mon Jul 26 17:25:38 UTC 2021

On Fri, 23 Jul 2021 13:31:50 GMT, Jim Laskey <jlaskey at openjdk.org> wrote:

>> This issue relates to *Unicode escapes*, described in section 3.3 of the JLS. javac interprets Unicode escapes during the reading of ASCII characters from source. Later on, javac interprets *escape sequences*, described in section 3.10.7 of the JLS, during the tokenization of character literals, string literals, and text blocks. Escape sequences are only indirectly affected by this bug.
>> 
>> During reading, a _normal backslash_ (that is, the ASCII `\` character, not the corresponding Unicode escape `\u005c`) followed by another normal backslash is treated collectively as a pair of backslash characters. No further interpretation is done. This means that if a normal backslash immediately precedes the sequence `\` `u` `A` `B` `C` `D`, which would "normally" be interpreted as a Unicode escape, then the interpretation of that sequence as a Unicode escape is suppressed.
>> 
>> For example, the sequence `\u2022` would be interpreted as the `•` character, whereas `\\u2022` would be interpreted as the seven characters `\` `\` `u` `2` `0` `2` `2`.
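
The two cases above can be checked with a small Java program. This is an illustrative sketch, not code from the PR; the class name `NormalBackslashDemo` and its field names are made up. The second string has six characters at run time rather than seven, because the tokenizer later applies the `\\` escape sequence to the backslash pair that the reader left alone.

```java
// Hypothetical demo class; not part of the change under review.
class NormalBackslashDemo {
    // The reader translates this Unicode escape, so the literal holds a single bullet character.
    static final String TRANSLATED = "\u2022";

    // Two normal backslashes form a pair, so the reader leaves the following u2022 untouched.
    // The tokenizer then turns the pair into one backslash: backslash, u, 2, 0, 2, 2.
    static final String SUPPRESSED = "\\u2022";

    public static void main(String[] args) {
        System.out.println(TRANSLATED.length());          // 1
        System.out.println(SUPPRESSED.length());          // 6
        System.out.println(SUPPRESSED.charAt(0) == '\\'); // true
    }
}
```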
>> 
>> An issue arises when Java developers choose to use a _Unicode escape backslash_ `\u005c` in their source code, instead of a normal backslash. Prior to JDK 16, if the Unicode escape backslash was followed by a second Unicode escape, then *the second Unicode escape was always interpreted*. The normal backslash at the beginning of the second Unicode escape (immediately followed by `u`) was *not* paired with the preceding Unicode escape backslash. Otherwise, a normal backslash following the `\u005c` was paired with it.
>> 
>> For example, the sequence `\u005c\u2022` would be interpreted as `\` and `•`, whereas `\u005c\tXYZ` would be interpreted as `\` `\` `t` `X` `Y` `Z`.
>> 
>> The bug in JDK 16 was that `\u005c` was treated as having no effect on Unicode interpretation. Using the example from the compiler-dev discussion, `\u005c\\u005d`:
>> 
>> - Prior to JDK 16, it was interpreted as `\` `\` `]`
>> - JDK 16 interpreted it as `\` `\` `\` `u` `0` `0` `5` `d` which would produce a syntax error downstream in the lexer because the escape sequence `\u` is invalid.
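
The rules in the description can be modeled in a few lines of Java. This is only a sketch of the behavior as described in this thread, not javac's `UnicodeReader` code; the class `UnicodeEscapeSketch`, the methods `translate` and `startsEscape`, and the flag `pairAfterEscapedBackslash` are hypothetical names. Passing `true` models the pre-JDK 16 behavior, and `false` models JDK 16.

```java
// Hypothetical model of the reading rules described in this thread; not javac's implementation.
class UnicodeEscapeSketch {

    // True if the character at index i is a backslash immediately followed by 'u'.
    static boolean startsEscape(String src, int i) {
        return src.charAt(i) == '\\' && i + 1 < src.length() && src.charAt(i + 1) == 'u';
    }

    // Translates Unicode escapes in raw source text (the characters as they sit in the file).
    static String translate(String src, boolean pairAfterEscapedBackslash) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            char ch = src.charAt(i);
            if (ch != '\\') {
                out.append(ch);
                i++;
            } else if (startsEscape(src, i)) {
                int j = i + 1;
                while (j < src.length() && src.charAt(j) == 'u') {
                    j++; // one or more 'u' characters may follow the backslash
                }
                int code = 0;
                for (int k = 0; k < 4; k++, j++) {
                    int digit = j < src.length() ? Character.digit(src.charAt(j), 16) : -1;
                    if (digit < 0) {
                        throw new IllegalArgumentException("illegal Unicode escape at index " + i);
                    }
                    code = (code << 4) | digit;
                }
                out.append((char) code);
                i = j;
                // Pre-JDK 16: an escaped backslash pairs with a following normal backslash,
                // unless that backslash itself begins a Unicode escape.
                if (pairAfterEscapedBackslash && code == '\\'
                        && i < src.length() && src.charAt(i) == '\\' && !startsEscape(src, i)) {
                    out.append('\\');
                    i++;
                }
            } else if (i + 1 < src.length() && src.charAt(i + 1) == '\\') {
                out.append("\\\\"); // pair of normal backslashes, no further interpretation
                i += 2;
            } else {
                out.append('\\');
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Every backslash is doubled so that javac hands the raw text to translate() unchanged.
        String raw = "\\u005c\\\\u005d";
        System.out.println(translate(raw, true));
        System.out.println(translate(raw, false));
    }
}
```

Under this model, `main` prints `\\]` for the pre-JDK 16 mode and `\\\u005d` for the JDK 16 mode, matching the two bullets above.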
>
> Jim Laskey has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 11 additional commits since the last revision:
> 
>  - Merge branch 'master' into 8269150
>  - Update UnicodeBackslash test to be easier to follow
>  - Remove comment duplicated by merge
>  - Merge branch 'master' into 8269150
>  - Merge branch '8269150b' into 8269150
>  - Use jdk15 logic
>  - Proposed change
>  - Merge branch 'master' into 8269150
>  - Updated the test to include all combinations
>  - Merge branch 'master' into 8269150
>  - ... and 1 more: https://git.openjdk.java.net/jdk17/compare/3e29056e...3bc5789c

Marked as reviewed by darcy (Reviewer).

-------------

PR: https://git.openjdk.java.net/jdk17/pull/126