RFR: JDK-8263261 Extend String::translateEscapes to support unicode escapes [v12]

Fri Jan 26 17:04:38 UTC 2024

On Fri, 26 Jan 2024 15:06:50 GMT, Jim Laskey <jlaskey at openjdk.org> wrote:

>> Currently String::translateEscapes does not support Unicode escapes; they are reported as an IllegalArgumentException("Invalid escape sequence: ..."). String::translateEscapes should translate Unicode escape sequences to provide full coverage.
>
> Jim Laskey has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 12 additional commits since the last revision:
> 
>  - Merge remote-tracking branch 'upstream/master' into 8263261
>  - Update unicode to Unicode
>  - Requested changes
>  - Update String.java
>  - Requested changes
>  - Update Copyright
>  - Update copyright year of test
>  - Add JLS Unicode Escapes reference
>  - Update comment
>  - Update copyright year
>  - ... and 2 more: https://git.openjdk.org/jdk/compare/b94b04ff...040bda82

src/java.base/share/classes/java/lang/String.java line 4229:

> 4227:      *     <th scope="row">{@code \u005Cu...uXXXX}</th>
> 4228:      *     <td>Unicode escape</td>
> 4229:      *     <td>single UTF-16 code unit equivalent</td>

The `...` makes it less clear what is being shown.  It might be clearer to include the XXXX in the resulting value and drop the multiple `u` case.

src/java.base/share/classes/java/lang/String.java line 4245:

> 4243:      * escape sequences and Unicode escapes are translated as encountered in one pass and
> 4244:      * <strong>not</strong> done as an Unicode escapes pass followed by an escape sequences
> 4245:      * pass.

I would move the description of the compiler behavior to the end and remove "also". For example:
Suggestion:

     * @implNote As a convenience for use with constructed
     * strings, this method translates Unicode escapes. For example, this
     * method could be used when ASCII encoded text files need to maintain Unicode
     * content. The translation is done in a single pass and is non-recursive. That is,
     * escape sequences and Unicode escapes are translated as encountered in one pass and
     * <strong>not</strong> done as a Unicode escapes pass followed by an escape sequences
     * pass. By comparison, the compiler translates all Unicode escapes before string
     * literals are translated.
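
To make the single-pass wording concrete, here is a minimal sketch of the difference between translating in one pass and translating in two passes. It is not the JDK implementation: `SinglePassSketch`, `translateOnce`, and `translateTwoPass` are hypothetical names, and the sketch handles only `\n`, `\\`, and Unicode escapes. For the input `\u005Cn`, one pass should yield a backslash followed by `n`, while the compiler-like two-pass order should yield a newline.

```java
// Sketch only: NOT the JDK implementation of String::translateEscapes.
// All names here are hypothetical. Only backslash-n, backslash-backslash
// and Unicode escapes (backslash, one or more 'u', four hex digits) are handled.
final class SinglePassSketch {

    /** One left-to-right pass; translated output is never rescanned. */
    static String translateOnce(String s) {
        return pass(s, true, true);
    }

    /** Compiler-like order, for contrast: Unicode escapes first, then escape sequences. */
    static String translateTwoPass(String s) {
        return pass(pass(s, true, false), false, true);
    }

    private static String pass(String s, boolean unicode, boolean sequences) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i >= s.length()) {
                if (sequences) {
                    throw new IllegalArgumentException("Dangling backslash");
                }
                out.append(c); // leave it for the later pass to reject
                continue;
            }
            char e = s.charAt(i++);
            if (e == 'u' && unicode) {
                // Skip any additional 'u' markers, then read exactly four hex digits.
                while (i < s.length() && s.charAt(i) == 'u') {
                    i++;
                }
                int code = 0;
                for (int k = 0; k < 4; k++) {
                    int d = i < s.length() ? Character.digit(s.charAt(i++), 16) : -1;
                    if (d < 0) {
                        throw new IllegalArgumentException("Malformed Unicode escape");
                    }
                    code = (code << 4) | d;
                }
                out.append((char) code);
            } else if (!sequences) {
                out.append(c).append(e); // not handled in this pass; copy the pair through
            } else if (e == 'n') {
                out.append('\n');
            } else if (e == '\\') {
                out.append('\\');
            } else {
                throw new IllegalArgumentException("Invalid escape sequence: \\" + e);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "\\u005Cn"; // seven chars: backslash, u, 0, 0, 5, C, n
        System.out.println(translateOnce(input).length());    // 2: backslash followed by n
        System.out.println(translateTwoPass(input).length()); // 1: a single newline
    }
}
```

An input of this shape is one possible candidate for the extra test case, since it is the kind of string where the two orders disagree.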

test/jdk/java/lang/String/TranslateEscapes.java line 97:

> 95:         verifyUnicodeEscape("\\u2022", "\u2022");
> 96:         verifyUnicodeEscape("\\ud83c\\udf09", "\ud83c\udf09");
> 97:         verifyUnicodeEscape("\\uuuuu2022", "\uuuuu2022");

Include the code from the example as a test case too.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17491#discussion_r1467892757
PR Review Comment: https://git.openjdk.org/jdk/pull/17491#discussion_r1467895901
PR Review Comment: https://git.openjdk.org/jdk/pull/17491#discussion_r1467900516