RFR - JDK-8202442 - String::unescape (Code Review)

Jim Laskey james.laskey at oracle.com
Thu Sep 20 12:50:19 UTC 2018



> On Sep 19, 2018, at 7:21 PM, Stuart Marks <stuart.marks at oracle.com> wrote:
> 
> 
> 
> On 9/18/18 10:51 AM, Jim Laskey wrote:
>> Please review the code for String::unescape. It is used to translate escape sequences in a string, typically a raw string literal, into the characters those escapes represent.
>> webrev: http://cr.openjdk.java.net/~jlaskey/8202442/webrev/index.html
>> jbs: https://bugs.openjdk.java.net/browse/JDK-8202442
>> csr: https://bugs.openjdk.java.net/browse/JDK-8202443
> 
> Hi Jim,
> 
> For citing the JLS, there's a @jls javadoc tag that you might want to use. There are a couple of usages elsewhere in String.java already.

Will add.

> 
> Is there going to be an escape() method that does the inverse of this? I thought that this was part of your original suite of string enhancements. Will this be proposed separately, or is it unnecessary?

The general feeling is that it is unnecessary. The inverse method is also fraught with danger; there are too many decision points for various characters. For example, does '\r' translate to '\r', '\015', or '\u000D'? Does '\0' translate to '\0' or '\u0000'?
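
To illustrate those decision points, here is a minimal sketch that lists the spellings an escape() could plausibly pick for a single char. The class EscapeAmbiguity and the helper candidates() are hypothetical names made up for this example; they are not part of the webrev or of any proposed API.

```java
import java.util.ArrayList;
import java.util.List;

public class EscapeAmbiguity {
    // Hypothetical helper (an assumption for illustration only): returns
    // every escape spelling that denotes the given char.
    static List<String> candidates(char c) {
        List<String> out = new ArrayList<>();
        // Named escapes exist only for a handful of control characters.
        switch (c) {
            case '\b': out.add("\\b"); break;
            case '\t': out.add("\\t"); break;
            case '\n': out.add("\\n"); break;
            case '\f': out.add("\\f"); break;
            case '\r': out.add("\\r"); break;
            default: break;
        }
        // Octal escapes cover chars 0 through 255 (octal 377).
        if (c <= 0377) {
            out.add("\\" + Integer.toOctalString(c));
        }
        // The four-hex-digit form can spell any 16-bit char value.
        out.add(String.format("\\u%04X", (int) c));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(candidates('\r')); // prints [\r, \15, \u000D]
        System.out.println(candidates('\0')); // prints [\0, \u0000]
    }
}
```

Any of the listed spellings round-trips to the same char, so an escape() would have to pick one policy and some callers would want another.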

> 
> 
> 2979      * Each unicode escape in the form \unnnn is translated to the
> 2980      * unicode character whose code point is {@code 0xnnnn}. Care should be
> 2981      * taken when using UTF-16 surrogate pairs to ensure that the high
> 2982      * surrogate (U+D800..U+DBFF) is immediately followed by a low surrogate
> 2983      * (U+DC00..U+DFFF) otherwise a
> 2984      * {@link java.nio.charset.CharacterCodingException} may occur during UTF-8
> 2985      * decoding.
> 
> 
> I know you're going to update this based on Naoto's comments, but I'd suggest rethinking this section. The \unnnn construct is called a "Unicode escape" per JLS 3.3, but how it's handled has little to do with Unicode. The nnnn digits are simply translated into a 16-bit 'char' value. Any such value will work, even if it's an invalid UTF-16 code unit (such as 0xFFF0) or an unpaired surrogate.
> 
> I believe this is consistent with the JLS treatment of \unnnn.
> 
> It might be sufficient to say that \unnnn is translated into a 16-bit 'char' value, and leave it at that.

Sure.
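
For reference, a minimal sketch of the "just a 16-bit char value" reading. The class UnicodeEscapeSketch and the helper fromHex4() are hypothetical names for this example and are not the implementation in the webrev; the sketch only shows that four hex digits map straight to a char, with no UTF-16 validity check.

```java
public class UnicodeEscapeSketch {
    // Hypothetical helper (an assumption for illustration only): turns the
    // four hex digits of an escape into the corresponding 16-bit char.
    static char fromHex4(String nnnn) {
        if (nnnn.length() != 4) {
            throw new IllegalArgumentException("expected 4 hex digits: " + nnnn);
        }
        int value = 0;
        for (int i = 0; i < 4; i++) {
            int digit = Character.digit(nnnn.charAt(i), 16);
            if (digit < 0) {
                throw new IllegalArgumentException("not a hex digit: " + nnnn.charAt(i));
            }
            value = (value << 4) | digit;
        }
        // No surrogate or assignment checks: any 16-bit value is accepted.
        return (char) value;
    }

    public static void main(String[] args) {
        char loneHigh = fromHex4("D800");
        String s = "a" + loneHigh + "b";
        System.out.println(Character.isHighSurrogate(loneHigh)); // prints true
        System.out.println(s.length());                          // prints 3
    }
}
```

The unpaired surrogate sits in the String as an ordinary char; whether it causes trouble is a question for whatever later encodes the string.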

> 
> s'marks


