RFR - JDK-8202442 - String::unescape (Code Review)

Wed Sep 19 22:21:18 UTC 2018

On 9/18/18 10:51 AM, Jim Laskey wrote:
> Please review the code for String::unescape, which translates escape sequences in a string, typically a raw string literal, into the characters those escapes represent.
> 
> webrev: http://cr.openjdk.java.net/~jlaskey/8202442/webrev/index.html
> jbs: https://bugs.openjdk.java.net/browse/JDK-8202442
> csr: https://bugs.openjdk.java.net/browse/JDK-8202443

Hi Jim,

For citing the JLS, there's a @jls javadoc tag that you might want to use. There
are a couple of usages elsewhere in String.java already.
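
For illustration, a doc comment using the tag might look roughly like the
sketch below. The class and method are hypothetical placeholders, and the
section number and title are only my guess at what the unescape docs would
cite. Note that @jls is a custom taglet in the JDK build, so a stock javadoc
run outside the JDK would likely complain about an unknown tag.

```java
// Hypothetical placeholder class; only the doc comment layout matters here.
final class JlsTagSketch {
    private JlsTagSketch() {}

    /**
     * Returns the argument unchanged. The body is a placeholder so that the
     * doc comment below has a method to attach to.
     *
     * @param s the input string
     * @return {@code s}
     * @jls 3.10.6 Escape Sequences for Character and String Literals
     */
    static String placeholder(String s) {
        return s;
    }
}
```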

Is there going to be an escape() method that does the inverse of this? I thought 
that this was part of your original suite of string enhancements. Will this be 
proposed separately, or is it unnecessary?

     * Each unicode escape in the form \unnnn is translated to the
     * unicode character whose code point is {@code 0xnnnn}. Care should be
     * taken when using UTF-16 surrogate pairs to ensure that the high
     * surrogate (U+D800..U+DBFF) is immediately followed by a low surrogate
     * (U+DC00..U+DFFF) otherwise a
     * {@link java.nio.charset.CharacterCodingException} may occur during UTF-8
     * decoding.

I know you're going to update this based on Naoto's comments, but I'd suggest 
rethinking this section. The \unnnn construct is called a "Unicode escape" per 
JLS 3.3, but how it's handled has little to do with Unicode. The nnnn digits are 
simply translated into a 16-bit 'char' value. Any such value will work, even if
it doesn't correspond to an assigned Unicode character (such as 0xFFF0) or is an
unpaired surrogate.

I believe this is consistent with the JLS treatment of \unnnn.

It might be sufficient to say that \unnnn is translated into a 16-bit 'char' 
value, and leave it at that.
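
To make the point concrete, here is a minimal sketch of that translation. It
is not the code in the webrev; the class and method names are made up, and it
deliberately ignores details such as repeated 'u' characters and the JLS rule
about counting preceding backslashes. It only shows that four hex digits
become one 'char', with no surrogate or validity checks.

```java
// Hypothetical sketch, not the webrev implementation.
final class UnicodeEscapeSketch {
    private UnicodeEscapeSketch() {}

    // Replaces each backslash-u escape followed by four hex digits with the
    // corresponding 16-bit char. All other characters are copied unchanged.
    static String translate(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int value = parseEscapeAt(s, i);
            if (value >= 0) {
                sb.append((char) value); // no surrogate or validity check
                i += 6;                  // backslash, 'u', four hex digits
            } else {
                sb.append(s.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }

    // Returns the 16-bit value of an escape starting at index i, or -1 if
    // there is no complete escape there. Character.digit also accepts
    // non-ASCII digits, which is a simplification in this sketch.
    private static int parseEscapeAt(String s, int i) {
        if (i + 6 > s.length() || s.charAt(i) != '\\' || s.charAt(i + 1) != 'u') {
            return -1;
        }
        int value = 0;
        for (int k = i + 2; k < i + 6; k++) {
            int d = Character.digit(s.charAt(k), 16);
            if (d < 0) {
                return -1;
            }
            value = (value << 4) | d;
        }
        return value;
    }

    public static void main(String[] args) {
        // An unpaired high surrogate is accepted as an ordinary char.
        String lone = translate("\\uD800");
        System.out.println(lone.length());                       // 1
        System.out.println(Integer.toHexString(lone.charAt(0))); // d800

        // Two escapes that happen to form a surrogate pair yield two chars
        // that together make up one supplementary code point.
        String pair = translate("\\uD83D\\uDE00");
        System.out.println(pair.length());                       // 2
        System.out.println(pair.codePointCount(0, pair.length())); // 1
    }
}
```

Nothing in the loop needs to know about UTF-16 pairing, which is why I think
the doc can simply describe the result as a 'char' value.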

s'marks