RFR: 8247546: Pattern matching does not skip correctly over supplementary characters

Tue Jul 28 01:04:14 UTC 2020

Hi Naoto,

Looks good to me, using the correct supplementary-aware implementation 
is an improvement for the impl to handle the situation.

A note (probably a workaround) for the user's case, to replace invalid 
surrogate characters, he could have used Unicode escape sequences 
[\ud800-\udfff] instead.

For the test, I wonder if it'd be better to combine the 1st and 2nd, 3rd 
and 4th cases into: xxx\udca9\ud83d\udca9\ud83dyyy to represent two lone 
surrogates plus a valid pair.

A minor comment on the comments of the test cases (in 
SupplementaryTestCases.txt): instead of repeating the bug id (that is: 
// @bug 8247546), it may be good to provide a short note (e.g. match 
lone/invalid surrogates).

Regards,
Joe

On 7/27/20 2:18 PM, naoto.sato at oracle.com wrote:
> On 7/27/20 11:51 AM, naoto.sato at oracle.com wrote:
>> Apart from the original issue, there was a bug in Range() method 
>> (Pattern.java:5795), so it is fixed along.
>
> Created a test case for this:
>
> --- a/test/jdk/java/util/regex/SupplementaryTestCases.txt
> +++ b/test/jdk/java/util/regex/SupplementaryTestCases.txt
> @@ -149,6 +149,11 @@
>  \ud83d\udca9
>  false 0
>
> +// @bug 8247546
> +[\x{dc00}-\x{dfff}]
> +\ud83d\udca9
> +false 0
> +
>  // use of x modifier
>  \ud800\udc61bc(?x)bl\ud800\udc61h
>  \ud800\udc61bcbl\ud800\udc61h
>
> Low surrogate range check falls into using BmpCharPredicate, which 
> results in the same bug. The entire webrev is also revised:
>
> http://cr.openjdk.java.net/~naoto/8247546/webrev.01/
>
> Naoto