RFR: 8247546: Pattern matching does not skip correctly over supplementary characters

naoto.sato at oracle.com naoto.sato at oracle.com
Tue Jul 28 02:11:38 UTC 2020


Hi Joe,

Thank you for the review!

On 7/27/20 6:04 PM, Joe Wang wrote:
> Hi Naoto,
> 
> Looks good to me, using the correct supplementary-aware implementation 
> is an improvement for the impl to handle the situation.
> 
> A note (probably a workaround) for the user's case, to replace invalid 
> surrogate characters, he could have used Unicode escape sequences 
> [\ud800-\udfff] instead.

Yes.

> 
> For the test, I wonder if it'd be better to combine the 1st and 2nd, 3rd 
> and 4th cases into: xxx\udca9\ud83d\udca9\ud83dyyy to represent two lone 
> surrogates plus a valid pair.

If we combine test strings into the one mentioned above, I think it 
cannot distinguish the case we wanted to test. They all match with the 
first unpaired surrogate, i.e., \udca9.

> 
> A minor comment on the comments of the test cases (in 
> SupplementaryTestCases.txt): instead of repeating the bug id (that is: 
> // @bug 8247546), it may be good to provide a short note (e.g. match 
> lone/invalid surrogates).

Right, there is no reason to leave bug id since it can be found from hg 
log. Instead, I replaced them with more useful short notes.

Revised webrev:
https://cr.openjdk.java.net/~naoto/8247546/webrev.02/

Naoto

> 
> Regards,
> Joe
> 
> On 7/27/20 2:18 PM, naoto.sato at oracle.com wrote:
>> On 7/27/20 11:51 AM, naoto.sato at oracle.com wrote:
>>> Apart from the original issue, there was a bug in Range() method 
>>> (Pattern.java:5795), so it is fixed along.
>>
>> Created a test case for this:
>>
>> --- a/test/jdk/java/util/regex/SupplementaryTestCases.txt
>> +++ b/test/jdk/java/util/regex/SupplementaryTestCases.txt
>> @@ -149,6 +149,11 @@
>>  \ud83d\udca9
>>  false 0
>>
>> +// @bug 8247546
>> +[\x{dc00}-\x{dfff}]
>> +\ud83d\udca9
>> +false 0
>> +
>>  // use of x modifier
>>  \ud800\udc61bc(?x)bl\ud800\udc61h
>>  \ud800\udc61bcbl\ud800\udc61h
>>
>> Low surrogate range check falls into using BmpCharPredicate, which 
>> results in the same bug. The entire webrev is also revised:
>>
>> http://cr.openjdk.java.net/~naoto/8247546/webrev.01/
>>
>> Naoto
> 


More information about the core-libs-dev mailing list