RFR: 8247546: Pattern matching does not skip correctly over supplementary characters

Joe Wang huizhe.wang at oracle.com
Tue Jul 28 17:21:21 UTC 2020


Hi Naoto,

The new webrev looks good.

Best,
Joe

On 7/27/20 7:11 PM, naoto.sato at oracle.com wrote:
> Hi Joe,
>
> Thank you for the review!
>
> On 7/27/20 6:04 PM, Joe Wang wrote:
>> Hi Naoto,
>>
>> Looks good to me, using the correct supplementary-aware 
>> implementation is an improvement for the impl to handle the situation.
>>
>> A note (probably a workaround) for the user's case, to replace 
>> invalid surrogate characters, he could have used Unicode escape 
>> sequences [\ud800-\udfff] instead.
>
> Yes.
>
>>
>> For the test, I wonder if it'd be better to combine the 1st and 2nd, 
>> 3rd and 4th cases into: xxx\udca9\ud83d\udca9\ud83dyyy to represent 
>> two lone surrogates plus a valid pair.
>
> If we combine test strings into the one mentioned above, I think it 
> cannot distinguish the case we wanted to test. They all match with the 
> first unpaired surrogate, i.e., \udca9.
>
>>
>> A minor comment on the comments of the test cases (in 
>> SupplementaryTestCases.txt): instead of repeating the bug id (that 
>> is: // @bug 8247546), it may be good to provide a short note (e.g. 
>> match lone/invalid surrogates).
>
> Right, there is no reason to leave bug id since it can be found from 
> hg log. Instead, I replaced them with more useful short notes.
>
> Revised webrev:
> https://cr.openjdk.java.net/~naoto/8247546/webrev.02/
>
> Naoto
>
>>
>> Regards,
>> Joe
>>
>> On 7/27/20 2:18 PM, naoto.sato at oracle.com wrote:
>>> On 7/27/20 11:51 AM, naoto.sato at oracle.com wrote:
>>>> Apart from the original issue, there was a bug in Range() method 
>>>> (Pattern.java:5795), so it is fixed along.
>>>
>>> Created a test case for this:
>>>
>>> --- a/test/jdk/java/util/regex/SupplementaryTestCases.txt
>>> +++ b/test/jdk/java/util/regex/SupplementaryTestCases.txt
>>> @@ -149,6 +149,11 @@
>>>  \ud83d\udca9
>>>  false 0
>>>
>>> +// @bug 8247546
>>> +[\x{dc00}-\x{dfff}]
>>> +\ud83d\udca9
>>> +false 0
>>> +
>>>  // use of x modifier
>>>  \ud800\udc61bc(?x)bl\ud800\udc61h
>>>  \ud800\udc61bcbl\ud800\udc61h
>>>
>>> Low surrogate range check falls into using BmpCharPredicate, which 
>>> results in the same bug. The entire webrev is also revised:
>>>
>>> http://cr.openjdk.java.net/~naoto/8247546/webrev.01/
>>>
>>> Naoto
>>



More information about the core-libs-dev mailing list