RFR: 8247546: Pattern matching does not skip correctly over supplementary characters
Joe Wang
huizhe.wang at oracle.com
Tue Jul 28 17:21:21 UTC 2020
Hi Naoto,
The new webrev looks good.
Best,
Joe
On 7/27/20 7:11 PM, naoto.sato at oracle.com wrote:
> Hi Joe,
>
> Thank you for the review!
>
> On 7/27/20 6:04 PM, Joe Wang wrote:
>> Hi Naoto,
>>
>> Looks good to me, using the correct supplementary-aware
>> implementation is an improvement for the impl to handle the situation.
>>
>> A note (probably a workaround) for the user's case, to replace
>> invalid surrogate characters, he could have used Unicode escape
>> sequences [\ud800-\udfff] instead.
>
> Yes.
>
>>
>> For the test, I wonder if it'd be better to combine the 1st and 2nd,
>> 3rd and 4th cases into: xxx\udca9\ud83d\udca9\ud83dyyy to represent
>> two lone surrogates plus a valid pair.
>
> If we combine test strings into the one mentioned above, I think it
> cannot distinguish the case we wanted to test. They all match with the
> first unpaired surrogate, i.e., \udca9.
>
>>
>> A minor comment on the comments of the test cases (in
>> SupplementaryTestCases.txt): instead of repeating the bug id (that
>> is: // @bug 8247546), it may be good to provide a short note (e.g.
>> match lone/invalid surrogates).
>
> Right, there is no reason to leave bug id since it can be found from
> hg log. Instead, I replaced them with more useful short notes.
>
> Revised webrev:
> https://cr.openjdk.java.net/~naoto/8247546/webrev.02/
>
> Naoto
>
>>
>> Regards,
>> Joe
>>
>> On 7/27/20 2:18 PM, naoto.sato at oracle.com wrote:
>>> On 7/27/20 11:51 AM, naoto.sato at oracle.com wrote:
>>>> Apart from the original issue, there was a bug in Range() method
>>>> (Pattern.java:5795), so it is fixed along.
>>>
>>> Created a test case for this:
>>>
>>> --- a/test/jdk/java/util/regex/SupplementaryTestCases.txt
>>> +++ b/test/jdk/java/util/regex/SupplementaryTestCases.txt
>>> @@ -149,6 +149,11 @@
>>> \ud83d\udca9
>>> false 0
>>>
>>> +// @bug 8247546
>>> +[\x{dc00}-\x{dfff}]
>>> +\ud83d\udca9
>>> +false 0
>>> +
>>> // use of x modifier
>>> \ud800\udc61bc(?x)bl\ud800\udc61h
>>> \ud800\udc61bcbl\ud800\udc61h
>>>
>>> Low surrogate range check falls into using BmpCharPredicate, which
>>> results in the same bug. The entire webrev is also revised:
>>>
>>> http://cr.openjdk.java.net/~naoto/8247546/webrev.01/
>>>
>>> Naoto
>>
More information about the core-libs-dev
mailing list