RFR 8237599 : Greedy matching against supplementary chars fails to respect the region

Ivan Gerasimov ivan.gerasimov at oracle.com
Wed Mar 25 15:47:57 UTC 2020


Thank you, Roger, for the review!


On 3/25/20 6:56 AM, Roger Riggs wrote:
> Hi Ivan,
>
> Looks fine.
>
> Interesting edge case; it would never be seen with 8-bit charsets.
>
> Thanks, Roger
>
>
> On 3/21/20 3:15 AM, Ivan Gerasimov wrote:
>> Gentle ping.
>>
>> The webrev was rebased to accommodate recent changes in RegExTest.java.
>>
>> The fix handles an edge case that is presumably not too common.
>>
>> Nevertheless, I think it is important to handle it correctly.
>>
>> Thanks in advance!
>>
>> Ivan
>>
>>
>> On 1/22/20 8:23 PM, Ivan Gerasimov wrote:
>>> Hello everyone!
>>>
>>> When the input of a j.u.regex.Matcher is restricted with the 
>>> .region() method, a region boundary can fall in the middle of a 
>>> surrogate pair, cutting off one of its halves.
>>>
>>> It turns out that greedy matching implemented in the 
>>> Pattern.CharPropertyGreedy class fails to recognize this edge case 
>>> in two scenarios:
>>>
>>> 1) When it greedily consumes the input and meets the high (leading) 
>>> surrogate of a pair whose low surrogate was cut off by the end of 
>>> the region, and
>>>
>>> 2) When it backs off and meets the low (trailing) surrogate of a 
>>> pair whose high surrogate lies before the start of the region.
>>>
>>> In both cases, the engine reads the entire code point, crossing the 
>>> boundaries of the set region.
>>>
>>> Instead, it should only read the half of the surrogate pair that 
>>> lies inside the region and ignore the other half.
>>>
>>> Would you please help review the fix?
>>>
>>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8237599
>>> WEBREV: http://cr.openjdk.java.net/~igerasim/8237599/00/webrev/
>>>
>>> Thanks in advance!
>>>
>
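
The two scenarios from the original request can be illustrated with a small sketch. It is not the code from the webrev; the class name RegionSurrogateDemo and the helpers readForward and readBackward are hypothetical and only show the difference between reading a code point without bounds and reading it with the region boundary as a limit. The final Matcher call merely sets up a region that splits a pair; what it prints depends on whether the JDK in use contains this fix, so no output is claimed for it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegionSurrogateDemo {

    // U+1F600 is encoded in UTF-16 as the surrogate pair \uD83D \uDE00,
    // so the chars are: [0]='a', [1]=high, [2]=low, [3]='b'.
    static final String INPUT = "a\uD83D\uDE00b";

    // Hypothetical helper: reads forward at 'index' without looking at
    // any char at or past 'limit' (the end of the region).
    static int readForward(String s, int index, int limit) {
        return Character.codePointAt(s.toCharArray(), index, limit);
    }

    // Hypothetical helper: reads backward from 'index' without looking at
    // any char before 'start' (the beginning of the region).
    static int readBackward(String s, int index, int start) {
        return Character.codePointBefore(s.toCharArray(), index, start);
    }

    public static void main(String[] args) {
        // Scenario 1: region [0, 2) ends between the two surrogates.
        System.out.printf("unbounded forward:  U+%X%n", INPUT.codePointAt(1));
        System.out.printf("bounded forward:    U+%X%n", readForward(INPUT, 1, 2));

        // Scenario 2: region [2, 4) starts between the two surrogates.
        System.out.printf("unbounded backward: U+%X%n", INPUT.codePointBefore(3));
        System.out.printf("bounded backward:   U+%X%n", readBackward(INPUT, 3, 2));

        // A greedy match restricted to a region that cuts the pair in half.
        // Per the description above, a fixed engine should stay inside [0, 2).
        Matcher m = Pattern.compile(".+").matcher(INPUT).region(0, 2);
        if (m.find()) {
            System.out.println("match spans [" + m.start() + ", " + m.end() + ")");
        }
    }
}
```

The unbounded reads return the full supplementary code point U+1F600 even though half of it lies outside the region, while the bounded reads return only the lone surrogate that is inside it (U+D83D going forward, U+DE00 going backward).
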
-- 
With kind regards,
Ivan Gerasimov


