RFR 8237599 : Greedy matching against supplementary chars fails to respect the region
Ivan Gerasimov
ivan.gerasimov at oracle.com
Wed Mar 25 15:47:57 UTC 2020
Thank you, Roger, for the review!
On 3/25/20 6:56 AM, Roger Riggs wrote:
> Hi Ivan,
>
> Looks fine.
>
> Interesting edge case; it would never be seen with 8-bit charsets.
>
> Thanks, Roger
>
>
> On 3/21/20 3:15 AM, Ivan Gerasimov wrote:
>> Gentle ping.
>>
>> The webrev was rebased to accommodate recent changes in RegExTest.java.
>>
>> The fix handles an edge case that is presumably not too common.
>>
>> Nevertheless, I think it is important to handle it correctly.
>>
>> Thanks in advance!
>>
>> Ivan
>>
>>
>> On 1/22/20 8:23 PM, Ivan Gerasimov wrote:
>>> Hello everyone!
>>>
>>> When the input of a j.u.regex.Matcher is restricted with the
>>> .region() method, a region boundary can fall in the middle of a
>>> surrogate pair, cutting off one of its halves.
>>>
>>> It turns out that greedy matching implemented in the
>>> Pattern.CharPropertyGreedy class fails to recognize this edge case
>>> in two scenarios:
>>>
>>> 1) When it greedily consumes the input and meets the high surrogate
>>> of a pair whose low surrogate was cut off at the end of the region, and
>>>
>>> 2) When it backs off and meets the low surrogate of a pair whose
>>> high surrogate was cut off at the very beginning of the region.
>>>
>>> In both cases, the engine reads the entire code point, crossing the
>>> boundaries of the set region.
>>>
>>> Instead, it should only read the half of the surrogate pair that
>>> lies inside the region and ignore the other half.
>>>
>>> Would you please help review the fix?
>>>
>>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8237599
>>> WEBREV: http://cr.openjdk.java.net/~igerasim/8237599/00/webrev/
>>>
>>> Thanks in advance!
>>>
>
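
For readers of the archive, the two scenarios from the original message can be reproduced with a small sketch like the one below. The class name, method names, and the choice of U+10437 as the sample character are illustrative assumptions; only Pattern, Matcher, and region() come from the JDK itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class RegionSurrogateDemo {
    // One supplementary character (U+10437) stored as a surrogate pair:
    // index 0 holds the high surrogate, index 1 holds the low surrogate.
    static final String INPUT = "\ud801\udc37";

    // Scenario 1: the region [0, 1) ends between the two halves, so a
    // greedy ".+" meets a high surrogate whose partner is outside the region.
    static String greedyAtRegionEnd() {
        Matcher m = Pattern.compile(".+").matcher(INPUT);
        m.region(0, 1);
        return m.find() ? m.group() : null;
    }

    // Scenario 2: the region [1, 2) starts between the two halves. The
    // greedy ".*" first consumes the low surrogate, then has to back off
    // by one position so that the trailing "(.)" can match.
    static String backOffAtRegionStart() {
        Matcher m = Pattern.compile(".*(.)").matcher(INPUT);
        m.region(1, 2);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Compare each result with the single char that lies inside the region.
        System.out.println(INPUT.substring(0, 1).equals(greedyAtRegionEnd()));
        System.out.println(INPUT.substring(1, 2).equals(backOffAtRegionStart()));
    }
}
```

On a JDK that includes the fix, each method should return only the half of the pair that lies inside the region. On a JDK without it, the results may differ, since the engine reads across the region boundary.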
--
With kind regards,
Ivan Gerasimov