RFR 8237599 : Greedy matching against supplementary chars fails to respect the region
Roger Riggs
Roger.Riggs at oracle.com
Wed Mar 25 13:56:59 UTC 2020
Hi Ivan,
Looks fine.
Interesting edge case; it would never be seen with 8-bit charsets.
Thanks, Roger
On 3/21/20 3:15 AM, Ivan Gerasimov wrote:
> Gentle ping.
>
> The webrev was rebased to accommodate recent changes in RegExTest.java.
>
> The fix handles an edge case that is probably not too common.
>
> Nevertheless, I think it is important to handle it correctly.
>
> Thanks in advance!
>
> Ivan
>
>
> On 1/22/20 8:23 PM, Ivan Gerasimov wrote:
>> Hello everyone!
>>
>> When the input of a j.u.regex.Matcher is restricted with the .region()
>> method, a region boundary can cut a surrogate pair in half.
>>
>> It turns out that greedy matching implemented in the
>> Pattern.CharPropertyGreedy class fails to recognize this edge case in
>> two scenarios:
>>
>> 1) When it greedily consumes the input and meets the high (leading)
>> half of a surrogate pair that was cut off at the end of the region, and
>>
>> 2) When it backs off and meets the low (trailing) half of a surrogate
>> pair at the very beginning of the region.
>>
>> In both cases, the engine reads the entire codepoint, crossing the
>> boundaries of the set region.
>>
>> Instead, it should only read the half of the surrogate pair that lies
>> inside the region and ignore the other half.
>>
>> Would you please help review the fix?
>>
>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8237599
>> WEBREV: http://cr.openjdk.java.net/~igerasim/8237599/00/webrev/
>>
>> Thanks in advance!
>>
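
The two scenarios described in the original message can be illustrated
outside the regex engine. The sketch below is not the code from the
webrev; the class RegionBoundedRead and its methods readForward and
readBackward are hypothetical names, used only to contrast an unbounded
code point read (Character.codePointAt / Character.codePointBefore on a
CharSequence) with a read that stops at the region boundary, assuming a
region given as [regionStart, regionEnd).

```java
public class RegionBoundedRead {

    // Hypothetical helper: read the code point that starts at index,
    // without ever looking at or beyond regionEnd.
    public static int readForward(CharSequence seq, int index, int regionEnd) {
        char hi = seq.charAt(index);
        if (Character.isHighSurrogate(hi) && index + 1 < regionEnd) {
            char lo = seq.charAt(index + 1);
            if (Character.isLowSurrogate(lo)) {
                return Character.toCodePoint(hi, lo);
            }
        }
        // Either not a surrogate, or the other half lies outside the region.
        return hi;
    }

    // Hypothetical helper: read the code point that ends just before index,
    // without ever looking before regionStart.
    public static int readBackward(CharSequence seq, int index, int regionStart) {
        char lo = seq.charAt(index - 1);
        if (Character.isLowSurrogate(lo) && index - 2 >= regionStart) {
            char hi = seq.charAt(index - 2);
            if (Character.isHighSurrogate(hi)) {
                return Character.toCodePoint(hi, lo);
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 encoded as the pair D83D DE00, 'b'
        String s = "a\uD83D\uDE00b";

        // Scenario 1: region [0, 2) ends between the two surrogates.
        System.out.printf("%X%n", Character.codePointAt(s, 1)); // 1F600, crosses the boundary
        System.out.printf("%X%n", readForward(s, 1, 2));        // D83D, stays inside

        // Scenario 2: region [2, 4) starts between the two surrogates.
        System.out.printf("%X%n", Character.codePointBefore(s, 3)); // 1F600, crosses the boundary
        System.out.printf("%X%n", readBackward(s, 3, 2));           // DE00, stays inside
    }
}
```

When the whole pair lies inside the region, both helpers return the
full supplementary code point, so only the cut-off case changes.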