RFR 8237599 : Greedy matching against supplementary chars fails to respect the region
Ivan Gerasimov
ivan.gerasimov at oracle.com
Sat Mar 21 07:15:14 UTC 2020
Gentle ping.
The webrev was rebased to accommodate recent changes in RegExTest.java.
The fix handles an edge case that is presumably not very common.
Nevertheless, I think it is important to handle it correctly.
Thanks in advance!
Ivan
On 1/22/20 8:23 PM, Ivan Gerasimov wrote:
> Hello everyone!
>
> When the input of a j.u.regex.Matcher is restricted with the .region()
> method, a region boundary can cut a surrogate pair in half.
>
> It turns out that greedy matching implemented in the
> Pattern.CharPropertyGreedy class fails to recognize this edge case in
> two scenarios:
>
> 1) When it greedily consumes the input and meets the high surrogate of
> a pair that was cut off at the end of the region, and
>
> 2) When it backs off and meets the low surrogate of a pair at the
> very beginning of the region.
>
> In both cases, the engine reads the entire code point, crossing the
> boundaries of the set region.
>
> Instead, it should only read the half of the surrogate pair that lies
> inside the region and ignore the other half.
>
> Would you please help review the fix?
>
> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8237599
> WEBREV: http://cr.openjdk.java.net/~igerasim/8237599/00/webrev/
>
> Thanks in advance!
>
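To make the two scenarios above concrete, here is a minimal sketch of the difference between reading a whole code point and reading only the half that lies inside the region. It is not the code from the webrev: the class name RegionSurrogateDemo, the helper names, and the sample input are made up for illustration, and the bounded reads use the standard Character.codePointAt(char[], int, int) and Character.codePointBefore(char[], int, int) overloads as a stand-in for whatever the engine does internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegionSurrogateDemo {
    // Hypothetical sample input: 'a', then U+1F600 as a surrogate pair
    // (high \uD83D at index 1, low \uDE00 at index 2), then 'b'.
    static final String INPUT = "a\uD83D\uDE00b";

    // Unbounded forward read: combines both halves of the pair,
    // even if the second half lies outside the region.
    static int unboundedRead(String s, int index) {
        return s.codePointAt(index);
    }

    // Forward read limited to regionEnd (exclusive): if the pair is cut
    // off by the limit, only the high surrogate is returned.
    static int boundedRead(String s, int index, int regionEnd) {
        return Character.codePointAt(s.toCharArray(), index, regionEnd);
    }

    // Unbounded backward read: combines both halves of the pair,
    // even if the first half lies before the region.
    static int unboundedReadBefore(String s, int index) {
        return s.codePointBefore(index);
    }

    // Backward read limited to regionStart (inclusive): if the pair is cut
    // off by the limit, only the low surrogate is returned.
    static int boundedReadBefore(String s, int index, int regionStart) {
        return Character.codePointBefore(s.toCharArray(), index, regionStart);
    }

    // True if the given boundary index falls between the two halves of a pair.
    static boolean splitsPair(String s, int boundary) {
        return boundary > 0 && boundary < s.length()
                && Character.isHighSurrogate(s.charAt(boundary - 1))
                && Character.isLowSurrogate(s.charAt(boundary));
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("x").matcher(INPUT);

        // Scenario 1: the region [0, 2) ends right after the high surrogate.
        m.region(0, 2);
        System.out.println("end splits pair:   " + splitsPair(INPUT, m.regionEnd()));
        System.out.println("unbounded forward: "
                + Integer.toHexString(unboundedRead(INPUT, 1)));
        System.out.println("bounded forward:   "
                + Integer.toHexString(boundedRead(INPUT, 1, m.regionEnd())));

        // Scenario 2: the region [2, 4) starts at the low surrogate.
        m.region(2, 4);
        System.out.println("start splits pair: " + splitsPair(INPUT, m.regionStart()));
        System.out.println("unbounded backward: "
                + Integer.toHexString(unboundedReadBefore(INPUT, 3)));
        System.out.println("bounded backward:   "
                + Integer.toHexString(boundedReadBefore(INPUT, 3, m.regionStart())));
    }
}
```

The unbounded reads return the full supplementary code point in both scenarios, which is the boundary-crossing behaviour described above; the bounded reads return only the surrogate that lies inside the region.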
--
With kind regards,
Ivan Gerasimov