RFR 8237599 : Greedy matching against supplementary chars fails to respect the region

Ivan Gerasimov ivan.gerasimov at oracle.com
Sat Mar 21 07:15:14 UTC 2020


Gentle ping.

The webrev was rebased to accommodate recent changes in RegExTest.java.

The fix handles an edge case that is presumably not very common.

Nevertheless, I think it is important to handle it correctly.

Thanks in advance!

Ivan


On 1/22/20 8:23 PM, Ivan Gerasimov wrote:
> Hello everyone!
>
> When the input of a j.u.regex.Matcher is restricted with the .region() 
> method, a region boundary can fall in the middle of a surrogate pair.
>
> It turns out that greedy matching implemented in the 
> Pattern.CharPropertyGreedy class fails to recognize this edge case in 
> two scenarios:
>
> 1) When it greedily consumes the input and meets a high surrogate 
> whose low surrogate was cut off by the end of the region, and
>
> 2) When it backs off and meets a low surrogate at the very beginning 
> of the region, with its high surrogate lying just before the region.
>
> In both cases, the engine reads the entire codepoint, crossing the 
> boundaries of the set region.
>
> Instead, it should only read the half of the surrogate pair that lies 
> inside the region and ignore the other half.
>
> Would you please help review the fix?
>
> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8237599
> WEBREV: http://cr.openjdk.java.net/~igerasim/8237599/00/webrev/
>
> Thanks in advance!
>
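
To make the two scenarios above concrete, here is a minimal sketch of
bounded code point reads. It is not the code from the webrev: the class
and the helper names codePointAtBounded and codePointBeforeBounded are
made up for illustration, and the pattern in main is only assumed to
take the greedy path discussed here. The result of the Matcher call
depends on whether the JDK in use contains the fix, so no output is
claimed for it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionSurrogateSketch {

    // Hypothetical helper: read the code point starting at index,
    // never looking at or past the region end 'to'.
    static int codePointAtBounded(CharSequence seq, int index, int to) {
        char hi = seq.charAt(index);
        if (Character.isHighSurrogate(hi) && index + 1 < to) {
            char lo = seq.charAt(index + 1);
            if (Character.isLowSurrogate(lo)) {
                return Character.toCodePoint(hi, lo);
            }
        }
        // Either a BMP char or a surrogate whose other half is outside.
        return hi;
    }

    // Hypothetical helper: read the code point ending just before index,
    // never looking before the region start 'from'.
    static int codePointBeforeBounded(CharSequence seq, int index, int from) {
        char lo = seq.charAt(index - 1);
        if (Character.isLowSurrogate(lo) && index - 2 >= from) {
            char hi = seq.charAt(index - 2);
            if (Character.isHighSurrogate(hi)) {
                return Character.toCodePoint(hi, lo);
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // 'a', high surrogate, low surrogate, 'b' (U+1F600 in the middle)
        String input = "a\uD83D\uDE00b";

        // Scenario 1: the region [0, 2) ends right after the high surrogate.
        System.out.printf("forward read:  U+%X%n",
                codePointAtBounded(input, 1, 2));

        // Scenario 2: the region [2, 4) starts at the low surrogate.
        System.out.printf("backward read: U+%X%n",
                codePointBeforeBounded(input, 3, 2));

        // The same cut expressed through the Matcher API.
        Matcher m = Pattern.compile(".+").matcher(input).region(0, 2);
        if (m.find()) {
            System.out.println("match end = " + m.end() + ", region end = 2");
        }
    }
}
```

With a region that cuts the pair, both helpers return only the lone
surrogate that lies inside the region; with a region that covers the
whole pair, they return the full supplementary code point.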
-- 
With kind regards,
Ivan Gerasimov


