RFR 8237599 : Greedy matching against supplementary chars fails to respect the region

Ivan Gerasimov ivan.gerasimov at oracle.com
Thu Jan 23 04:23:41 UTC 2020


Hello everyone!

When the input of a j.u.regex.Matcher is restricted with the .region() 
method, a region boundary can fall in the middle of a surrogate pair, 
leaving only one half of the pair inside the region.
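
To illustrate, here is a minimal sketch of a region that ends inside a 
surrogate pair. The class name, the sample input and the helper method 
are made up for this illustration; only Pattern, Matcher and Character 
are real API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitSurrogateRegion {
    // Sample input (an assumption for this sketch): "a", then U+1F600
    // encoded as a high/low surrogate pair, then "b".
    static final String INPUT = "a\uD83D\uDE00b";

    // Hypothetical helper: returns a matcher whose region [0, 2) ends
    // between the two halves of the surrogate pair.
    static Matcher regionEndingInsidePair() {
        Matcher m = Pattern.compile(".*").matcher(INPUT);
        m.region(0, 2);
        return m;
    }

    public static void main(String[] args) {
        Matcher m = regionEndingInsidePair();
        int end = m.regionEnd();
        // The last char inside the region is the high surrogate ...
        System.out.println(Character.isHighSurrogate(INPUT.charAt(end - 1)));
        // ... and the first char outside the region is its low surrogate.
        System.out.println(Character.isLowSurrogate(INPUT.charAt(end)));
    }
}
```

With this setup, only the high surrogate is part of the region, so the 
matcher is not supposed to look at the low surrogate at index 2.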

It turns out that greedy matching implemented in the 
Pattern.CharPropertyGreedy class fails to recognize this edge case in 
two scenarios:

1) When it greedily consumes the input and meets a high surrogate whose 
low surrogate was cut off by the end of the region, and

2) When it backs off and meets a low surrogate whose high surrogate was 
cut off by the start of the region.

In both cases, the engine reads the entire code point, crossing the 
boundary of the region that was set.

Instead, it should read only the half of the surrogate pair that lies 
inside the region and ignore the other half.
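
As a rough sketch of the intended behaviour (this is not the code from 
the webrev), reading a code point while respecting the region bounds 
could look like the following. The class and both helper methods are 
hypothetical names introduced for this example; `from` and `to` stand 
for the start and the end of the region.

```java
class RegionCodePoints {
    // Hypothetical helper: reads the code point starting at index i,
    // but never looks at chars at or beyond 'to' (the region end).
    static int codePointAt(CharSequence seq, int i, int to) {
        char hi = seq.charAt(i);
        if (Character.isHighSurrogate(hi) && i + 1 < to) {
            char lo = seq.charAt(i + 1);
            if (Character.isLowSurrogate(lo)) {
                return Character.toCodePoint(hi, lo);
            }
        }
        // Either a BMP char or a lone half of a pair cut by the region.
        return hi;
    }

    // Hypothetical helper: reads the code point ending just before
    // index i, but never looks at chars before 'from' (the region start).
    static int codePointBefore(CharSequence seq, int i, int from) {
        char lo = seq.charAt(i - 1);
        if (Character.isLowSurrogate(lo) && i - 2 >= from) {
            char hi = seq.charAt(i - 2);
            if (Character.isHighSurrogate(hi)) {
                return Character.toCodePoint(hi, lo);
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // "a", then U+1F600 as a surrogate pair, then "b"
        String s = "a\uD83D\uDE00b";
        // Region end cuts the pair: only the high surrogate is read.
        System.out.println(Integer.toHexString(codePointAt(s, 1, 2)));     // d83d
        // Region end is past the pair: the full code point is read.
        System.out.println(Integer.toHexString(codePointAt(s, 1, 3)));     // 1f600
        // Region start cuts the pair: only the low surrogate is read.
        System.out.println(Integer.toHexString(codePointBefore(s, 3, 2))); // de00
    }
}
```

The point of the sketch is only that the bound is checked before the 
second half of the pair is touched, in both the forward and the backward 
direction.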

Would you please help review the fix?

BUGURL: https://bugs.openjdk.java.net/browse/JDK-8237599
WEBREV: http://cr.openjdk.java.net/~igerasim/8237599/00/webrev/

Thanks in advance!

-- 
With kind regards,
Ivan Gerasimov


