RFR 8237599: Greedy matching against supplementary chars fails to respect the region
Hello everyone!
When the input of a j.u.regex.Matcher is restricted with the .region() method, the region boundary can fall in the middle of a surrogate pair and cut it in half.
It turns out that greedy matching, as implemented in the Pattern.CharPropertyGreedy class, fails to recognize this edge case in two scenarios:
1) When it greedily consumes the input and meets the high surrogate of a pair that was cut off at the end of the region, and
2) When it backs off and meets the low surrogate of a pair at the very beginning of the region.
In both cases, the engine reads the entire code point, crossing the boundaries of the set region.
Instead, it should read only the half of the surrogate pair that lies inside the region and ignore the other half.
Would you please help review the fix?
BUGURL: https://bugs.openjdk.java.net/browse/JDK-8237599
WEBREV: http://cr.openjdk.java.net/~igerasim/8237599/00/webrev/
Thanks in advance!
-- With kind regards, Ivan Gerasimov
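To make the two scenarios concrete, here is a minimal sketch of what "read only the half that lies inside the region" could look like. It is not the code from the webrev: the class RegionSurrogateDemo and the helpers codePointAtBounded and codePointBeforeBounded are hypothetical names used only for illustration, built on standard java.lang.Character methods. The last lines run a greedy pattern against a region that cuts a pair; the reported match end is printed rather than assumed, since it may differ between JDKs with and without the fix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionSurrogateDemo {

    // Hypothetical helper: code point starting at index, never reading at or past limit.
    static int codePointAtBounded(CharSequence seq, int index, int limit) {
        char c1 = seq.charAt(index);
        if (Character.isHighSurrogate(c1) && index + 1 < limit) {
            char c2 = seq.charAt(index + 1);
            if (Character.isLowSurrogate(c2)) {
                return Character.toCodePoint(c1, c2);
            }
        }
        return c1; // ordinary char, or a high surrogate whose partner is out of bounds
    }

    // Hypothetical helper: code point ending just before index, never reading before start.
    static int codePointBeforeBounded(CharSequence seq, int index, int start) {
        char c2 = seq.charAt(index - 1);
        if (Character.isLowSurrogate(c2) && index - 2 >= start) {
            char c1 = seq.charAt(index - 2);
            if (Character.isHighSurrogate(c1)) {
                return Character.toCodePoint(c1, c2);
            }
        }
        return c2; // ordinary char, or a low surrogate whose partner is out of bounds
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00"; // 'a' followed by U+1F600 encoded as a surrogate pair

        System.out.printf("%X%n", codePointAtBounded(s, 1, 3));     // whole pair inside the bounds
        System.out.printf("%X%n", codePointAtBounded(s, 1, 2));     // scenario 1: pair cut at the end
        System.out.printf("%X%n", codePointBeforeBounded(s, 3, 2)); // scenario 2: pair cut at the start

        // Region [0, 2) keeps the high surrogate and cuts off the low one.
        Matcher m = Pattern.compile(".*").matcher(s).region(0, 2);
        if (m.lookingAt()) {
            System.out.println("regionEnd=" + m.regionEnd() + " matchEnd=" + m.end());
        }
    }
}
```

A match that respects the region should never report a match end greater than regionEnd.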
Gentle ping.
The webrev was rebased to accommodate recent changes in RegExTest.java.
The fix handles an edge case that is presumably not too common.
Nevertheless, I think it is important to handle it correctly.
Thanks in advance!
Ivan
On 1/22/20 8:23 PM, Ivan Gerasimov wrote:
[...]
Hi Ivan,
Looks fine.
Interesting edge case; it would never be seen with 8-bit charsets.
Thanks, Roger
On 3/21/20 3:15 AM, Ivan Gerasimov wrote:
[...]
Thank you, Roger, for the review!
On 3/25/20 6:56 AM, Roger Riggs wrote:
[...]
participants (2)
- Ivan Gerasimov
- Roger Riggs