RFR 8235812 : (regex) Unicode linebreak with quantifier does not match valid input
Roger Riggs
Roger.Riggs at oracle.com
Mon Feb 10 21:11:40 UTC 2020
Hi Ivan,
This looks fine.
In the test RegExTest: 5074, I would output the failed cases to System.err.
That way they get properly interleaved with the test progress output.
No need for another review.
Thanks, Roger
On 2/5/20 8:22 PM, Ivan Gerasimov wrote:
> Hello!
>
> j.u.regex.Pattern supports a special char class \R, which is specified
> to be equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029].
>
> In particular, this means that the input "\r\n" must match both
> patterns \R and \R\R.
>
> (In the latter case, the first \R matches \r and the second \R matches \n.)
>
> The pattern \R{2} is expected to be equivalent to \R\R.
>
> However, with the current implementation this does not hold:
> Pattern.matches("\\R{2}", "\r\n") == false, while
> Pattern.matches("\\R\\R", "\r\n") == true.
>
> The root cause of this bug is that the special char class \R is
> handled via a dedicated class, LineEnding, which is not able to correctly
> handle backtracking in the presence of quantifiers.
>
> A simple solution is to treat \R with quantifiers as an anonymous
> group, which will make it comply with the specification.
>
> Without quantifiers, \R is still handled via the more efficient
> implementation in LineEnding.
>
> Would you please help review the fix?
>
> Some minor cleanup was done along the way in the affected code.
>
> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8235812
> WEBREV: http://cr.openjdk.java.net/~igerasim/8235812/00/webrev/
>
> Control build and testing (tiers1-4) are all green.
>
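
The behavior described in the quoted message can be checked with a small standalone program. This is only a sketch: the class name LineBreakQuantifierDemo and the helper matches() are made up for illustration, and the EXPANDED_R constant is a hand-written non-capturing group built from the expansion of \R quoted above, not the code from the webrev. The result printed for \R{2} depends on whether the JDK in use contains the fix, so no expected value is given for it.

```java
import java.util.regex.Pattern;

public class LineBreakQuantifierDemo {

    // Hand-written expansion of the linebreak matcher, wrapped in a
    // non-capturing group so that a quantifier applies to the whole
    // alternation. Illustrative only; not taken from the actual fix.
    static final String EXPANDED_R =
        "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // Thin wrapper around Pattern.matches (whole-input match).
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String crlf = "\r\n";

        // A single linebreak matcher consumes CR LF as one unit.
        System.out.println("\\R       -> " + matches("\\R", crlf));

        // Two matchers in sequence: the first backtracks to CR alone,
        // the second then matches LF.
        System.out.println("\\R\\R     -> " + matches("\\R\\R", crlf));

        // The quantified form from the bug report. The result depends on
        // whether the running JDK includes the fix for JDK-8235812.
        System.out.println("\\R{2}    -> " + matches("\\R{2}", crlf));

        // The explicit alternation with a quantifier backtracks like \R\R.
        System.out.println("expanded -> " + matches(EXPANDED_R + "{2}", crlf));
    }
}
```

Comparing the third line of output against the second shows whether a given JDK still has the discrepancy between \R{2} and \R\R.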