RFR 8235812 : (regex) Unicode linebreak with quantifier does not match valid input
Ivan Gerasimov
ivan.gerasimov at oracle.com
Thu Feb 6 01:22:02 UTC 2020
Hello!
j.u.regex.Pattern supports a special char class \R, which is specified
to be equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029].
In particular, this means that the input "\r\n" must match both
patterns \R and \R\R.
(In the latter case, the first \R matches \r and the second \R matches \n.)
A pattern \R{2} is expected to be equivalent to \R\R.
However, with the current implementation this does not hold:
Pattern.matches("\\R{2}", "\r\n") == false, while
Pattern.matches("\\R\\R", "\r\n") == true.
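
For reviewers who want to reproduce this locally, here is a minimal sketch.
The class and method names are made up for illustration. The result of the
quantified form depends on whether the JDK in use already contains the fix,
so no expected value is shown for it.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the webrev.
public class LineBreakQuantifierDemo {
    static final String CRLF = "\r\n";

    // A single linebreak matcher consumes the CR LF pair as one unit.
    static boolean single() {
        return Pattern.matches("\\R", CRLF);
    }

    // Two linebreak matchers in a row: the first takes CR, the second LF.
    static boolean twice() {
        return Pattern.matches("\\R\\R", CRLF);
    }

    // Quantified form; should agree with twice() per the specification,
    // but returns false on JDKs affected by JDK-8235812.
    static boolean quantified() {
        return Pattern.matches("\\R{2}", CRLF);
    }

    public static void main(String[] args) {
        System.out.println("\\R    matches CRLF: " + single());
        System.out.println("\\R\\R  matches CRLF: " + twice());
        System.out.println("\\R{2} matches CRLF: " + quantified());
    }
}
```

If the last line prints a different value than the second one, the JDK
under test is affected.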
The root cause of this bug is that the special char class \R is handled
via the dedicated class LineEnding, which is not able to correctly handle
backtracking in the presence of quantifiers.
A simple solution is to treat \R with quantifiers as an anonymous group,
which makes it comply with the specification.
Without quantifiers, \R is still handled via the more efficient
LineEnding implementation.
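
To illustrate why a group backtracks correctly here, the sketch below spells
out the specified expansion of \R by hand and wraps it in a non-capturing
group before applying the quantifier. This only demonstrates the equivalence
at the pattern level; it is not the code from the webrev, and the class and
method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; shows the spec-level expansion, not the patch.
public class ExpandedLineBreakDemo {
    // The alternation that the linebreak matcher is specified to equal.
    // The doubled backslashes keep the escapes for the regex engine.
    static final String EXPANDED =
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // Applies an exact repetition count to the expansion inside a
    // non-capturing group, so the engine can retry the second
    // alternative when the CR LF branch leaves nothing to match.
    static boolean matchesExpanded(int count, String input) {
        String regex = "(?:" + EXPANDED + "){" + count + "}";
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesExpanded(1, "\r\n"));
        System.out.println(matchesExpanded(2, "\r\n"));
        System.out.println(matchesExpanded(3, "\r\n"));
    }
}
```

With a count of 2, the first iteration initially consumes the whole CR LF
pair, the second iteration fails, and the engine backtracks so that the
first iteration takes only CR. That is the step LineEnding cannot perform
under a quantifier.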
Would you please help review the fix?
Some minor cleanup was done along the way in the affected code.
BUGURL: https://bugs.openjdk.java.net/browse/JDK-8235812
WEBREV: http://cr.openjdk.java.net/~igerasim/8235812/00/webrev/
The control build and testing (tiers 1-4) are all green.
--
With kind regards,
Ivan Gerasimov