RFR 8235812 : (regex) Unicode linebreak with quantifier does not match valid input

Ivan Gerasimov ivan.gerasimov at oracle.com
Thu Feb 6 01:22:02 UTC 2020


j.u.regex.Pattern supports a special char class \R, which is specified 
to be equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029].

In particular, this means that the input "\r\n" must match both 
patterns \R and \R\R.

(In the latter case, the first \R matches \r and the second \R matches \n.)

The pattern \R{2} is expected to be equivalent to \R\R.

However, with the current implementation this does not hold: 
Pattern.matches("\\R{2}", "\r\n") == false, while 
Pattern.matches("\\R\\R", "\r\n") == true.
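
A minimal reproducer for the comparison above might look like the 
following sketch. The class name LineBreakRepro and the helper check() 
are made up for illustration; only Pattern.matches is the real API. The 
last line is expected to print false on an affected JDK and true once 
the fix is in.

```java
import java.util.regex.Pattern;

public class LineBreakRepro {
    // Thin wrapper so that each case below reads the same way.
    static boolean check(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String crlf = "\r\n";
        // A single \R consumes the CR LF pair as one line break.
        System.out.println("\\R     -> " + check("\\R", crlf));
        // Two consecutive \R: one for CR, one for LF.
        System.out.println("\\R\\R   -> " + check("\\R\\R", crlf));
        // Quantified form; the result depends on whether the JDK has the fix.
        System.out.println("\\R{2}  -> " + check("\\R{2}", crlf));
    }
}
```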

The root cause of this bug is that the special char class \R is handled 
via a dedicated class, LineEnding, which is not able to correctly handle 
backtracking in the presence of quantifiers.

A simple solution is to treat \R with quantifiers as an anonymous group, 
which will make it comply with the specification.
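
To illustrate why a group behaves correctly here, the sketch below 
spells out the specified expansion of \R as an ordinary non-capturing 
group and applies the quantifier to that. This is only a user-level 
analogy, not the code from the webrev; the class name 
LineBreakExpansionDemo and the method matchesRepeated() are assumptions 
made for the example.

```java
import java.util.regex.Pattern;

public class LineBreakExpansionDemo {
    // The expansion of the line-break class as given in the specification,
    // wrapped in a non-capturing group so that a quantifier applies to
    // the whole alternation.
    static final String EXPANDED =
        "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // Matches the whole input against exactly 'count' repetitions of the group.
    static boolean matchesRepeated(String input, int count) {
        return Pattern.matches(EXPANDED + "{" + count + "}", input);
    }

    public static void main(String[] args) {
        String crlf = "\r\n";
        // One repetition: the first alternative takes CR LF as a unit.
        System.out.println(matchesRepeated(crlf, 1)); // true
        // Two repetitions: the engine backtracks into the group, lets the
        // first repetition take only CR, and the second one takes LF.
        System.out.println(matchesRepeated(crlf, 2)); // true
        // Three repetitions cannot be satisfied by two characters.
        System.out.println(matchesRepeated(crlf, 3)); // false
    }
}
```

Because the alternation is a regular group, the engine can retry the 
second alternative after the greedy CR LF choice fails, which is the 
backtracking step that the quantified \R was missing.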

Without quantifiers, \R is still handled via the more efficient 
implementation in LineEnding.

Would you please help review the fix?

Some minor cleanup was done along the way in the affected code.

BUGURL: https://bugs.openjdk.java.net/browse/JDK-8235812
WEBREV: http://cr.openjdk.java.net/~igerasim/8235812/00/webrev/

Control build and testing (tiers1-4) are all green.

With kind regards,
Ivan Gerasimov
