RFR 8235812 : (regex) Unicode linebreak with quantifier does not match valid input

Roger Riggs Roger.Riggs at oracle.com
Mon Feb 10 21:11:40 UTC 2020

Hi Ivan,

This looks fine.

In the test RegExTest: 5074, I would output the failed cases to System.err.
That way they get properly interleaved with the test progress output.
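
A minimal sketch of that kind of reporting, for illustration only. The class
and the check() helper are hypothetical names and are not taken from RegExTest
or from the webrev; the sketch assumes the test progress output is also
written to System.err, so that a failure message lands next to the case that
produced it.

```java
import java.util.regex.Pattern;

public class FailureReportSketch {
    // Hypothetical helper: compares the actual match result with the
    // expected one and reports a mismatch on System.err.
    static boolean check(String regex, String input, boolean expected) {
        boolean actual = Pattern.matches(regex, input);
        if (actual != expected) {
            System.err.println("FAILED: regex=" + regex
                    + " expected=" + expected + " actual=" + actual);
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        boolean ok = true;
        ok &= check("a+", "aaa", true);
        ok &= check("a+", "b", false);
        if (!ok) {
            throw new RuntimeException("Some cases failed, see System.err");
        }
    }
}
```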

No need for another review.

Thanks, Roger

On 2/5/20 8:22 PM, Ivan Gerasimov wrote:
> Hello!
> j.u.regex.Pattern supports a special char class \R, which is specified 
> to be equal to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029].
> In particular, this means that the input "\r\n" must match both 
> patterns \R and \R\R.
> (In the latter case, the first \R matches \r and the second \R matches \n.)
> A pattern \R{2} is expected to be equivalent to \R\R.
> However, with the current implementation this does not hold: 
> Pattern.matches("\\R{2}", "\r\n") == false, while 
> Pattern.matches("\\R\\R", "\r\n") == true.
> The root cause of this bug is that the special char class \R is 
> handled via a dedicated class, LineEnding, which is not able to correctly 
> handle backtracking in the presence of quantifiers.
> A simple solution is to treat \R with quantifiers as an anonymous 
> group, which will make it comply with the specification.
> Without quantifiers, \R is still handled via the more efficient 
> LineEnding implementation.
> Would you please help review the fix?
> Some minor cleanup was done along the way in the affected code.
> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8235812
> WEBREV: http://cr.openjdk.java.net/~igerasim/8235812/00/webrev/
> Control build and testing (tiers1-4) are all green.
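
The equivalence described in the quoted message can be checked with a short
standalone program. This is only a sketch: the class and method names are
made up for illustration, and EXPANDED is the alternation quoted above from
the specification, wrapped by hand in a non-capturing group. It is not the
code from the webrev. The quantified group must backtrack out of the
two-character alternative so that each of the two iterations consumes one
character of "\r\n".

```java
import java.util.regex.Pattern;

public class LineBreakQuantifierDemo {
    // The expansion of the linebreak matcher quoted from the Pattern
    // specification, wrapped in a non-capturing group so that a
    // quantifier can be applied to it.
    static final String EXPANDED =
            "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // Matches exactly two linebreaks using the hand-expanded group.
    static boolean matchesTwoExpanded(String input) {
        return Pattern.matches(EXPANDED + "{2}", input);
    }

    // Matches exactly two linebreaks using the built-in matcher with a
    // quantifier. The result for CR LF depends on whether the running
    // JDK contains the fix for 8235812.
    static boolean matchesTwoBuiltin(String input) {
        return Pattern.matches("\\R{2}", input);
    }

    public static void main(String[] args) {
        String crlf = "\r\n";
        System.out.println("expanded group {2}: " + matchesTwoExpanded(crlf));
        System.out.println("built-in {2}:       " + matchesTwoBuiltin(crlf));
    }
}
```

On a JDK that includes the fix, both lines are expected to report the same
result. If they differ, the built-in matcher is not backtracking as the
specification requires.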
