RFR 8235812 : (regex) Unicode linebreak with quantifier does not match valid input

Roger Riggs Roger.Riggs at oracle.com
Mon Feb 10 21:11:40 UTC 2020


Hi Ivan,

This looks fine.

In the test RegExTest: 5074, I would output the failed cases to System.err.
That way they get properly interleaved with the test progress output.

No need for another review.

Thanks, Roger



On 2/5/20 8:22 PM, Ivan Gerasimov wrote:
> Hello!
>
> j.u.regex.Pattern supports a special linebreak matcher \R, which is
> specified to be equivalent to
> \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029].
>
> In particular, this means that the input "\r\n" must match both
> patterns \R and \R\R.
>
> (In the latter case, the first \R matches \r and the second \R matches \n.)
>
> The pattern \R{2} is expected to be equivalent to \R\R.
>
> However, with the current implementation this does not hold:
> Pattern.matches("\\R{2}", "\r\n") == false, while
> Pattern.matches("\\R\\R", "\r\n") == true.
>
> The root cause of this bug is that \R is handled via the dedicated
> class LineEnding, which is not able to correctly handle backtracking
> in the presence of quantifiers.
>
> A simple solution is to treat \R with quantifiers as an anonymous 
> group, which will make it comply with the specification.
>
> Without quantifiers, \R is still handled via the more efficient
> LineEnding implementation.
>
> Would you please help review the fix?
>
> Some minor cleanup was done along the way in the affected code.
>
> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8235812
> WEBREV: http://cr.openjdk.java.net/~igerasim/8235812/00/webrev/
>
> Control build and testing (tiers 1-4) are all green.
>
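
For anyone who wants to reproduce the discrepancy described in the quoted
message, here is a minimal sketch. The class name, the EXPANDED_R constant
and the helper method are made up for illustration and are not part of the
webrev. The result printed for \R{2} depends on whether the JDK in use
contains the fix, so no output is asserted for it. The explicitly expanded
alternation from the specification should match in either case.

```java
import java.util.regex.Pattern;

class LineBreakQuantifierDemo {
    // \R written out as the alternation given in the specification,
    // wrapped in a non-capturing group so a quantifier can be applied.
    // (Hypothetical constant, for illustration only.)
    static final String EXPANDED_R =
        "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // True if the whole input consists of exactly two linebreaks
    // according to the expanded form of \R.
    static boolean matchesExpandedTwice(String input) {
        return Pattern.matches(EXPANDED_R + "{2}", input);
    }

    public static void main(String[] args) {
        String crlf = "\r\n";

        // A single \R consumes the CR LF pair as one linebreak.
        System.out.println("\\R        : " + Pattern.matches("\\R", crlf));

        // Two consecutive \R: the first matches \r, the second matches \n.
        System.out.println("\\R\\R      : " + Pattern.matches("\\R\\R", crlf));

        // Reported as false before the fix; expected to equal the line above.
        System.out.println("\\R{2}     : " + Pattern.matches("\\R{2}", crlf));

        // Reference result from the expanded alternation with {2}.
        System.out.println("expanded{2}: " + matchesExpandedTwice(crlf));
    }
}
```

If the third line prints a different value than the second, the JDK being
used still shows the behavior reported in JDK-8235812.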


