RFR 8235812 : (regex) Unicode linebreak with quantifier does not match valid input

Ivan Gerasimov ivan.gerasimov at oracle.com
Tue Feb 11 00:20:47 UTC 2020


Thank you, Roger, for the review!

I've adjusted the test as you suggested and pushed the fix.

With kind regards,
Ivan

On 2/10/20 1:11 PM, Roger Riggs wrote:
> Hi Ivan,
>
> This looks fine.
>
> In the test RegExTest: 5074, I would output the failed cases to 
> System.err.
> That way they get properly interleaved with the test progress output.
>
> No need for another review.
>
> Thanks, Roger
>
>
>
> On 2/5/20 8:22 PM, Ivan Gerasimov wrote:
>> Hello!
>>
>> j.u.regex.Pattern supports a special char class \R, which is 
>> specified to be equal to 
>> \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029].
>>
>> In particular, this means that the input "\r\n" must match both 
>> patterns \R and \R\R.
>>
>> (In the latter case, the first \R matches \r and the second \R matches \n.)
>>
>> A pattern \R{2} is expected to be equal to \R\R.
>>
>> However, with the current implementation this does not hold (so, 
>> Pattern.matches("\\R{2}", "\r\n") == false, while 
>> Pattern.matches("\\R\\R", "\r\n") == true).
>>
>> The root cause of this bug is that the special char class \R is 
>> handled via a dedicated class, LineEnding, which is not able to 
>> correctly handle backtracking in the presence of quantifiers.
>>
>> A simple solution is to treat \R with quantifiers as an anonymous 
>> group, which will make it comply with the specification.
>>
>> Without quantifiers, \R is still handled via the more efficient 
>> LineEnding implementation.
>>
>> Would you please help review the fix?
>>
>> Some minor cleanup was done along the way in the affected code.
>>
>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8235812
>> WEBREV: http://cr.openjdk.java.net/~igerasim/8235812/00/webrev/
>>
>> Control build and testing (tiers1-4) are all green.
>>
>
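
For readers of the archive, the equivalence discussed above can be checked
with a small standalone program. This is only a sketch and not the code from
the webrev: the class name LinebreakQuantifierSketch, the EXPANDED constant
and the matches() helper are made up for illustration. EXPANDED spells out
the specified expansion of \R as a non-capturing group, so a quantifier on it
can backtrack from "\r\n" to a lone "\r".

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the JDK or of the fix.
class LinebreakQuantifierSketch {

    // The expansion of \R quoted in the message, wrapped in a
    // non-capturing group so that a quantifier applies to all of it.
    static final String EXPANDED =
        "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // Thin helper around Pattern.matches (whole-input match).
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String crlf = "\r\n";

        // One \R consumes the whole CRLF pair.
        System.out.println("\\R        : " + matches("\\R", crlf));

        // Two \R in sequence: the first takes \r, the second takes \n.
        System.out.println("\\R\\R      : " + matches("\\R\\R", crlf));

        // The spelled-out expansion with a quantifier backtracks correctly.
        System.out.println("EXPANDED{2}: " + matches(EXPANDED + "{2}", crlf));

        // The case from the bug report. Per the message this was false
        // before the fix; the specification requires true.
        System.out.println("\\R{2}     : " + matches("\\R{2}", crlf));
    }
}
```

The first three results do not depend on the fix. Only the last line is
expected to differ between a JDK with the fix and one without it.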
-- 
With kind regards,
Ivan Gerasimov


