RFR 8235812 : (regex) Unicode linebreak with quantifier does not match valid input
Ivan Gerasimov
ivan.gerasimov at oracle.com
Tue Feb 11 00:20:47 UTC 2020
Thank you, Roger, for the review!
I've adjusted the test as you suggested and pushed the fix.
With kind regards,
Ivan
On 2/10/20 1:11 PM, Roger Riggs wrote:
> Hi Ivan,
>
> This looks fine.
>
> In the test RegExTest: 5074, I would output the failed cases to
> System.err.
> That way they get properly interleaved with the test progress output.
>
> No need for another review.
>
> Thanks, Roger
>
>
>
> On 2/5/20 8:22 PM, Ivan Gerasimov wrote:
>> Hello!
>>
>> j.u.regex.Pattern supports a special char class \R, which is
>> specified to be equal to
>> \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029].
>>
>> In particular, this means that the input "\r\n" must match both
>> patterns \R and \R\R.
>>
>> (In the latter case, the first \R matches \r and the second \R matches \n.)
>>
>> A pattern \R{2} is expected to be equal to \R\R.
>>
>> However, with the current implementation this does not hold (so,
>> Pattern.matches("\\R{2}", "\r\n") == false, while
>> Pattern.matches("\\R\\R", "\r\n") == true).
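A minimal sketch of the expectation described above, assuming only the expansion of \R quoted from the specification. The class name LinebreakSpecCheck and its helper methods are hypothetical and are not part of the fix or the test. The expanded form goes through ordinary alternation backtracking, so it shows what the spec requires for "\r\n"; the built-in \R lines are printed for comparison without a stated result, since the outcome for \R{2} depends on whether the JDK in use contains this fix.

```java
import java.util.regex.Pattern;

final class LinebreakSpecCheck {
    // Expansion of \R as quoted from the specification, wrapped in a
    // non-capturing group so that it can be concatenated safely.
    static final String EXPANDED =
            "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // True if the whole input is a single linebreak per the expansion.
    static boolean matchesOnce(String input) {
        return Pattern.matches(EXPANDED, input);
    }

    // True if the whole input is two consecutive linebreaks per the expansion.
    static boolean matchesTwice(String input) {
        return Pattern.matches(EXPANDED + EXPANDED, input);
    }

    public static void main(String[] args) {
        String crlf = "\r\n";
        // Both are expected to hold: CRLF is one linebreak, or CR followed by LF.
        System.out.println("expanded once:  " + matchesOnce(crlf));
        System.out.println("expanded twice: " + matchesTwice(crlf));
        // Built-in forms; the last result is JDK-version dependent.
        System.out.println("\\R:     " + Pattern.matches("\\R", crlf));
        System.out.println("\\R\\R:   " + Pattern.matches("\\R\\R", crlf));
        System.out.println("\\R{2}:  " + Pattern.matches("\\R{2}", crlf));
    }
}
```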
>>
>> The root cause of this bug is that the special char class \R is
>> handled via a dedicated class, LineEnding, which is not able to
>> correctly handle backtracking in the presence of quantifiers.
>>
>> A simple solution is to treat \R with quantifiers as an anonymous
>> group, which will make it comply with the specification.
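The same idea can be approximated at the user level on a JDK that does not yet contain the fix: spell out the specified expansion as a non-capturing group and quantify that group. This is only a sketch under that assumption, not the code from the webrev; the class QuantifiedLinebreak and the method exactly are hypothetical names.

```java
import java.util.regex.Pattern;

final class QuantifiedLinebreak {
    // The specified expansion of \R, written as an anonymous (non-capturing)
    // group so that a quantifier applies to the whole alternation.
    private static final String LINEBREAK_GROUP =
            "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    // Builds a pattern that matches exactly `count` consecutive linebreaks.
    static Pattern exactly(int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        return Pattern.compile(LINEBREAK_GROUP + "{" + count + "}");
    }

    public static void main(String[] args) {
        Pattern two = exactly(2);
        // CR then LF can be split into two linebreaks by backtracking.
        System.out.println(two.matcher("\r\n").matches());
        // Two LFs are two linebreaks.
        System.out.println(two.matcher("\n\n").matches());
        // A single LF is only one linebreak.
        System.out.println(two.matcher("\n").matches());
    }
}
```

Because the group contains an alternation, the regex engine can give back the "\r\n" alternative and retry with the single-character class, which is the backtracking step that the quantified \R was missing.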
>>
>> Without quantifiers, \R is still handled via the more efficient
>> implementation in LineEnding.
>>
>> Would you please help review the fix?
>>
>> Some minor cleanup was done along the way in the affected code.
>>
>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8235812
>> WEBREV: http://cr.openjdk.java.net/~igerasim/8235812/00/webrev/
>>
>> Control build and testing (tiers1-4) are all green.
>>
>
--
With kind regards,
Ivan Gerasimov
More information about the core-libs-dev
mailing list