6990617: Regular expression doesn't match if unicode character next to a digit.

Stephen Flores stephen.flores at oracle.com
Fri Dec 16 06:41:10 UTC 2011


I removed test 1 since test 2 tests the same state.

I updated the webrev.

Steve.

On 12/15/2011 01:44 PM, Xueming Shen wrote:
> I would suggest combining removeQEQuotingTest1 and 2 into one method;
> they look kind of redundant.
>
> Otherwise the change looks fine to me.
>
> -Sherman
>
>
> On 12/12/2011 08:16 PM, Stephen Flores wrote:
>> Thanks Sherman,
>>
>> I have added the regression test for the case below and added a
>> "continue" statement after line 1622 to get the case to pass.
>>
>> I have updated the webrev.
>>
>> Steve.
>>
>> On 12/12/2011 02:22 PM, Xueming Shen wrote:
>>> Hi Steve,
>>>
>>> The \x3[0-9] approach is interesting; it appears to solve the problem
>>> without paying the higher cost I originally expected (looking back,
>>> for example).
>>>
>>> The logic of initializing/setting/unsetting "beginQuote" to true/false
>>> appears to be incorrect when there are multiple \Qn...\E in one pattern.
>>> The setting at Ln#1622 will always be followed by Ln#1630, if I read the
>>> code correctly.
>>>
>>> For example
>>>
>>> Pattern pattern =
>>> Pattern.compile("\\011\\Q1sometext\\E\\011\\Q2sometext\\E");
>>> Matcher matcher = pattern.matcher("\t1sometext\t2sometext");
>>> System.out.printf("find=%b%n", matcher.find());
>>>
>>> will still return false?
>>>
>>> -Sherman
>>>
>>> On 12/09/2011 10:05 PM, Stephen Flores wrote:
>>>> Please review the following webrev (includes new test to demonstrate
>>>> the bug):
>>>>
>>>> http://cr.openjdk.java.net/~sflores/6990617/
>>>>
>>>> for bug: 6990617 Regular expression doesn't match if unicode character
>>>> next to a digit.
>>>>
>>>> A DESCRIPTION OF THE PROBLEM :
>>>>
>>>> Unicode characters can be written as octal escapes of the form \\0 + number.
>>>> For instance, one could write:
>>>> Pattern p = Pattern.compile("\\011some text\\012");
>>>> Matcher m = p.matcher("\tsome text\n");
>>>> System.out.println(m.find()); // yields "true"
>>>>
>>>> However, if we want to match a string with a digit next to
>>>> the unicode character, it doesn't match (whether we "quote"
>>>> the regular expression or not). Note the "1" next to the tab
>>>> character (unicode 011).
>>>> Pattern p = Pattern.compile("\\011\\Q1some text\\E\\012");
>>>> Matcher m = p.matcher("\t1some text\n");
>>>> System.out.println(m.find()); // yields "false"
>>>>
>>>> This happens because Pattern accepts either \\0011 or \\011 for
>>>> the same character, so once the quoting is stripped, \\0111 is read
>>>> as a single octal escape. From the javadoc:
>>>>
>>>> \0nn The character with octal value 0nn (0 <= n <= 7)
>>>> \0mnn The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
>>>>
>>>> Evaluation:
>>>>
>>>> Pattern.RemoveQEQuoting() does NOT handle this situation correctly.
>>>> The existing implementation simply copies all ASCII.isAlnum()
>>>> characters when handling a quote.
>>>>
>>>> Description of fix:
>>>>
>>>> In the method Pattern.RemoveQEQuoting, any ASCII digit at the
>>>> beginning of a quote will now be prefixed by "\x3" (not just "\",
>>>> because that would be a backref). The digit then completes a hex
>>>> escape: 0x30 is the ASCII code for '0', so "1" becomes "\x31".
>>>>
>>>> Thanks,
>>>>
>>>> Steve.
>>>
>
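
For readers following the thread, the rewrite discussed above can be sketched roughly as follows. This is not the code from the webrev: QuoteRewriteSketch, removeQuoting, finds and the protectLeadingDigit flag are hypothetical names made up for illustration, and the sketch only handles ASCII input. It assumes the quote-removal step copies letters, digits and spaces unchanged and backslash-escapes everything else, and it resets beginQuote on every \Q so that patterns with several quotes are covered.

```java
import java.util.regex.Pattern;

class QuoteRewriteSketch {

    // Hypothetical stand-in for Pattern.RemoveQEQuoting(): expands \Q...\E
    // spans into escaped characters. ASCII-only, for illustration.
    static String removeQuoting(String regex, boolean protectLeadingDigit) {
        StringBuilder out = new StringBuilder();
        boolean inQuote = false;
        boolean beginQuote = false; // true until the first quoted char is copied
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (!inQuote) {
                if (regex.startsWith("\\Q", i)) {
                    inQuote = true;
                    beginQuote = true; // reset for every \Q, not just the first
                    i++;               // skip the 'Q'
                } else {
                    out.append(c);
                    if (c == '\\' && i + 1 < regex.length()) {
                        out.append(regex.charAt(++i)); // keep escape pairs intact
                    }
                }
                continue;
            }
            if (regex.startsWith("\\E", i)) {
                inQuote = false;
                i++; // skip the 'E'
                continue;
            }
            boolean digit = c >= '0' && c <= '9';
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (digit && beginQuote && protectLeadingDigit) {
                out.append("\\x3").append(c); // '1' -> \x31, a hex escape for '1'
            } else if (digit || letter || c == ' ') {
                out.append(c);
            } else {
                out.append('\\').append(c);   // escape other characters
            }
            beginQuote = false;
        }
        return out.toString();
    }

    // Convenience wrapper around the real java.util.regex API.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String regex = "\\011\\Q1some text\\E\\012";
        String input = "\t1some text\n";
        // Naive rewrite: the quoted '1' fuses with \011 into \0111.
        System.out.println(finds(removeQuoting(regex, false), input));
        // Protected rewrite: the leading digit is emitted as \x31.
        System.out.println(finds(removeQuoting(regex, true), input));
    }
}
```

Run as written, this should print false and then true. The naive rewrite produces \0111some text\012, and Pattern reads \0111 as one octal escape (the letter 'I') rather than a tab followed by '1'. The protected rewrite produces \011\x31some text\012, which keeps the two characters apart.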



More information about the core-libs-dev mailing list