6990617: Regular expression doesn't match if unicode character next to a digit.

Thu Dec 15 18:44:15 UTC 2011

I would suggest combining removeQEQuotingTest1 and 2 into one method;
they look kind of redundant.

Otherwise the change looks fine to me.

-Sherman

On 12/12/2011 08:16 PM, Stephen Flores wrote:
> Thanks Sherman,
>
> I have added the regression test for the case below and added a 
> "continue" statement after line 1622 to get the case to pass.
>
> I have updated the webrev.
>
> Steve.
>
> On 12/12/2011 02:22 PM, Xueming Shen wrote:
>> Hi Steve,
>>
>> The \x3[0-9] approach is interesting; it appears to solve the problem
>> without the higher cost I originally expected (looking back, for example).
>>
>> The logic for initializing, setting, and unsetting "beginQuote" to
>> true/false appears to be incorrect when there are multiple \Qn...\E in
>> one pattern. The setting at Ln#1622 will always be followed by Ln#1630,
>> if I read the code correctly.
>>
>> For example
>>
>> Pattern pattern =
>> Pattern.compile("\\011\\Q1sometext\\E\\011\\Q2sometext\\E");
>> Matcher matcher = pattern.matcher("\t1sometext\t2sometext");
>> System.out.printf("find=%b%n", matcher.find());
>>
>> will still return false?
>>
>> -Sherman
>>
>> On 12/09/2011 10:05 PM, Stephen Flores wrote:
>>> Please review the following webrev (includes new test to demonstrate
>>> the bug):
>>>
>>> http://cr.openjdk.java.net/~sflores/6990617/
>>>
>>> for bug: 6990617 Regular expression doesn't match if unicode character
>>> next to a digit.
>>>
>>> A DESCRIPTION OF THE PROBLEM :
>>>
>>> Unicode characters can be written as octal escapes (\\0 followed by
>>> octal digits).
>>> For instance, one could write:
>>> Pattern p = Pattern.compile("\\011some text\\012");
>>> Matcher m = p.matcher("\tsome text\n");
>>> System.out.println(m.find()); // yields "true"
>>>
>>> However, if we want to match a string with a digit next to
>>> the unicode character, it doesn't match (whether we "quote"
>>> the regular expression or not). Note the "1" next to the tab
>>> character (octal 011).
>>> Pattern p = Pattern.compile("\\011\\Q1some text\\E\\012");
>>> Matcher m = p.matcher("\t1some text\n");
>>> System.out.println(m.find()); // yields "false"
>>>
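>>> This happens because Pattern accepts up to three octal digits after
>>> \\0, so \\0011 and \\011 denote the same character. Once the quote is
>>> removed, the pattern contains \\0111, which is read as a single octal
>>> escape instead of a tab followed by "1". From the javadoc: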
>>>
>>> \0nn The character with octal value 0nn (0 <= n <= 7)
>>> \0mnn The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
>>>
>>> Evaluation:
>>>
>>> Pattern.RemoveQEQuoting() does not handle this situation correctly.
>>> The existing implementation simply copies all ASCII.isAlnum()
>>> characters unchanged when handling a quote.
>>>
>>> Description of fix:
>>>
>>> In the method Pattern.RemoveQEQuoting, any ASCII digit at the
>>> beginning of a quote will now be emitted as "\x3" followed by that
>>> digit (not just "\" followed by the digit, because that would be a
>>> backref). 0x30 is the ASCII code for '0', so "1" becomes "\x31".
>>>
>>> Thanks,
>>>
>>> Steve.
>>
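
The octal ambiguity and the "\x3" rewrite described in the thread can be reproduced without relying on the behavior of any particular JDK build by writing out by hand the patterns that quote removal would produce. The following is only a sketch: the class OctalEscapeDemo and the helper finds are hypothetical names introduced for illustration, and the hand-written patterns are an assumption about what the old and new RemoveQEQuoting output looks like, based on the description of the fix.

```java
import java.util.regex.Pattern;

class OctalEscapeDemo {
    // Hypothetical helper: true if the regex finds a match anywhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "\t1some text\n";

        // Quote stripped naively: the '1' joins the octal escape, so \0111 is
        // read as octal 111 (the letter 'I') and the tab is no longer matched.
        System.out.println(finds("\\0111some text\\012", input));      // false

        // Leading digit written as \x31: \011 stays a tab and \x31 matches '1'.
        System.out.println(finds("\\011\\x31some text\\012", input));  // true

        // The same rewrite applied to two quoted sections in one pattern,
        // mirroring the multiple \Q...\E case raised in the review.
        System.out.println(finds("\\011\\x31sometext\\011\\x32sometext",
                                 "\t1sometext\t2sometext"));           // true
    }
}
```

A plain "\1" would not work in place of "\x31" because the regex engine would treat it as a back reference to group 1, which is why the fix uses the hexadecimal form.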