6990617: Regular expression doesn't match if unicode character next to a digit.
Xueming Shen
xueming.shen at oracle.com
Mon Dec 12 19:22:35 UTC 2011

Hi Steve,
The \x3[0-9] approach is interesting; it appears to solve the problem
without paying the higher cost I originally expected (looking back, for
example).

However, the logic that initializes, sets, and unsets "beginQuote" appears
to be incorrect when there are multiple \Qn...\E in one pattern. If I read
the code correctly, the assignment at Ln#1622 will always be followed by
the one at Ln#1630.
For example, with two quotes that each begin with a digit:
         Pattern pattern = Pattern.compile("\\011\\Q1sometext\\E\\011\\Q2sometext\\E");
         Matcher matcher = pattern.matcher("\t1sometext\t2sometext");
         System.out.printf("find=%b%n", matcher.find());
Won't matcher.find() still return false here?
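
To make the case concrete, here is a small sketch of the ambiguity and of
the \x3N idea applied by hand to every quoted section, not just the first.
The class name OctalEscapeDemo and the helpers finds() and
guardLeadingDigit() are made up for illustration; they are not part of
Pattern or of the webrev. It relies only on the documented \0mnn and \xhh
escapes and on Pattern.quote().

```java
import java.util.regex.Pattern;

class OctalEscapeDemo {
    // Convenience wrapper: does the regex match anywhere in the input?
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Hypothetical helper (not a JDK API): quotes a literal, but writes a
    // leading ASCII digit as the hex escape \x3N so that a preceding octal
    // escape such as \011 cannot absorb it as a third octal digit.
    static String guardLeadingDigit(String literal) {
        if (!literal.isEmpty()) {
            char c = literal.charAt(0);
            if (c >= '0' && c <= '9') {
                return "\\x3" + c + Pattern.quote(literal.substring(1));
            }
        }
        return Pattern.quote(literal);
    }

    public static void main(String[] args) {
        // \0111 is read as one octal escape (0111 == 73 == 'I'),
        // not as a tab followed by the digit 1.
        System.out.println(finds("\\0111", "I"));    // true
        System.out.println(finds("\\0111", "\t1"));  // false

        // Guard the leading digit of BOTH quoted sections.
        String regex = "\\011" + guardLeadingDigit("1sometext")
                     + "\\011" + guardLeadingDigit("2sometext");
        System.out.println(finds(regex, "\t1sometext\t2sometext")); // true
    }
}
```

The point of the last case is that each quote needs its own leading-digit
check, which is what the "beginQuote" flag would have to track per quote.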
-Sherman
On 12/09/2011 10:05 PM, Stephen Flores wrote:
> Please review the following webrev (includes new test to demonstrate the bug):
>
>   http://cr.openjdk.java.net/~sflores/6990617/
>
> for bug: 6990617 Regular expression doesn't match if unicode character next to a digit.
>
> A DESCRIPTION OF THE PROBLEM :
>
>   Unicode characters are represented as \\+number.
>   For instance, one could write:
>             Pattern p = Pattern.compile("\\011some text\\012");
>             Matcher m = p.matcher("\tsome text\n");
>             System.out.println(m.find()); // yields "true"
>
>   However, if we want to match a string with a digit next to
>   the unicode character, it doesn't match (whether we "quote"
>   the regular expression or not). Note the "1" next to the tab
>   character (unicode 011).
>             Pattern p = Pattern.compile("\\011\\Q1some text\\E\\012");
>             Matcher m = p.matcher("\t1some text\n");
>             System.out.println(m.find());  // yields "false"
>
>   This happens because Pattern accepts either \\0011 or \\011 for
>   the same character. From the javadoc:
>
>     \0nn  The character with octal value 0nn (0 <= n <= 7)
>     \0mnn The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
>
> Evaluation:
>
>   Pattern.RemoveQEQuoting() does NOT handle this situation correctly.
>   The existing implementation simply copies all ASCII.isAlnum()
>   characters when handling a quote.
>
> Description of fix:
>
>   In the method Pattern.RemoveQEQuoting any ASCII digit at the
>   beginning of a quote will now be prefixed by "\x3" (not just "\",
>   because that would make it a backref). 0x30 is the ASCII code for
>   '0', so a leading digit d becomes the hex escape \x3d.
>
> Thanks,
>
>   Steve.
    
    