6990617: Regular expression doesn't match if unicode character next to a digit.

Stephen Flores stephen.flores at oracle.com
Sat Dec 10 06:05:33 UTC 2011


Please review the following webrev (includes new test to demonstrate the 
bug):

   http://cr.openjdk.java.net/~sflores/6990617/

for bug: 6990617 Regular expression doesn't match if unicode character 
next to a digit.

A DESCRIPTION OF THE PROBLEM :

   Characters can be written in a pattern as octal escapes: a
   backslash, a 0, and then the octal value (the backslash is
   doubled in Java source). For instance, one could write:
             Pattern p = Pattern.compile("\\011some text\\012");
             Matcher m = p.matcher("\tsome text\n");
             System.out.println(m.find()); // yields "true"

   However, if we want to match a string with a digit next to
   the escaped character, it doesn't match (whether we "quote"
   the regular expression or not). Note the "1" next to the tab
   character (octal 011).
             Pattern p = Pattern.compile("\\011\\Q1some text\\E\\012");
             Matcher m = p.matcher("\t1some text\n");
             System.out.println(m.find());  // yields "false"

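   This happens because Pattern accepts either \\0011 or \\011 for
   the same character. Once the quoting is removed, the pattern
   reads \\0111some text, and \\0111 is parsed as a single octal
   escape (octal 111, the letter 'I') rather than a tab followed
   by "1". From the javadoc: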

     \0nn  The character with octal value 0nn (0 <= n <= 7)
     \0mnn The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
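
   The parsing rule quoted above can be checked directly. The
   following is only an illustrative sketch; the class and method
   names are made up for this example and are not part of the
   webrev.

```java
import java.util.regex.Pattern;

// Illustrative only: class and method names are assumptions for this sketch.
class OctalEscapeDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \011 and \0011 both denote the tab character (octal 11).
        System.out.println(matches("\\011", "\t"));    // true
        System.out.println(matches("\\0011", "\t"));   // true
        // \0111 is consumed as one escape: octal 111 is 'I'.
        System.out.println(matches("\\0111", "I"));    // true
        // So it does not mean "tab followed by the digit 1".
        System.out.println(matches("\\0111", "\t1"));  // false
    }
}
```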

Evaluation:

   Pattern.RemoveQEQuoting() does NOT handle this situation correctly.
   The existing implementation simply copies all ASCII.isAlnum()
   characters unchanged when handling a quote.

Description of fix:

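   In the method Pattern.RemoveQEQuoting, any ASCII digit at the
   beginning of a quote will now be prefixed by "\x3" (not just \,
   because that would be a backref). 0x30 is the ASCII code for '0',
   so a leading "1" becomes \x31, which still matches "1" but can
   no longer extend a preceding octal escape.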
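
   The idea behind the fix can be sketched outside the JDK as
   follows. This is not the code from the webrev:
   escapeLeadingDigit is a hypothetical helper that only rewrites
   a leading ASCII digit and does not escape any other
   characters.

```java
import java.util.regex.Pattern;

// Sketch of the escaping idea only; not the actual Pattern.RemoveQEQuoting change.
class LeadingDigitEscape {
    // Hypothetical helper: rewrite a leading ASCII digit N as the hex escape \x3N.
    static String escapeLeadingDigit(String literal) {
        if (literal.isEmpty()) {
            return literal;
        }
        char c = literal.charAt(0);
        if (c >= '0' && c <= '9') {
            // '0'..'9' are 0x30..0x39, so "\x3" followed by the digit names the same character.
            return "\\x3" + c + literal.substring(1);
        }
        return literal;
    }

    public static void main(String[] args) {
        String regex = "\\011" + escapeLeadingDigit("1some text") + "\\012";
        System.out.println(regex);  // \011\x31some text\012
        System.out.println(
            Pattern.compile(regex).matcher("\t1some text\n").find());  // true
    }
}
```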

Thanks,

   Steve.


