<i18n dev> RL1.1 Hex Notation

Xueming Shen xueming.shen at oracle.com
Wed Jan 26 12:47:27 PST 2011


On 01/26/2011 11:50 AM, Mark Davis ☕ wrote:
> > I guess you are asking for something like?
>
> I'm not asking for that. What I'm saying is that as far as I can tell, 
> there is no way in Java to meet the terms of RL1.1, because there is 
> not a way to use hex numbers in any syntax for values above FFFF to 
> indicate literals. That is, if you supply "abc\\uD800\\uDC00def" then 
> regex fails.
>
> The code was my attempt to try to get something to work even using 
> separate surrogates (which was not the intent of RL1.1), but even that 
> failed. Maybe there is another way to do it?
>
> Mark

Oh, I see the problem. Obviously I have been working on JDK 7 too long
and forgot that the latest release is still 6 :-( There is indeed a bug
in the previous implementation, which I fixed in 7 a long time ago (I
mentioned this in one of the earlier emails but was not specific, my
apologies). We should probably backport the fix to a 6 update release
ASAP. The test case runs well on 7 (the "failures" in literals are
expected), with the following output. I modified your test case
"slightly": the UnicodeSet class in our normalizer package does not
appear to have a size() method, so I replaced it with a normal hash set.

-Sherman

------------------------------------------------------------------
LITERALS Failures: 18
     set: [9, 10, 11, 12, 13, 32, 35, 36, 40, 41, 42, 43, 63, 91, 92, 94, 123, 124]
     example1: a    b
     exampleN: a|b
INLINE Failures: 0
     set: []
     example1: null
     exampleN: null
INRANGE Failures: 0
     set: []
     example1: null
     exampleN: null
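
The 18 code points in the LITERALS set are the ASCII whitespace
characters (9-13, 32), '#' (35), and the regex metacharacters
$ ( ) * + ? [ \ ^ { |. Under Pattern.COMMENTS, whitespace is ignored and
'#' starts a comment, so none of these can match themselves when placed
raw in a pattern. A minimal sketch of that behaviour follows; the class
name and the sample code points are illustrative only, and the quoted
column is just one possible way to force a literal match.

```java
import java.util.regex.Pattern;

class LiteralFailureCheck {

    // Same idea as the test case: compile with COMMENTS and treat a
    // syntax error as "no match".
    static boolean matches(String pattern, String target) {
        try {
            return Pattern.compile(pattern, Pattern.COMMENTS)
                    .matcher(target).matches();
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // tab, '#', '(', '|' are in the failure set; '.' and 'a' are not
        int[] samples = {9, 35, 40, 124, 46, 97};
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            String target = "a" + s + "b";
            boolean raw = matches("a" + s + "b", target);
            boolean quoted = matches("a" + Pattern.quote(s) + "b", target);
            System.out.println(cp + ": raw=" + raw + " quoted=" + quoted);
        }
    }
}
```

Note that '.' (46) is a metacharacter too, but "a.b" still matches the
target "a.b", which is why it does not show up in the set.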

-----------------------------------------------------------------------
import java.util.regex.*;
import java.util.*;
import sun.text.normalizer.*;

public class TestRegex2 {

    public static void main(String[] args) {

         System.out.println("Check patterns for Unicodeset");

         for (int i = 0; i <= 0x10FFFF; ++i) {
             // The goal is to make a regex with hex digits, and have it
             // match the corresponding character.
             // We check two different environments: inline ("aXb") and
             // in a range ("a[X]b").


             String s = new StringBuilder().appendCodePoint(i).toString();
             String hexPattern = i <= 0xFFFF ? "\\u" + Utility.hex(i,4)
                     : "\\u" + Utility.hex(Character.toChars(i)[0],4)
                     + "\\u" + Utility.hex(Character.toChars(i)[1],4);

             String target = "a" + s + "b";

             Failures.LITERALS.checkMatch(i, "a" + s + "b", target);
             Failures.INLINE.checkMatch(i, "a" + hexPattern + "b", target);
             Failures.INRANGE.checkMatch(i, "a[" + hexPattern + "]b", target);
         }

         Failures.LITERALS.showFailures();
         Failures.INLINE.showFailures();
         Failures.INRANGE.showFailures();
     }


     static enum Failures {

         LITERALS, INLINE, INRANGE;

         Set<Integer> failureSet = new LinkedHashSet<Integer>();
         String firstSampleFailure;
         String lastSampleFailure;

         void checkMatch(int codePoint, String pattern, String target) {

             if (!matches(pattern, target)) {
                 failureSet.add(codePoint);
                 if (firstSampleFailure == null) {
                     firstSampleFailure = pattern;
                 }
                 lastSampleFailure = pattern;
             }
         }

         boolean matches(String hexPattern, String target) {
             try {
                 // use COMMENTS to get the 'worst case'
                 return Pattern.compile(hexPattern, Pattern.COMMENTS)
                         .matcher(target).matches();
             } catch (Exception e) {
                 return false;
             }
         }

         void showFailures() {
             System.out.format(this + " Failures: %s\n\tset: %s\n\texample1: %s\n\texampleN: %s\n",
                     failureSet.size(), failureSet, firstSampleFailure,
                     lastSampleFailure);
         }

     }
}
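
For a quick spot check that does not depend on the internal
sun.text.normalizer package, the same hex patterns can be built with
String.format. This is only a minimal sketch, assuming JDK 7 or later
(where the surrogate-pair fix is in place); the class and method names
are made up for illustration and are not part of the test case above.

```java
import java.util.regex.Pattern;

class SupplementaryHexCheck {

    // Build "\\uXXXX" for a BMP code point, or "\\uXXXX\\uXXXX"
    // (high surrogate, low surrogate) for a supplementary one.
    static String hexEscape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    // Compile with COMMENTS, as the full test does.
    static boolean matches(String pattern, String target) {
        return Pattern.compile(pattern, Pattern.COMMENTS)
                .matcher(target).matches();
    }

    public static void main(String[] args) {
        int cp = 0x10000; // first supplementary code point
        String target = "a" + new String(Character.toChars(cp)) + "b";
        String hex = hexEscape(cp);

        System.out.println("pattern escape: " + hex);
        System.out.println("inline:   " + matches("a" + hex + "b", target));
        System.out.println("in range: " + matches("a[" + hex + "]b", target));
    }
}
```

On a release without the fix, the inline and in-range checks for
supplementary code points would be expected to report false.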
