<i18n dev> RL1.1 Hex Notation

Tue Jan 25 17:00:50 PST 2011

The goal of the clause is to have a mechanism for using hex values for
character literals. That is, you should be able to take a code point from 0
to 10FFFF, get a hex value for that, embed it in some syntax, and
concatenate it into a pattern, and have it work as a literal.

For example:

String pattern = first_part + "\\x{" + hex(myCodePoint) + "}" + second_part;
// for *some* hex notation
...
Matcher m = Pattern.compile(pattern, Pattern.COMMENTS).matcher(target);
...
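
For a concrete instance of that concatenation, here is a minimal sketch. It
assumes a runtime whose java.util.regex.Pattern accepts the \x{h...h}
construct, which is documented from Java 7 onward and so postdates this
message. The class and method names (HexNotationSketch, hexLiteral,
matchesLiteral) are hypothetical, chosen only for illustration.

```java
import java.util.regex.Pattern;

final class HexNotationSketch {
    // Hypothetical helper: wraps a code point in the braces form of the hex escape.
    static String hexLiteral(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return "\\x{" + Integer.toHexString(codePoint) + "}";
    }

    // Builds "a" + escape + "b" and checks it against the text "a" + character + "b".
    static boolean matchesLiteral(int codePoint) {
        String pattern = "a" + hexLiteral(codePoint) + "b";
        String target = "a" + new String(Character.toChars(codePoint)) + "b";
        // COMMENTS is used as the worst case, as in the test further down.
        return Pattern.compile(pattern, Pattern.COMMENTS).matcher(target).matches();
    }

    public static void main(String[] args) {
        System.out.println(hexLiteral(0x10000));
        System.out.println(matchesLiteral('|'));     // a syntax character
        System.out.println(matchesLiteral(0x10000)); // a non-BMP code point
    }
}
```

The point of the braces form is that the same helper covers every code point
from 0 to 10FFFF, with no special casing for syntax characters or for
supplementary characters.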

As far as I can tell, Java really doesn't supply that capability for
non-BMP characters, because the \u notation doesn't work above FFFF, except
insofar as the preprocessor maps a surrogate pair in hex to literal
characters, which all happen to work because they aren't syntax characters.
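
To make the surrogate-pair point concrete, here is a small sketch using only
java.lang. It shows that a pair of \u escapes in a string constant yields two
chars forming one code point, and how a supplementary code point splits into
that pair. The class and method names are hypothetical.

```java
final class SurrogatePairSketch {
    // Splits a code point into its UTF-16 code units, each formatted
    // as a four-digit uppercase hex value and concatenated.
    static String surrogateHex(int codePoint) {
        char[] units = Character.toChars(codePoint);
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The compiler translates the two escapes below before it parses the
        // string literal, so the constant holds two chars and one code point.
        String constant = "\uD800\uDC00";
        System.out.println(constant.length());
        System.out.println(Integer.toHexString(constant.codePointAt(0)));
        System.out.println(surrogateHex(0x10000));
    }
}
```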

What you can do with Java is:

   1. embed the character itself, not the hex representation, which works
   some of the time (it fails for 18 characters: the syntax characters, as
   expected, plus whitespace and # when COMMENTS is on).
   2. in constant expressions only, use the Java preprocessor with
   \u.... (or \u....\u.... for a surrogate pair).
   3. for BMP characters, use "\\u" + hex(myCodePoint,4)
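
The first and third options above can be sketched as follows, using only the
JDK. String.format stands in for the hex() helper, and the class and method
names are hypothetical.

```java
import java.util.regex.Pattern;

final class BmpEscapeSketch {
    // Option 3: a regex-level escape made of backslash, 'u', and four hex digits.
    // This form only covers BMP code points.
    static String bmpHexEscape(int codePoint) {
        if (codePoint < 0 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("BMP code points only");
        }
        return "\\u" + String.format("%04X", codePoint);
    }

    // Option 1: the character itself, with no escaping at all.
    static String rawLiteral(int codePoint) {
        return new StringBuilder().appendCodePoint(codePoint).toString();
    }

    // Checks whether "a" + middle + "b" matches "a" + character + "b".
    static boolean matches(String middle, int codePoint) {
        String target = "a" + rawLiteral(codePoint) + "b";
        try {
            return Pattern.compile("a" + middle + "b", Pattern.COMMENTS)
                    .matcher(target).matches();
        } catch (RuntimeException e) {
            return false; // e.g. a syntax character that leaves the pattern malformed
        }
    }

    public static void main(String[] args) {
        System.out.println(matches(rawLiteral('|'), '|'));         // raw syntax character
        System.out.println(matches(bmpHexEscape('|'), '|'));       // same character, escaped
        System.out.println(matches(rawLiteral(0x10000), 0x10000)); // raw non-BMP character
    }
}
```

The raw '|' is read as alternation rather than as a literal, which is why
embedding the character itself is only reliable for non-syntax characters.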

Here is a quick and dirty test; let me know if I've missed something.

*Output:*

LITERALS Failures: 18
        set: [\u0009-\u000D\ #\$(-+?\[\\\^\{|]
        example1: a b
        exampleN: a|b

INLINE Failures: 1048576
        set: [\U00010000-\U0010FFFF]
        example1: a\uD800\uDC00b
        exampleN: a\uDBFF\uDFFFb

INRANGE Failures: 1048576
        set: [\U00010000-\U0010FFFF]
        example1: a[\uD800\uDC00]b
        exampleN: a[\uDBFF\uDFFF]b

*Code:*

    // Assumed context (not shown in the original message): a test class that
    // provides logln(), plus these imports:
    //   import java.util.regex.Pattern;
    //   import com.ibm.icu.impl.Utility;
    //   import com.ibm.icu.text.UnicodeSet;

    public void TestRegex() {
        logln("Check patterns for Unicodeset");
        for (int i = 0; i <= 0x10FFFF; ++i) {
            // The goal is to make a regex with hex digits, and have it
            // match the corresponding character.
            // We check two different environments: inline ("aXb") and in a
            // range ("a[X]b").
            String s = new StringBuilder().appendCodePoint(i).toString();
            String hexPattern = i <= 0xFFFF ? "\\u" + Utility.hex(i,4)
                    : "\\u" + Utility.hex(Character.toChars(i)[0],4)
                    + "\\u" + Utility.hex(Character.toChars(i)[1],4);
            String target = "a" + s + "b";
            Failures.LITERALS.checkMatch(i, "a" + s + "b", target);
            Failures.INLINE.checkMatch(i, "a" + hexPattern + "b", target);
            Failures.INRANGE.checkMatch(i, "a[" + hexPattern + "]b", target);
        }
        Failures.LITERALS.showFailures();
        Failures.INLINE.showFailures();
        Failures.INRANGE.showFailures();
    }

    enum Failures {
        LITERALS, INLINE, INRANGE;

        UnicodeSet failureSet = new UnicodeSet();
        String firstSampleFailure;
        String lastSampleFailure;

        // Records the code point, and the first and last failing patterns.
        void checkMatch(int codePoint, String pattern, String target) {
            if (!matches(pattern, target)) {
                failureSet.add(codePoint);
                if (firstSampleFailure == null) {
                    firstSampleFailure = pattern;
                }
                lastSampleFailure = pattern;
            }
        }

        boolean matches(String hexPattern, String target) {
            try {
                // use COMMENTS to get the 'worst case'
                return Pattern.compile(hexPattern, Pattern.COMMENTS)
                        .matcher(target).matches();
            } catch (Exception e) {
                // a pattern that fails to compile counts as a failure
                return false;
            }
        }

        void showFailures() {
            System.out.format(this + " Failures: %s\n\tset: %s\n\texample1: %s\n\texampleN: %s\n",
                    failureSet.size(), failureSet, firstSampleFailure,
                    lastSampleFailure);
        }
    }