<i18n dev> RL1.1 Hex Notation

Mark Davis ☕ mark at macchiato.com
Wed Jan 26 11:50:15 PST 2011


> I guess you are asking for something like?

I'm not asking for that. What I'm saying is that, as far as I can tell, there
is no way in Java to meet the terms of RL1.1, because no syntax accepts hex
numbers for values above FFFF as literals. That is, if you supply
"abc\\uD800\\uDC00def", the regex fails.

The code was my attempt to get something to work even using separate
surrogates (which was not the intent of RL1.1), but even that failed. Maybe
there is another way to do it?
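
For reference, the capability RL1.1 describes (take a code point from 0 to
10FFFF, turn it into hex, concatenate it into a pattern, and have it act as a
literal) can be sketched as below. This is only a minimal sketch and it rests
on an assumption that does not hold for the JDK discussed in this thread: it
assumes a Java 7 or later runtime, whose Pattern documentation lists an
\x{h...h} escape. The class name HexLiteralSketch and the helpers hexLiteral
and matchesAsLiteral are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; assumes Java 7+, where Pattern accepts \x{h...h}.
class HexLiteralSketch {

    // Builds a regex fragment that names a code point by its hex value.
    static String hexLiteral(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a code point: " + codePoint);
        }
        return "\\x{" + Integer.toHexString(codePoint) + "}";
    }

    // True if "a" + hex escape + "b" matches "a" + the character itself + "b".
    static boolean matchesAsLiteral(int codePoint) {
        String target = "a" + new String(Character.toChars(codePoint)) + "b";
        String pattern = "a" + hexLiteral(codePoint) + "b";
        // COMMENTS mirrors the 'worst case' flag used in the quoted test.
        return Pattern.compile(pattern, Pattern.COMMENTS).matcher(target).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesAsLiteral(0x12345)); // supplementary code point
        System.out.println(matchesAsLiteral('|'));     // regex syntax character
        System.out.println(matchesAsLiteral(' '));     // whitespace under COMMENTS
    }
}
```

The point of the sketch is that the caller never has to distinguish BMP from
supplementary code points or escape syntax characters separately; the hex
value alone identifies the literal.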

Mark

*— Il meglio è l’inimico del bene —*


On Tue, Jan 25, 2011 at 17:47, Xueming Shen <xueming.shen at oracle.com> wrote:

>  Hi Mark,
>
> I guess you are asking for something like?
>
>         char[] cc = Character.toChars(0x12345);
>         Matcher m = Pattern.compile("["
>                                      + "\\u" + HEX(cc[0])
>                                      + "\\u" + HEX(cc[1])
>                                      + "]").matcher("");
>         System.out.println("find=" + m.reset("abc[" + new String(cc) +
> "]efg").find());
>
> in which HEX should be something like the following, to produce a
> four-digit nnnn.
>
>     static String HEX(char c) {
>         StringBuilder sb = new StringBuilder();
>         Formatter fm = new Formatter(sb);
>         fm.format("%04x", (int)c);
>         return sb.toString();
>     }
>
> It looks a little tedious, and you will probably also have to differentiate
> BMP from supplementary code points to decide whether to feed in one UTF-16
> hex value or a pair. This is just to show that you can still use the Java
> Unicode escape to embed the hex values of the UTF-16 code units instead of
> the "character itself". Does it qualify for RL1.1?
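
The BMP-versus-supplementary distinction mentioned above can be sketched as a
single helper. This is a minimal sketch, not code from the thread: the class
name Utf16EscapeSketch and the methods unitEscape and utf16Escape are
hypothetical, and the helper only builds the escape text (one backslash-u
escape for a BMP code point, two for a supplementary one). It makes no claim
about whether a given regex engine then treats the pair as one literal, which
is the question under discussion here.

```java
import java.util.Locale;

// Hypothetical helper; it only builds escape text and does not compile a pattern.
class Utf16EscapeSketch {

    // Formats one UTF-16 code unit as backslash-u plus four uppercase hex digits.
    static String unitEscape(char unit) {
        return String.format(Locale.ROOT, "\\u%04X", (int) unit);
    }

    // Emits one escape for a BMP code point, or a surrogate pair of escapes
    // for a supplementary code point.
    static String utf16Escape(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars returns one char for BMP and two for supplementary.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(unitEscape(unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf16Escape(0x0041));
        System.out.println(utf16Escape(0x12345));
    }
}
```

Looping over Character.toChars avoids an explicit "is it supplementary?"
branch, since the array length already encodes that decision.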
>
> Sure, \x{...} looks more straightforward and convenient. As I said in the
> previous email exchange, I totally agree it will be a nice enhancement for
> Java RegEx.
>
> -Sherman
>
>
> On 1-25-2011 17:00 05:00 PM, Mark Davis ☕ wrote:
>
> The goal of the clause is to have a mechanism for using hex values for
> character literals. That is, you should be able to take a code point from
> 0 to 10FFFF, get a hex value for that, embed it in some syntax, and
> concatenate it into a pattern, and have it work as a literal.
>
>  For example:
>
>  String pattern = first_part + "\\x{" + hex(myCodePoint) + "}" +
> second_part; // for *some* hex notation
> ...
> Matcher m = Pattern.compile(pattern, Pattern.COMMENTS).matcher(target);
> ...
>
>
>  As far as I can tell, Java really doesn't supply that capability for
> non-BMP, because the \u notation doesn't work above FFFF, except insofar as
> the preprocessor maps a surrogate pair in hex to literals, which happen all
> to work because they aren't syntax characters.
>
>  What you can do with Java is:
>
>    1. embed the character itself, not the hex representation, which works
>    some of the time (it fails for 18 characters: the syntax characters, as
>    expected).
>    2. in constant expressions only, utilize the Java preprocessor (with
>    \u.... or \u....\u....).
>    3. for BMP characters, use "\\u" + hex(myCodePoint,4)
>
> Here is a quick and dirty test; let me know if I've missed something.
>
>  *Output:*
>
> LITERALS Failures: 18
>         set: [\u0009-\u000D\ #\$(-+?\[\\\^\{|]
>         example1: a b
>         exampleN: a|b
> INLINE Failures: 1048576
>         set: [\U00010000-\U0010FFFF]
>         example1: a\uD800\uDC00b
>         exampleN: a\uDBFF\uDFFFb
> INRANGE Failures: 1048576
>         set: [\U00010000-\U0010FFFF]
>         example1: a[\uD800\uDC00]b
>         exampleN: a[\uDBFF\uDFFF]b
>
>
>  *Code:*
>
>     public void TestRegex() {
>         logln("Check patterns for Unicodeset");
>
>         for (int i = 0; i <= 0x10FFFF; ++i) {
>             // The goal is to make a regex with hex digits, and have it
>             // match the corresponding character.
>             // We check two different environments: inline ("aXb") and in
>             // a range ("a[X]b")
>             String s = new StringBuilder().appendCodePoint(i).toString();
>             String hexPattern = i <= 0xFFFF ? "\\u" + Utility.hex(i,4)
>                     : "\\u" + Utility.hex(Character.toChars(i)[0],4)
>                     + "\\u" + Utility.hex(Character.toChars(i)[1],4);
>             String target = "a" + s + "b";
>
>             Failures.LITERALS.checkMatch(i, "a" + s + "b", target);
>             Failures.INLINE.checkMatch(i, "a" + hexPattern + "b", target);
>             Failures.INRANGE.checkMatch(i, "a[" + hexPattern + "]b", target);
>         }
>         Failures.LITERALS.showFailures();
>         Failures.INLINE.showFailures();
>         Failures.INRANGE.showFailures();
>     }
>
>     enum Failures {
>         LITERALS, INLINE, INRANGE;
>
>         UnicodeSet failureSet = new UnicodeSet();
>         String firstSampleFailure;
>         String lastSampleFailure;
>
>         void checkMatch(int codePoint, String pattern, String target) {
>             if (!matches(pattern, target)) {
>                 failureSet.add(codePoint);
>                 if (firstSampleFailure == null) {
>                     firstSampleFailure = pattern;
>                 }
>                 lastSampleFailure = pattern;
>             }
>         }
>
>         boolean matches(String hexPattern, String target) {
>             try {
>                 // use COMMENTS to get the 'worst case'
>                 return Pattern.compile(hexPattern, Pattern.COMMENTS)
>                         .matcher(target).matches();
>             } catch (Exception e) {
>                 return false;
>             }
>         }
>
>         void showFailures() {
>             System.out.format(this + " Failures: %s\n\tset: %s\n"
>                     + "\texample1: %s\n\texampleN: %s\n",
>                     failureSet.size(), failureSet, firstSampleFailure,
>                     lastSampleFailure);
>         }
>     }
>
>
>