<i18n dev> RL1.1 Hex Notation

Xueming Shen xueming.shen at oracle.com
Thu Jan 27 15:13:51 PST 2011


I ran

     import java.util.regex.Pattern;

     public class Test {
         public static void main(String[] args) {
             test("\uD800\uDF3C", "^\\x{1033c}$");
             test("\uD800\uDF3C", "^\\xF0\\x90\\x8C\\xBC$");
             test("\uD800\uDF3C", "^\\x{D800}\\x{DF3c}+$");
             test("\uD800\uDF3C", "^[\\x{D800}\\x{DF3c}]+$");
             test("\uD800\uDF3C", "^\\xF0\\x90\\x8C\\xBC$");
             test("\uD800\uDF3C", "^[\\xF0\\x90\\x8C\\xBC]+$");
             test("\uD800\uDF3C", "^[\\x{D800}\\x{DF3C}]+$");
             test("\uD800\uDF3C", "^[\\x{DF3C}\\x{D800}]+$");
             test("\uDF3C\uD800", "^[\\x{D800}\\x{DF3C}]+$");
             test("\uDF3C\uD800", "^[\\x{DF3C}\\x{D800}]+$");
         }

         // Prints whether the entire text matches the pattern.
         static void test(String text, String pattern) {
             System.out.println(Pattern.matches(pattern, text));
         }
     }

It yields

true
false
false
false
false
false
false
false
true
true

The difference is at

         test("\uD800\uDF3C", "^[\\x{D800}\\x{DF3C}]+$");
         test("\uD800\uDF3C", "^[\\x{DF3C}\\x{D800}]+$");

You can have an unpaired surrogate in a Java String, but if you have a paired
one you can't say that you want it treated as two separate "unpaired" surrogates.
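
To make the distinction concrete, here is a minimal sketch of how a Java
String reports the two orderings in terms of code points. The class name
SurrogateDemo and the helper codePoints are made up for illustration; only
the String and Integer methods it calls are real API.

```java
class SurrogateDemo {
    // Number of Unicode code points the String is read as.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String paired = "\uD800\uDF3C";   // high surrogate followed by low surrogate
        String reversed = "\uDF3C\uD800"; // low then high: two unpaired surrogates

        System.out.println(codePoints(paired));                           // 1
        System.out.println(Integer.toHexString(paired.codePointAt(0)));   // 1033c
        System.out.println(codePoints(reversed));                         // 2
        System.out.println(Integer.toHexString(reversed.codePointAt(0))); // df3c
    }
}
```

The paired form is always read as the single code point U+1033C, which is
why a class listing U+D800 and U+DF3C separately does not match it.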

Pretty close, right? Sure, you would need the \x{...} patch :-) which I'm
preparing at
http://cr.openjdk.java.net/~sherman/7014645/

Yes, the [\\uhhhh\\ullll] pair inside a class is tricky: the implementation
can't tell whether you want paired or unpaired surrogates. The current
implementation treats them as a surrogate pair, that is, as one supplementary
character. An alternative is to write them as the union
[[\\uhhhh][\\ullll]\\uhhhh\\ullll] if you also want to match the "unpaired"
surrogates in a string.

For example:

         test("\uD800\uDF3C", "^[[\\uD800][\\uDF3C]\\uD800\\uDF3C]+$");
         test("\uDF3C\uD800", "^[[\\uD800][\\uDF3C]\\uD800\\uDF3C]+$");
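
The difference between the plain pair and the union form can be checked with
a small sketch like the one below. The class name UnionClassDemo and its
constants are only illustrative, and it assumes a Pattern implementation that
reads two consecutive surrogate escapes inside a class as one supplementary
character, as described above.

```java
import java.util.regex.Pattern;

class UnionClassDemo {
    // Class containing only the supplementary character U+1033C.
    static final String PAIR_ONLY = "^[\\uD800\\uDF3C]+$";
    // Union of each lone surrogate with the supplementary character.
    static final String UNION = "^[[\\uD800][\\uDF3C]\\uD800\\uDF3C]+$";

    static boolean matches(String pattern, String text) {
        return Pattern.matches(pattern, text);
    }

    public static void main(String[] args) {
        String paired = "\uD800\uDF3C";   // one code point, U+1033C
        String reversed = "\uDF3C\uD800"; // two unpaired surrogates

        System.out.println(matches(PAIR_ONLY, paired));   // true
        System.out.println(matches(PAIR_ONLY, reversed)); // false
        System.out.println(matches(UNION, paired));       // true
        System.out.println(matches(UNION, reversed));     // true
    }
}
```

The nested single-element classes keep each surrogate from being merged with
its neighbour, which is what lets the union match the unpaired case too.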

I assume that if I can get this \x{...} into (7), we all agree we are done
with RL1.1? :-)
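
As a rough sketch of what the \x{...} notation buys you, assuming a JDK whose
Pattern accepts \x{h...h} (the class name HexNotationDemo is invented for
illustration): the code point is written directly rather than as a surrogate
pair, while \xF0\x90\x8C\xBC only names four separate code points below U+0100
and says nothing about U+1033C.

```java
import java.util.regex.Pattern;

class HexNotationDemo {
    static boolean matches(String pattern, String text) {
        return Pattern.matches(pattern, text);
    }

    public static void main(String[] args) {
        // Build the text from the code point rather than from surrogate escapes.
        String text = new String(Character.toChars(0x1033C));

        // Code point written directly in hex notation.
        System.out.println(matches("^\\x{1033c}$", text));            // true
        // Same character written as its UTF-16 surrogate pair.
        System.out.println(matches("^\\uD800\\uDF3C$", text));        // true
        // The UTF-8 byte values of U+1033C, read as four code points.
        System.out.println(matches("^\\xF0\\x90\\x8C\\xBC$", text));  // false
    }
}
```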

-Sherman

On 01/27/2011 12:48 PM, Tom Christiansen wrote:
>
>
>> on 7 with the following output. I modified your test case "slightly"
>> since it appears the UnicodeSet class in our normalizer package does
>> not have the size(), replace it with a normal hashset.
> Does that mean the following now works?
>
>      1. a+b matches "[" + a + b + "]+"
>      2. b+a matches "[" + a + b + "]+"
>      3. a+b matches "[" + b + a + "]+"
>      4. b+a matches "[" + b + a + "]+"
>
> When a and b take on every Unicode code point, meaning
> from U+0000 up to  U+10FFFF?  If they do not, then one
> is not specifying Unicode code points.
>
> Please correct me if I am wrong, but I believe the following
> code showing how logical code points are *never* mistaken with
> their serialization representations is conforming behaviour--and
> that results other than these would indicate nonconforming behavior:
>
>      $ perl -le 'print "\x{1033c}" =~ /^\x{1033c}$/ || 0'
>      1
>      $ perl -le 'print "\xF0\x90\x8C\xBC" =~ /^\x{1033c}$/ || 0'
>      0
>      $ perl -le 'print "\x{D800}\x{DF3C}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
>      0
>
>      $ perl -le 'print "\x{1033c}" =~ /^\x{D800}\x{DF3c}+$/ || 0'
>      0
>      $ perl -le 'print "\x{1033c}" =~ /^[\x{D800}\x{DF3c}]+$/ || 0'
>      0
>      $ perl -le 'print "\x{1033c}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
>      0
>      $ perl -le 'print "\x{1033c}" =~ /^[\xF0\x90\x8C\xBC]+$/ || 0'
>      0
>
>      $ perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
>      1
>      $ perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
>      1
>      $ perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
>      1
>      $ perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
>      1
>
> Can Java do that yet?  If not, then \uXXXX does not meet RL1.1, and one
> appears to need \x{} or its equivalent to do so--with the proviso from
> the top of this message that it must not be double evaluated for meta
> characters: \x{} must always be a literal code point of that number
> without regard to reinterpretation as UTF-16 or as pattern syntax.
>
> I'm sorry if this is too terse.  I do not mean to be in the least bit
> confrontational!  I apologize in advance if it sounds that way; I really do
> not intend it.  It is possible that I have a different way of looking at
> regexes than Java folks have historically considered them.  Even if so,
> I believe my way of looking at them accords with tr18's RL1.1 in both
> its letter and its spirit, and that Java's current way fails to meet
> that requirement in either sense.
>
> --tom
>
>      #!/bin/sh
>      # expected results: 1 0 0 0 0 0 0 1 1 1 1
>      perl -le 'print "\x{1033c}" =~ /^\x{1033c}$/ || 0'
>      perl -le 'print "\xF0\x90\x8C\xBC" =~ /^\x{1033c}$/ || 0'
>      perl -le 'print "\x{D800}\x{DF3C}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
>      perl -le 'print "\x{1033c}" =~ /^\x{D800}\x{DF3c}+$/ || 0'
>      perl -le 'print "\x{1033c}" =~ /^[\x{D800}\x{DF3c}]+$/ || 0'
>      perl -le 'print "\x{1033c}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
>      perl -le 'print "\x{1033c}" =~ /^[\xF0\x90\x8C\xBC]+$/ || 0'
>      perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
>      perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
>      perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
>      perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'


