<i18n dev> RL1.1 Hex Notation
Xueming Shen
xueming.shen at oracle.com
Thu Jan 27 15:13:51 PST 2011
I ran
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        test("\uD800\uDF3C", "^\\x{1033c}$");
        test("\uD800\uDF3C", "^\\xF0\\x90\\x8C\\xBC$");
        test("\uD800\uDF3C", "^\\x{D800}\\x{DF3c}+$");
        test("\uD800\uDF3C", "^[\\x{D800}\\x{DF3c}]+$");
        test("\uD800\uDF3C", "^\\xF0\\x90\\x8C\\xBC$");
        test("\uD800\uDF3C", "^[\\xF0\\x90\\x8C\\xBC]+$");
        test("\uD800\uDF3C", "^[\\x{D800}\\x{DF3C}]+$");
        test("\uD800\uDF3C", "^[\\x{DF3C}\\x{D800}]+$");
        test("\uDF3C\uD800", "^[\\x{D800}\\x{DF3C}]+$");
        test("\uDF3C\uD800", "^[\\x{DF3C}\\x{D800}]+$");
    }

    // Prints whether the whole text matches the pattern.
    static void test(String text, String pattern) {
        System.out.println(Pattern.matches(pattern, text));
    }
}
It yields
true
false
false
false
false
false
false
false
true
true
The difference from your expected results is in these two cases:
test("\uD800\uDF3C", "^[\\x{D800}\\x{DF3C}]+$");
test("\uD800\uDF3C", "^[\\x{DF3C}\\x{D800}]+$");
You can have unpaired surrogates in a Java String, but if you have a
paired one you can't say "I want them to be two separate 'unpaired'
surrogates".
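To illustrate the point above, here is a minimal sketch using only standard
String methods (the class name SurrogateCount and the helper codePoints are
made up for this example). It shows that a high surrogate followed by a low
surrogate is seen as one code point, while the reversed order stays two
unpaired surrogates.

```java
public class SurrogateCount {
    // Number of code points in s, as opposed to the number of UTF-16 chars.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String paired = "\uD800\uDF3C";   // high surrogate, then low surrogate
        String reversed = "\uDF3C\uD800"; // low surrogate, then high surrogate

        // A well-formed pair: two chars, but a single supplementary code point.
        System.out.println(paired.length() + " chars, "
                + codePoints(paired) + " code point(s)");
        System.out.println(Integer.toHexString(paired.codePointAt(0)));

        // The reversed order cannot pair up: two chars, two code points.
        System.out.println(reversed.length() + " chars, "
                + codePoints(reversed) + " code point(s)");
    }
}
```

This prints "2 chars, 1 code point(s)", then "1033c", then
"2 chars, 2 code point(s)".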
Pretty close, right? Sure, you would need the \x{...} patch :-) which I'm
preparing at
http://cr.openjdk.java.net/~sherman/7014645/
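As a rough sketch of how the \x{...} notation can be used, assuming a JDK
whose java.util.regex.Pattern accepts it (Java 7 and later do): the class
name HexNotation and the helper hexPattern are hypothetical, chosen only for
this example.

```java
import java.util.regex.Pattern;

public class HexNotation {
    // Builds an anchored pattern for one code point using \x{h...h}.
    static String hexPattern(int codePoint) {
        return "^\\x{" + Integer.toHexString(codePoint) + "}$";
    }

    public static void main(String[] args) {
        // The text holds U+1033C, stored as a surrogate pair in UTF-16.
        String text = new String(Character.toChars(0x1033C));

        String pattern = hexPattern(0x1033C);
        System.out.println(pattern);
        System.out.println(Pattern.matches(pattern, text));
    }
}
```

On such a JDK this prints ^\x{1033c}$ followed by true: the pattern names the
code point directly, without spelling out its UTF-16 code units.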
Yes, the [\\uhhhh\\ullll] pair inside a class is tricky: the implementation
can't tell whether you want paired or unpaired surrogates. The current
implementation treats them as a paired surrogate, i.e. a supplementary
character. An alternative is to write them as a union,
[[\\uhhhh][\\ullll]\\uhhhh\\ullll], if you also want to match the
"unpaired" surrogates in a string.
For example:
test("\uD800\uDF3C", "^[[\\uD800][\\uDF3C]\\uD800\\uDF3C]+$");
test("\uDF3C\uD800", "^[[\\uD800][\\uDF3C]\\uD800\\uDF3C]+$");
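The same intent can also be spelled with \x{...}, listing both surrogates
and the supplementary character explicitly. This is only a sketch under the
assumption that \x{...} is available (Java 7 and later); the class name
PairedOrUnpaired and its members are invented for illustration.

```java
import java.util.regex.Pattern;

public class PairedOrUnpaired {
    // A class holding the two lone surrogates and the supplementary
    // character U+1033C that the well-formed pair encodes.
    static final String CLASS = "^[\\x{D800}\\x{DF3C}\\x{1033C}]+$";

    static boolean matches(String text) {
        return Pattern.matches(CLASS, text);
    }

    public static void main(String[] args) {
        // Well-formed pair: matched as the single code point U+1033C.
        System.out.println(matches("\uD800\uDF3C"));
        // Reversed order: matched as two unpaired surrogates.
        System.out.println(matches("\uDF3C\uD800"));
        // An unrelated character is not in the class.
        System.out.println(matches("a"));
    }
}
```

This prints true, true, false.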
I assume that if I can get this \x{...} into JDK 7, we all agree we are
done with RL1.1? :-)
-Sherman
On 01/27/2011 12:48 PM, Tom Christiansen wrote:
>
>
>> on 7 with the following output. I modified your test case "slightly"
>> since it appears the UnicodeSet class in our normalizer package does
>> not have the size(), replace it with a normal hashset.
> Does that mean the following now works?
>
> 1. a+b matches "[" + a + b + "]+"
> 2. b+a matches "[" + a + b + "]+"
> 3. a+b matches "[" + b + a + "]+"
> 4. b+a matches "[" + b + a + "]+"
>
> When a and b take on every Unicode code point, meaning
> from U+0000 up to U+10FFFF? If they do not, then one
> is not specifying Unicode code points.
>
> Please correct me if I am wrong, but I believe the following
> code showing how logical code points are *never* mistaken with
> their serialization representations is conforming behaviour--and
> that results other than these would indicate nonconforming behavior:
>
> $ perl -le 'print "\x{1033c}" =~ /^\x{1033c}$/ || 0'
> 1
> $ perl -le 'print "\xF0\x90\x8C\xBC" =~ /^\x{1033c}$/ || 0'
> 0
> $ perl -le 'print "\x{D800}\x{DF3C}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
> 0
>
> $ perl -le 'print "\x{1033c}" =~ /^\x{D800}\x{DF3c}+$/ || 0'
> 0
> $ perl -le 'print "\x{1033c}" =~ /^[\x{D800}\x{DF3c}]+$/ || 0'
> 0
> $ perl -le 'print "\x{1033c}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
> 0
> $ perl -le 'print "\x{1033c}" =~ /^[\xF0\x90\x8C\xBC]+$/ || 0'
> 0
>
> $ perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
> 1
> $ perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
> 1
> $ perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
> 1
> $ perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
> 1
>
> Can Java do that yet? If not, then \uXXXX does not meet RL1.1, and one
> appears to need \x{} or its equivalent to do so--with the proviso from
> the top of this message that it must not be double evaluated for meta
> characters: \x{} must always be a literal code point of that number
> without regard to reinterpretation as UTF-16 or as pattern syntax.
>
> I'm sorry if this is too terse. I do not mean to be in the least bit
> confrontational! I apologize in advance if sounds that way; I really do
> not intend it. It is possible that I have a different way of looking at
> regexes than Java folks have historically considered them. Even if so,
> I believe my way of looking at them accords with tr18's RL1.1 in both
> its letter and its spirit, and that Java's current way fails to meet
> that requirement in either sense.
>
> --tom
>
> #!/bin/sh
> # expected results: 1 0 0 0 0 0 0 1 1 1 1
> perl -le 'print "\x{1033c}" =~ /^\x{1033c}$/ || 0'
> perl -le 'print "\xF0\x90\x8C\xBC" =~ /^\x{1033c}$/ || 0'
> perl -le 'print "\x{D800}\x{DF3C}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
> perl -le 'print "\x{1033c}" =~ /^\x{D800}\x{DF3c}+$/ || 0'
> perl -le 'print "\x{1033c}" =~ /^[\x{D800}\x{DF3c}]+$/ || 0'
> perl -le 'print "\x{1033c}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
> perl -le 'print "\x{1033c}" =~ /^[\xF0\x90\x8C\xBC]+$/ || 0'
> perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
> perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
> perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
> perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'