<i18n dev> RL1.1 Hex Notation

Thu Jan 27 12:48:15 PST 2011

Sherman wrote:

> Oh, I see the problem. Obviously I have been working on jdk7 too long
> and forgot the latest release is still 6:-( There is indeed a bug in
> the previous implementation which I fixed in 7 long time ago (I
> mentioned this in one of the early emails but was not specific, my
> apology), probably should backport to 6 update release asap. The test
> case runs well (the "failures" in literals are expected) 

Could you please elaborate a bit on that?  Code points specified by
value are not to be re-evaluated for pattern-syntax senses ("meta-
ness").  Could you please show one sample string and one sample regex
containing a "\\uXXXX" mention that you expect to fail?  There should 
be no failures at all when doing that.
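
A minimal Java sketch of the distinction being asked about, assuming a
regex-level escape (a backslash followed by uXXXX reaching the regex
compiler) as opposed to the raw character being spliced into the pattern.
The class and method names are illustrative only and come from no test
case in this thread.

```java
import java.util.regex.Pattern;

class EscapeLiteralness {
    // Regex-level escape: the regex compiler itself sees backslash, u, 002A.
    static final String ESCAPED = "a\\u002Ab";
    // The same code point spliced in raw, so the regex compiler sees a bare star.
    static final String RAW = "a" + (char) 0x2A + "b";

    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // A code point given by value should only ever match itself.
        System.out.println("escaped vs \"a*b\": " + matches(ESCAPED, "a*b"));
        System.out.println("escaped vs \"aab\": " + matches(ESCAPED, "aab"));
        // The raw star is pattern syntax: it quantifies the preceding "a".
        System.out.println("raw vs \"a*b\": " + matches(RAW, "a*b"));
        System.out.println("raw vs \"aab\": " + matches(RAW, "aab"));
    }
}
```

If the escaped form ever behaved like the raw form, that would be the
double evaluation for meta-ness that should not happen.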

I tried setting up some smaller tests, but I encountered bugs in the regex
compiler, so I don't trust anything.  The bug I encountered was matching
the string "*" against the pattern "^*$".  Java fails to detect that this
is an invalid regex.  You cannot quantify a zero-width assertion; it should
have raised an exception.  Apparently the compiler is tricked into thinking
that there is a literal "*" there.  That's why I don't trust my correctness
tests on literalness.
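
A small probe for the case above, as a sketch only.  It reports whether the
regex compiler accepts or rejects a pattern, and makes no claim about what
any particular JDK release prints for "^*$".  The class name and the
`probe` helper are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class QuantifiedAnchorProbe {
    // Compile the regex and try a full match; report rejection instead of throwing.
    static String probe(String regex, String input) {
        try {
            boolean matched = Pattern.compile(regex).matcher(input).matches();
            return "accepted, matches=" + matched;
        } catch (PatternSyntaxException e) {
            return "rejected: " + e.getDescription();
        }
    }

    public static void main(String[] args) {
        // A quantifier with nothing before it, for comparison.
        System.out.println("\"*\"   -> " + probe("*", "*"));
        // The quantified anchor described above.
        System.out.println("\"^*$\" -> " + probe("^*$", "*"));
    }
}
```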

> on 7 with the following output. I modified your test case "slightly"
> since it appears the UnicodeSet class in our normalizer package does
> not have the size(), replace it with a normal hashset.

Does that mean the following now works?

    1. a+b matches "[" + a + b + "]+"
    2. b+a matches "[" + a + b + "]+"
    3. a+b matches "[" + b + a + "]+"
    4. b+a matches "[" + b + a + "]+"

That is, does it work when a and b each range over every Unicode
code point, from U+0000 up to U+10FFFF?  If they do not, then one
is not specifying Unicode code points.
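
The four checks above can be sketched in Java roughly as follows.  This
assumes the \x{h...h} escape that java.util.regex accepts from Java 7 on,
and it deliberately leaves surrogate code points out of the sample, since
how those behave is the open question.  The sample set and all names here
are illustrative, not taken from the test case under discussion.

```java
import java.util.regex.Pattern;

class ClassRoundTrip {
    // Syntax characters, a few ordinary ones, and two supplementary code points.
    static final int[] SAMPLE = {
        0x0, 0x2A, 0x2D, 0x5B, 0x5C, 0x5D, 0x5E, 0x41, 0xE9, 0xFFFF, 0x1033C, 0x10FFFF
    };

    // Build a by-value escape such as \x{1033c} (assumes Java 7 or later).
    static String esc(int cp) {
        return "\\x{" + Integer.toHexString(cp) + "}";
    }

    // Build a string from code points rather than from UTF-16 code units.
    static String str(int... cps) {
        return new String(cps, 0, cps.length);
    }

    // Count how many of the four orderings fail, over every pair in the sample.
    static int failures(int[] sample) {
        int failed = 0;
        for (int a : sample) {
            for (int b : sample) {
                Pattern ab = Pattern.compile("[" + esc(a) + esc(b) + "]+");
                Pattern ba = Pattern.compile("[" + esc(b) + esc(a) + "]+");
                for (String s : new String[] { str(a, b), str(b, a) }) {
                    if (!ab.matcher(s).matches()) failed++;
                    if (!ba.matcher(s).matches()) failed++;
                }
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        System.out.println("failures: " + failures(SAMPLE));
    }
}
```

Extending the sample toward the full range, surrogates included, is what
would turn this from a spot check into the test described above.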

Please correct me if I am wrong, but I believe the following
code, showing how logical code points are *never* mistaken for
their serialization representations, is conforming behavior--and
that results other than these would indicate nonconforming behavior:

    $ perl -le 'print "\x{1033c}" =~ /^\x{1033c}$/ || 0'
    1
    $ perl -le 'print "\xF0\x90\x8C\xBC" =~ /^\x{1033c}$/ || 0'
    0
    $ perl -le 'print "\x{D800}\x{DF3C}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
    0

    $ perl -le 'print "\x{1033c}" =~ /^\x{D800}\x{DF3c}+$/ || 0'
    0
    $ perl -le 'print "\x{1033c}" =~ /^[\x{D800}\x{DF3c}]+$/ || 0'
    0
    $ perl -le 'print "\x{1033c}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
    0
    $ perl -le 'print "\x{1033c}" =~ /^[\xF0\x90\x8C\xBC]+$/ || 0'
    0

    $ perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
    1
    $ perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
    1
    $ perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
    1
    $ perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
    1

Can Java do that yet?  If not, then \uXXXX does not meet RL1.1, and one
appears to need \x{} or its equivalent to do so--with the proviso from 
the top of this message that it must not be double evaluated for meta
characters: \x{} must always be a literal code point of that number 
without regard to reinterpretation as UTF-16 or as pattern syntax.
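
A rough Java counterpart to a few of the Perl checks above, as a sketch
under stated assumptions.  It assumes the Java 7+ \x{h...h} escape, and it
cannot reproduce every Perl case: a Java String holds UTF-16 code units, so
the string "high surrogate, low surrogate" and the string "U+1033C" are the
same object contents in Java.  The class name and the `check` helper are
hypothetical, and no output is claimed for the surrogate cases.

```java
import java.util.regex.Pattern;

class CodePointVsUtf16 {
    // One code point, U+1033C, stored as two UTF-16 code units.
    static final String GOTHIC = new String(Character.toChars(0x1033C));
    // The four UTF-8 bytes of U+1033C, each taken as its own character.
    static final String UTF8_BYTES_AS_CHARS =
        "" + (char) 0xF0 + (char) 0x90 + (char) 0x8C + (char) 0xBC;
    // Two lone surrogates in low-then-high order, which do not form a pair.
    static final String LOW_HIGH = "" + (char) 0xDF3C + (char) 0xD800;

    // Mirror Perl's "=~ /regex/ || 0": 1 if the regex is found, else 0.
    static int check(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find() ? 1 : 0;
    }

    public static void main(String[] args) {
        // Code point by value against itself, and against its UTF-8 bytes.
        System.out.println(check(GOTHIC, "^\\x{1033c}$"));
        System.out.println(check(UTF8_BYTES_AS_CHARS, "^\\x{1033c}$"));
        System.out.println(check(GOTHIC, "^\\xF0\\x90\\x8C\\xBC$"));
        // Surrogates given by value: the cases in question, printed as a probe.
        System.out.println(check(GOTHIC, "^[\\x{D800}\\x{DF3C}]+$"));
        System.out.println(check(LOW_HIGH, "^[\\x{D800}\\x{DF3C}]+$"));
    }
}
```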

I'm sorry if this is too terse.  I do not mean to be in the least bit
confrontational!  I apologize in advance if it sounds that way; I really
do not intend it.  It is possible that I have a different way of looking
at regexes than the way Java folks have historically considered them.
Even if so, I believe my way of looking at them accords with tr18's RL1.1
in both its letter and its spirit, and that Java's current way fails to
meet that requirement in either sense.

--tom

    #!/bin/sh
    # expected results: 1 0 0 0 0 0 0 1 1 1 1
    perl -le 'print "\x{1033c}" =~ /^\x{1033c}$/ || 0'
    perl -le 'print "\xF0\x90\x8C\xBC" =~ /^\x{1033c}$/ || 0'
    perl -le 'print "\x{D800}\x{DF3C}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
    perl -le 'print "\x{1033c}" =~ /^\x{D800}\x{DF3c}+$/ || 0'
    perl -le 'print "\x{1033c}" =~ /^[\x{D800}\x{DF3c}]+$/ || 0'
    perl -le 'print "\x{1033c}" =~ /^\xF0\x90\x8C\xBC$/ || 0'
    perl -le 'print "\x{1033c}" =~ /^[\xF0\x90\x8C\xBC]+$/ || 0'
    perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
    perl -le 'print "\x{D800}\x{DF3C}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'
    perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{D800}\x{DF3C}]+$/ || 0'
    perl -le 'print "\x{DF3C}\x{D800}" =~ /^[\x{DF3C}\x{D800}]+$/ || 0'