<i18n dev> RL1.1 Hex Notation (part 1 of 3)

Sat Jan 22 19:14:15 PST 2011

Sherman,

Thank you so much for going out of your way to get your message
to me, all despite my broken mailer.  Thanks to your help, I think
I have finally managed to wrestle it into working right.  But that's
what I said last time, too, so we shall see.

> Introducing in the new perl style \x{...} as the hexadecimal notation
> appears to be a nice-to-have enhancement (I will file a RFE to put this
> request in record). 

Thank you very much.  It seems to make things easier on the programmer, and
I do believe it can be done with a minimum of new code, plus without having
any issues of backwards incompatibility.
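
To give a feel for how little code I have in mind, here is a minimal sketch
of a pattern preprocessor, assuming an engine that does not itself accept
\x{...}.  The class HexBraceRewriter and its rewrite method are names I made
up for illustration; they are not part of java.util.regex.  The sketch also
ignores corner cases such as an escaped backslash sitting right before the
"x", so please read it as a rough idea and not as a proposed patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HexBraceRewriter {
    // Matches a backslash, "x", and one to six hex digits inside braces.
    private static final Pattern HEX_BRACE =
        Pattern.compile("\\\\x\\{([0-9A-Fa-f]{1,6})\\}");

    // Hypothetical helper: replaces each brace-style hex escape in a regex
    // with the quoted literal character for that code point.
    static String rewrite(String regex) {
        Matcher m = HEX_BRACE.matcher(regex);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = Integer.parseInt(m.group(1), 16);
            if (cp > Character.MAX_CODE_POINT) {
                throw new IllegalArgumentException(
                    "not a Unicode code point: " + m.group(1));
            }
            String literal = new String(Character.toChars(cp));
            // Quote the literal so that metacharacters stay literal, then
            // protect the result from appendReplacement's own escaping.
            m.appendReplacement(out,
                Matcher.quoteReplacement(Pattern.quote(literal)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String alien = new String(Character.toChars(0x1F47E));
        String rewritten = rewrite("^\\x{1F47E}$");
        System.out.println(Pattern.matches(rewritten, alien));
    }
}
```

The point is only that the programmer writes the code point itself, and the
translation into whatever the engine uses internally happens out of sight.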

I also understand how with even the smallest and safest changes there are
always related software-engineering matters that the general public seldom
thinks about but which can be substantial.  Minimally I'm thinking of
documentation, test suites, bug-database resolution, and of course release
notes.  But people doing developer tools like editors, debuggers, profilers,
code analysis tools, etc. may well have more work to do, too.

> But I don't think you can simply deny that the Java Unicode escape
> sequences for UTF16 is NOT A "mechanism"/notation for specifying any
> Unicode code point in Java RegEx, in which two consecutive Unicode
> escapes that represent a legal utf16 surrogate pair are interpreted as
> the corresponding supplementary code point.

No, I certainly cannot deny that!  
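
For concreteness, here is a small sketch of the mechanism you describe, using
U+1F47E, whose UTF-16 surrogate pair is D83D DC7E.  The class and method names
are mine, invented for this message.  The first method lets javac resolve the
escapes; the second doubles the backslashes so that the regex engine itself
has to pair them up.  On a release that has the fix you mention, I would
expect both to report a match.

```java
import java.util.regex.Pattern;

public class SurrogateEscapeDemo {
    // U+1F47E as a Java String: two UTF-16 code units, one code point.
    static final String ALIEN = new String(Character.toChars(0x1F47E));

    // The escapes are resolved by javac before the regex engine sees them.
    static boolean viaCompilerEscapes() {
        return Pattern.matches("\uD83D\uDC7E", ALIEN);
    }

    // Doubled backslashes: the regex engine must pair the two escapes itself.
    static boolean viaRegexEscapes() {
        return Pattern.matches("\\uD83D\\uDC7E", ALIEN);
    }

    public static void main(String[] args) {
        System.out.println("compiler escapes: " + viaCompilerEscapes());
        System.out.println("regex escapes:    " + viaRegexEscapes());
    }
}
```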

In my original, unpublished analysis of Java's UTS#18 compliance, I even
gave Java a "pass" grade on RL1.1 Hex Notation.  Only when I went back to
divide up that very long message into smaller, more easily digestible
pieces and read the fine print, so to speak, did I change my mind.

> The tr#18 explains the purpose of having the hex notation requirement as
> "The character set used by the regular expression writer may not be
> Unicode, or may not have the ability to input all Unicode code points from
> a keyboard.", as long as the notation mechanism provided by the Java RegEx
> can serve this purpose, might not be as perfect/direct in some cases, as you
> prefer to, I would not conclude that Java RegEx can not claim "conformance"
> to the TR.

So it's ok if they *do* have to think about serialization matters?  If
serialized UTF-16 is ok, wouldn't UTF-8 then also be ok?  (Not that I'm
suggesting it!! This is just to see where you sit on this.)  Ok, that's
the next part immediately below.

> Regarding to your comment
> -------------------------------------------------

>> But that is in clear violation of what Level 1 must provide:

>>     Level 1: Basic Unicode Support. At this level, the regular expression
>>     engine provides support for Unicode characters as basic logical units.
>>     (This is independent of the actual serialization of Unicode as UTF-8,
>>     UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.) This is a minimal level for
>>     useful Unicode support.

> ---------------------------------------------------

> My interpretation of above note is that the in order to claim "basic
> unicode support" the regex engine need to handle each Unicode character as
> a basic logical unit (code point), no matter what its underlying/internal
> representation is. In case of UTF16, which is used by Java String as its
> internal form, it means the regex engine needs to work on surrogate pair
> for supplementary character, instead of treating them as two separate
> surrogates. This is what Java RegEx engine does, in fact the "first thing"
> (after normalizing the pattern, if required) the engine does is to
> "translate" the input regex pattern from String (utf16) into code point
> form in a int[], each int in the array represents a Unicode code point
> value, internally the engine works on code point value. (if you use double
> backslash to by-pass the javac compiler interpretation, the surrogate pair
> to code point conversion will happen a little later at node-tree build
> stage, we might have a bug in earlier releases, but it should have been

Yes, you do. (Or did?)  See my forthcoming reply number 2 of 3.

> fixed in 7, if not jdk6).  So yes, Java RegEx engine works on Unicode code
> point (as the logical unit) not UTF16 code unit.
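
That much I can confirm with a quick check.  The following is only a sketch
(the class and method names are made up for illustration), but it shows the
matching side of things treating a supplementary character as one logical
unit even though the String holds two code units.

```java
import java.util.regex.Pattern;

public class CodePointUnitDemo {
    // U+1F47E, a supplementary character outside the BMP.
    static final String ALIEN = new String(Character.toChars(0x1F47E));

    // Number of UTF-16 code units in the String.
    static int codeUnits() {
        return ALIEN.length();
    }

    // Number of Unicode code points in the String.
    static int codePoints() {
        return ALIEN.codePointCount(0, ALIEN.length());
    }

    // A single dot should consume the whole character.
    static boolean oneDotMatches() {
        return Pattern.matches(".", ALIEN);
    }

    // Two dots would only match if the engine counted code units.
    static boolean twoDotsMatch() {
        return Pattern.matches("..", ALIEN);
    }

    public static void main(String[] args) {
        System.out.println("code units:  " + codeUnits());
        System.out.println("code points: " + codePoints());
        System.out.println("one dot:     " + oneDotMatches());
        System.out.println("two dots:    " + twoDotsMatch());
    }
}
```

So my concern is not with how the engine matches, but with what the
programmer has to type to name a code point.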

It was this part which caught my eye upon second reading:

    Level 1: Basic Unicode Support. At this level, the regular expression
    engine provides support for Unicode characters as basic logical units.
    (This is independent of the actual serialization of Unicode as UTF-8,
    UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.)

My reading of that is that regexes must treat Unicode characters as logical 
units, not as units related to any serialization such as UTF-8 and UTF-16.
It is only when you apply that to this that I see a problem:

    RL1.1 Hex Notation: To meet this requirement, an
    implementation shall supply a mechanism for specifying any
    Unicode code point (from U+0000 to U+10FFFF).

I don't see how the requirements of both those can be met.  If it
is still conformant to make people enter hex values that are not
the code points themselves but rather those that come from this
or that serialization, then they are not dealing with Unicode
characters as logical units.  

I don't believe it is right to make users have to worry about serialization
matters.  I don't understand how specifying individual UTF-16 code units
instead of logical code points meets the requirement of not having to
deal with serialization issues.  That would be like having a regex language
that required you to enter UTF-8 just because it used UTF-8 internally,
like "\xC3\xA9" for U+00E9 or "\xF0\x9F\x91\xBE" for U+1F47E.  I don't think
that would be OK; do you?  Honestly, it seems to me that this is the very
thing that the standard is trying to guard against, and I don't see how it
should be different for UTF-16 than it is for UTF-8.
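
To make the comparison concrete, here is a rough sketch that prints what a
programmer would have to type for the same character under each scheme:
UTF-8 bytes, UTF-16 code units, or the code point itself.  The class and its
three helper methods are hypothetical names of my own, and the brace form in
the last helper is the notation I am asking for, not something I am claiming
the engine accepts today.

```java
import java.nio.charset.StandardCharsets;

public class SerializedEscapes {
    // One hex escape per UTF-8 byte of the code point.
    static String utf8Escapes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // One four-digit escape per UTF-16 code unit of the code point.
    static String utf16Escapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    // A single escape naming the code point directly (the requested form).
    static String codePointEscape(int codePoint) {
        return String.format("\\x{%X}", codePoint);
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0xE9, 0x1F47E }) {
            System.out.println(utf8Escapes(cp) + "  "
                + utf16Escapes(cp) + "  " + codePointEscape(cp));
        }
    }
}
```

Only the last column is independent of any serialization; the first two
both ask the programmer to do an encoder's job by hand.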

I could certainly be wrong about my reading of all this.  Even though their
authors try to cover all the bases, standards documents are notoriously
difficult to understand with complete clarity.  It may turn out that this
is one of those places where they weren't clear enough for there to be just
one single, unambiguous reading.  Or, as I said, I might just be wrong.

This message is already too long.  I'll continue it presently.
Thank you again very, very much for all your time, work, and
expertise, Sherman.

--tom