<i18n dev> RL1.1 Hex Notation (part 2 of 3)

Tom Christiansen tchrist at perl.com
Wed Jan 26 11:04:43 PST 2011


Mark wrote:

> The Unicode Standard distinguishes between Unicode Strings (16-bit) and
> UTF-16. In the former, which is often the form used in programming
> languages, a singleton value of 0xD800..0xDFFF is allowed, and is treated
> as if it were a reserved code point.

Ahah!  "Unicode Strings (16-bit)" vs "UTF-16".

That was the subtlety I was missing, because I'm most used to working 
with logical code points, or at worst with well-formed UTF-8.  That's
why I was surprised when your TestRegex() sample turned up no troubles
in the surrogate range.  

I even modified your code to add an INSPAN enum plus this:

    Failures.INSPAN.checkMatch(i, "a[" + hexPattern + "-" + hexPattern + "]b", target);

because I couldn't see how a pattern

    a[\uD800-\uD800]b

could possibly be matched, since those are specifying a span
of code points in the UTF-16 surrogate range.   Yet it can.
My confusion derived from the C1 conformance requirement from 
TUS 6.0.0:

    C1  A process shall not interpret a high-surrogate code point 
        or a low-surrogate code point as an abstract character.

    C2  A process shall not interpret a noncharacter code point 
        as an abstract character.

So I was thinking that allowing a surrogate to match /./ was violating
that.  But that wouldn't make sense, considering that /\p{Cs}/ should 
be a usable property.  Reading further, though, clears this up somewhat:

    D14 Noncharacter: A code point that is permanently reserved for internal 
        use and that should never be interchanged. Noncharacters consist of 
        the values U+nFFFE and U+nFFFF (where n is from 0 to 10₁₆) and the 
        values U+FDD0..U+FDEF.

    D15 Reserved code point: Any code point of the Unicode Standard that is 
        reserved for future assignment. Also known as an unassigned code point.

        · Surrogate code points and noncharacters are considered assigned
          code points, but not assigned characters.

Also, there's D77's 

     · In the Unicode Standard, specific values of some code units cannot 
       be used to represent an encoded character in isolation.  This 
       restriction applies to isolated surrogate code units in UTF-16 
       and to the bytes 80-FF in UTF-8. [...]
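
To make D14 and the note under D15 concrete, here is a minimal Java sketch
of the two classifications.  The class and method names are mine, not from
the Standard or from Mark's test code; only Character.getType() and the
constants are real JDK API.

```java
public class CodePointKinds {
    // D14: U+nFFFE and U+nFFFF for every plane n (0..0x10), plus U+FDD0..U+FDEF.
    static boolean isNoncharacter(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            return false;
        }
        return (cp & 0xFFFE) == 0xFFFE || (cp >= 0xFDD0 && cp <= 0xFDEF);
    }

    // Surrogate code points: U+D800..U+DFFF (General_Category Cs).
    static boolean isSurrogateCodePoint(int cp) {
        return cp >= 0xD800 && cp <= 0xDFFF;
    }

    // Counts the noncharacters by brute force over the whole code space.
    static int countNoncharacters() {
        int count = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isNoncharacter(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("noncharacters: " + countNoncharacters());
        System.out.println("U+D800 is Cs: "
            + (Character.getType(0xD800) == Character.SURROGATE));
        System.out.println("U+FDD0 is a noncharacter: " + isNoncharacter(0xFDD0));
    }
}
```

Both kinds are code points with a fixed designation, which is why a regex
engine can reasonably classify them rather than refuse them.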

People often think of these as "illegal" code points, or as "not a
character", but I now see, upon a close reading of The Unicode Standard,
that these code points can occur in data.  After all, if you have to
be able to build up a buffer a UTF-16 code unit at a time, unpaired
surrogates have to exist even temporarily.  As far as I understand it,
such code points should not occur in data used for interchange, but
may occur within an application.  
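
The buffer case is easy to see in Java, where a StringBuilder is a sequence
of 16-bit code units.  This is only an illustrative sketch; the class and
the helper isWellFormedUtf16() are hypothetical names of my own, built on
the real Character.isHighSurrogate()/isLowSurrogate() methods.

```java
public class BufferBuildDemo {
    // True if every surrogate in s is part of a high+low pair.
    static boolean isWellFormedUtf16(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++;            // skip the low half of a valid pair
                } else {
                    return false;   // unpaired high surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return false;       // low surrogate with no preceding high
            }
        }
        return true;
    }

    public static void main(String[] args) {
        char[] units = Character.toChars(0x10400);  // one supplementary code point
        StringBuilder sb = new StringBuilder("a");

        sb.append(units[0]);        // buffer now ends in an unpaired surrogate
        System.out.println("after first unit:  " + isWellFormedUtf16(sb));

        sb.append(units[1]);        // pair completed
        System.out.println("after second unit: " + isWellFormedUtf16(sb));
        System.out.println("code points: " + sb.codePointCount(0, sb.length()));
    }
}
```

Between the two append() calls the buffer is a perfectly good 16-bit
Unicode string, but it is not well-formed UTF-16.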

What the various encoders do with these is not always clear or consistent,
although I suspect this is more a library matter than an issue with
the Standard itself.  
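
As one example of that inconsistency, here is a sketch of two JDK paths for
encoding the same ill-formed string to UTF-8.  The class and method names
are illustrative only, and StandardCharsets assumes Java 7 or later, which
is newer than this thread; Charset.forName("UTF-8") would do on older JDKs.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class LoneSurrogateEncoding {
    // String.getBytes() substitutes the charset's replacement bytes
    // for malformed input instead of failing.
    static byte[] lenientUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // A fresh CharsetEncoder reports malformed input by default,
    // so encode() throws on an unpaired surrogate.
    static boolean strictUtf8Rejects(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String s = "a\uD800b";      // lone high surrogate in the middle
        byte[] bytes = lenientUtf8(s);
        System.out.printf("lenient: %d bytes, middle byte 0x%02X%n",
            bytes.length, bytes[1]);
        System.out.println("strict encoder rejects: " + strictUtf8Rejects(s));
    }
}
```

Same library, same string, two different outcomes: silent replacement with
"?" on one path and an exception on the other.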

I therefore withdraw my doubts regarding java.util.regex meeting tr18-13's
RL1.7 requirement:

    RL1.7 Supplementary Code Points

    To meet this requirement, an implementation shall handle the full
    range of Unicode code points, including values from U+FFFF to
    U+10FFFF. In particular, where UTF-16 is used, a sequence consisting
    of a leading surrogate followed by a trailing surrogate shall be
    handled as a single code point in matching.
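
Both halves of that can be checked against java.util.regex in a few lines.
This is my own minimal sketch, not Mark's TestRegex(); the class and method
names are made up for illustration.

```java
import java.util.regex.Pattern;

public class SurrogateMatchDemo {
    // A lone high surrogate between 'a' and 'b': a legal 16-bit
    // Unicode string, though not well-formed UTF-16.
    static final String LONE = "a\uD800b";

    // The backslashes are doubled so that the regex engine, not the
    // Java compiler, interprets \uD800.
    static boolean loneSurrogateMatchesClass() {
        return Pattern.matches("a[\\uD800-\\uD800]b", LONE);
    }

    static boolean loneSurrogateMatchesDot() {
        return Pattern.matches("a.b", LONE);
    }

    static boolean loneSurrogateMatchesCs() {
        return Pattern.matches("a\\p{Cs}b", LONE);
    }

    // RL1.7: a leading surrogate followed by a trailing surrogate
    // is handled as one code point, so a single dot consumes both units.
    static boolean pairMatchesSingleDot() {
        String pair = new String(Character.toChars(0x1D504));
        return pair.length() == 2 && Pattern.matches(".", pair);
    }

    public static void main(String[] args) {
        System.out.println("class: " + loneSurrogateMatchesClass());
        System.out.println("dot:   " + loneSurrogateMatchesDot());
        System.out.println("Cs:    " + loneSurrogateMatchesCs());
        System.out.println("pair:  " + pairMatchesSingleDot());
    }
}
```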

I do see that way back in tr18-6 of 2002-04-21, the language was clearer:

    While surrogate pairs could be used to identify code points above
    FFFF₁₆, that mechanism is clumsy.  It is much more useful to provide
    specific syntax for specifying Unicode code points [...]

It's a pity some of that earlier language wasn't retained, either for 
RL1.7 or, more likely, for RL1.1.  It might have made the intent
of RL1.1 more obvious to all readers.

--tom

