<i18n dev> RL1.1 Hex Notation

Mon Jan 24 19:14:20 PST 2011

Sherman wrote:

> Introducing in the new perl style \x{...} as the hexadecimal notation
> appears to be a nice-to-have enhancement (I will file a RFE to put this
> request in record). But I don't think you can simply deny that the Java
> Unicode escape sequences for UTF16 is NOT A "mechanism"/notation for
> specifying any Unicode code point in Java RegEx, in which two
> consecutive Unicode escapes that represent a legal utf16 surrogate pair
> are interpreted as the corresponding supplementary code point.

I realize we've already gone over this, and I think we both agree it isn't
all that big of a deal, given that it is not altogether impossible under
the current system and given also that you will file an RFE about it.
(Plus it's not much code.)

But I've uncovered something in tr18 I hadn't noticed before.  In their
examples they specifically include a code point from above the BMP, U+10450
SHAVIAN LETTER PEEP.  I now believe it significant that they did *not*
show this code point using a pair of UTF-16 code units as in \uD801\uDC50,
but instead invented a brand new syntax: \U00010450.
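
For reference, here is a minimal sketch of what the surrogate notation
looks like from the Java side.  It assumes, per Sherman's description
above, that the regex parser folds two consecutive backslash-u escapes
forming a legal surrogate pair into a single code point.  The class and
method names are made up for illustration.

```java
import java.util.regex.Pattern;

class SurrogatePairDemo {
    // U+10450 SHAVIAN LETTER PEEP, built from its code point.
    // As a Java String it occupies two UTF-16 code units.
    static final String PEEP = new String(Character.toChars(0x10450));

    // The backslashes are doubled, so the Java compiler leaves the
    // escapes alone and the regex parser is the one that reads them.
    static final Pattern PAIR = Pattern.compile("[a\\uD801\\uDC50]");

    // True only if the class treats the pair as one logical character:
    // matches() must consume both code units with a single class match.
    static boolean pairMatchesPeep() {
        return PAIR.matcher(PEEP).matches();
    }

    // The plain BMP member of the class still matches as usual.
    static boolean pairMatchesA() {
        return PAIR.matcher("a").matches();
    }

    public static void main(String[] args) {
        System.out.println("code units:   " + PEEP.length());
        System.out.println("code points:  " + PEEP.codePointCount(0, PEEP.length()));
        System.out.println("matches PEEP: " + pairMatchesPeep());
        System.out.println("matches a:    " + pairMatchesA());
    }
}
```

So it can be done, but the author has to compute the surrogate pair by
hand, and nothing in the pattern shows the reader that U+10450 is meant.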

If you look back through the revisions to tr18, you'll see that this was
specifically added not all that long after Unicode went from 16 bits to
21 bits.  It first appeared in revision 7 of tr18, released 2003-05-15:

    http://www.unicode.org/reports/tr18/tr18-7.html#Hex_notation

To me this evidence strongly suggests that they really *do* intend that
folks with non-BMP code points *not* have to write out a surrogate pair's
hex values to specify a single logical character in regexes.  If they
thought \uXXXX\uXXXX sufficed, they would not have needed to make the
update that they intentionally put in there for \UXXXXXXXX.  Because they
did so, I believe surrogate notation is not enough to meet this requirement.

It's just as well that Java can't do \UXXXXXXXX the way Python requires.
Java can't because its regexes have already adopted the Perl "translation"
escapes, including \Q and \U, which means \U is already taken.  I say it's
just as well because I don't like how you'd have to write out all 8 hex
digits every time (to avoid ambiguity), when in fact no 21-bit code point
ever needs more than 6 of them.  Because \x{XXX} has braces around it,
it's safe from meaning something else even if there are more hex digits
immediately afterwards.
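
A minimal sketch of the brace point, assuming a java.util.regex.Pattern
that accepts the \x{...} form (Java 7 and later do).  The class and
method names are hypothetical.

```java
import java.util.regex.Pattern;

class BracedHexDemo {
    // U+10450 SHAVIAN LETTER PEEP, built from its code point.
    static final String PEEP = new String(Character.toChars(0x10450));

    // The closing brace ends the escape, so the "0" after it is an
    // ordinary literal digit and not part of the code point.
    static boolean bracedThenDigit() {
        return Pattern.compile("\\x{10450}0").matcher(PEEP + "0").matches();
    }

    // No zero padding is needed inside the braces.
    static boolean shortForm() {
        String eAcute = new String(Character.toChars(0xE9));
        return Pattern.compile("\\x{E9}").matcher(eAcute).matches();
    }

    // By contrast, the brace-less \xhh form always takes exactly two
    // hex digits, so this pattern means U+0041 followed by a literal "0".
    static boolean fixedWidth() {
        return Pattern.compile("\\x410").matcher("A0").matches();
    }

    public static void main(String[] args) {
        System.out.println("braced then digit: " + bracedThenDigit());
        System.out.println("short form:        " + shortForm());
        System.out.println("fixed width:       " + fixedWidth());
    }
}
```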

--tom

    RL1.1 Hex Notation

    To meet this requirement, an implementation shall supply a mechanism
    for specifying any Unicode code point (from U+0000 to U+10FFFF).

    A sample notation for listing hex Unicode characters within strings is
    by prefixing four hex digits with "\u" and prefixing eight hex digits
    with "\U". This would provide for the following addition:

        <codepoint> := <character>

        <codepoint> := ESCAPE U_SHORT_MARK
                       HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR

        <codepoint> := ESCAPE U_LONG_MARK
                       HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR
                       HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR

        U_SHORT_MARK := "u"
        U_LONG_MARK := "U"

    Examples:

        [\u3040-\u309F \u30FC]  Match Hiragana characters, plus prolonged sound sign
        [\u00B2 \u2082]         Match superscript and subscript 2
        [a \U00010450]          Match "a" or U+10450 SHAVIAN LETTER PEEP

    Note: instead of [...\u3040...], an alternate syntax
          is [...\x{3040}...], as in Perl 5.6 and later.

    Note: more advanced regular expression engines can also offer the
          ability to use the Unicode character name for readability.
          See 2.5 Name Properties.