<i18n dev> RL1.1 Hex Notation

Mon Jan 24 20:10:56 PST 2011

Tom,

I would not over-read this too much :-) There is no reason for tr#18 to
use any specific encoding in the specification; it is a perfectly good
choice to simply pick a syntax that uses the code point value directly.
However, I don't think this "sample" syntax (even if it is read as a
"recommendation") prevents real-world implementations from using whatever
reasonable notation achieves the same goal. It was the decision of JSR204,
back in JDK 1.5, that the Java language uses a pair of UTF-16 surrogates
as the notation for a supplementary character. The supplementary character
support in j.u.regex is part of the JSR204 specification. I would assume
that the JSR204 expert group back then believed that the Java Unicode
escapes (\unnnn) and the surrogate pair were good enough as the notation
for all Unicode code points, and I totally agree with that. That said, I
still believe that \x{...} is a nice-to-have regex construct for people
who want a more "direct" representation in their regex.
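
To make the two notations concrete, here is a rough sketch rather than
anything official. The class and method names are made up for the example,
and the \x{...} form assumes a java.util.regex that accepts it, which means
Java 7 or later.

```java
import java.util.regex.Pattern;

// Illustrative only: SupplementaryEscapes and its methods are hypothetical names.
final class SupplementaryEscapes {
    // U+10450 SHAVIAN LETTER PEEP as a Java String (two UTF-16 code units).
    static final String PEEP = new String(Character.toChars(0x10450));

    // Surrogate-pair notation: the regex parser reads the two consecutive
    // Unicode escapes and treats them as the single code point U+10450.
    static boolean matchesWithSurrogatePair(String s) {
        return Pattern.matches("[a\\uD801\\uDC50]", s);
    }

    // Brace notation: the code point value is written directly in hex.
    // Assumption: requires Java 7 or later.
    static boolean matchesWithBraces(String s) {
        return Pattern.matches("[a\\x{10450}]", s);
    }

    public static void main(String[] args) {
        System.out.println(matchesWithSurrogatePair(PEEP));
        System.out.println(matchesWithBraces(PEEP));
        System.out.println(matchesWithBraces("a"));
    }
}
```

Both character classes are meant to describe the same set, "a" or U+10450;
only the spelling of the supplementary code point differs.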

-Sherman

On 1-24-2011 19:14 07:14 PM, Tom Christiansen wrote:
> Sherman wrote:
>
>> Introducing the Perl-style \x{...} as a hexadecimal notation appears
>> to be a nice-to-have enhancement (I will file an RFE to put this
>> request on record). But I don't think you can simply deny that the Java
>> Unicode escape sequences for UTF-16 are a "mechanism"/notation for
>> specifying any Unicode code point in Java RegEx, in which two
>> consecutive Unicode escapes that represent a legal UTF-16 surrogate pair
>> are interpreted as the corresponding supplementary code point.
> I realize we've already gone over this, and I think we both agree it isn't
> all that big of a deal, given that it is not altogether impossible under
> the current system and given also that you will file an RFE about it.
> (Plus it's not much code.)
>
> But I've uncovered something in tr18 I hadn't noticed before.  In their
> examples they specifically include a code point from above BMP, U+10450
> SHAVIAN LETTER PEEP.  I now believe it significant that they did *not*
> show this code point using a pair of UTF-16 code units as in \uD801\uDC50,
> and that they instead invented a brand new syntax: \U00010450.
>
> If you look back through the revisions to tr18, you'll see that this was
> specifically added not all that long after Unicode went from 16 bits to
> 21 bits.  It first appeared in revision 7 of tr18, released 2003-05-15:
>
>      http://www.unicode.org/reports/tr18/tr18-7.html#Hex_notation
>
> To me this evidence strongly suggests that they really *do* intend that
> folks with non-BMP code points *not* have to write a pair of surrogates'
> hex values to specify a single logical character in regexes.  If they
> thought two \uXXXX \uXXXX sufficed, they would not have needed to make the
> update that they intentionally put in there for \UXXXXXXXX.  Because they
> did so, I believe surrogate notation is not enough to meet this requirement.
>
> It's just as well that Java can't do \UXXXXXXXX the way Python requires.
> Java can't because its regexes have already adopted the Perl "translation"
> escapes, including \Q and \U, which means \U is already taken.  I say it's
> just as well because I don't like how you'd have to write out all 8 hex
> digits every time (to avoid ambiguity), when in fact you will never need
> them all for any 21-bit code point.  Because \x{XXX} has braces around it,
> it's safe from meaning something else even if there are more hex digits
> immediately afterwards.
>
> --tom
>
>      RL1.1 Hex Notation
>
>      To meet this requirement, an implementation shall supply a mechanism
>      for specifying any Unicode code point (from U+0000 to U+10FFFF).
>
>      A sample notation for listing hex Unicode characters within strings is
>      by prefixing four hex digits with "\u" and prefixing eight hex digits
>      with "\U". This would provide for the following addition:
>
>          <codepoint>  := <character>
>
>          <codepoint>  := ESCAPE U_SHORT_MARK
>                         HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR
>
>          <codepoint>  := ESCAPE U_LONG_MARK
>                         HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR
>                         HEX_CHAR HEX_CHAR HEX_CHAR HEX_CHAR
>
>          U_SHORT_MARK := "u"
>          U_LONG_MARK := "U"
>
>      Examples:
>
>          [\u3040-\u309F \u30FC]  Match Hiragana characters, plus prolonged sound sign
>          [\u00B2 \u2082]         Match superscript and subscript 2
>          [a \U00010450]          Match "a" or U+10450 SHAVIAN LETTER PEEP
>
>      Note: instead of [...\u3040...], an alternate syntax
>            is [...\x{3040}...], as in Perl 5.6 and later.
>
>      Note: more advanced regular expression engines can also offer the
>            ability to use the Unicode character name for readability.
>            See 2.5 Name Properties.
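
For what it's worth, the eight-digit long form in the sample notation quoted
above could be mapped mechanically onto the brace form. A minimal sketch,
assuming a regex engine that accepts \x{...} (Java 7 or later): the class
Tr18HexNotation and the helper toJavaRegex are hypothetical names, and the
sketch does not try to handle an escaped backslash in front of the U.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: Tr18HexNotation and toJavaRegex are hypothetical names.
final class Tr18HexNotation {
    // A backslash, an upper-case U, then exactly eight hex digits.
    private static final Pattern LONG_FORM =
            Pattern.compile("\\\\U([0-9A-Fa-f]{8})");

    // Rewrites each eight-digit long-form escape into the brace form,
    // rejecting values above U+10FFFF.
    static String toJavaRegex(String tr18Pattern) {
        Matcher m = LONG_FORM.matcher(tr18Pattern);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Parse as long: eight hex digits can exceed Integer.MAX_VALUE.
            long value = Long.parseLong(m.group(1), 16);
            if (value > Character.MAX_CODE_POINT) {
                throw new IllegalArgumentException(
                        "Not a Unicode code point: " + m.group(1));
            }
            String braces = "\\x{" + Integer.toHexString((int) value).toUpperCase() + "}";
            // quoteReplacement keeps the backslash literal in the output.
            m.appendReplacement(out, Matcher.quoteReplacement(braces));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String converted = toJavaRegex("[a\\U00010450]");
        System.out.println(converted);
        String peep = new String(Character.toChars(0x10450));
        System.out.println(Pattern.matches(converted, peep));
    }
}
```

The braces are what make the shorter spelling safe: leading zeros can be
dropped without the escape running into any hex digits that follow it.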