<i18n dev> RL1.7 Code Points

Tom Christiansen tchrist at perl.com
Wed Jan 26 12:04:00 PST 2011


On Monday, 24 January 2011 at 14:39:59 +0900, 
Masayoshi Okutsu <masayoshi.okutsu at oracle.com> wrote:

>>> Are you talking about unpaired surrogates or something else?

>> Yes, I am talking about unpaired surrogates.

> I believe each code unit of UTF-16 gets converted to its code point. So, 
> an unpaired surrogate gets converted to a surrogate code point. So, it's 
> still processed based on code points?

Apparently so.  I misunderstood what constituted proper handling of
unpaired surrogates within the regex engine.  That's because I made
incorrect inferences when reading this passage from section 3.2,
Conformance Requirements:

    http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

    C1  A process shall not interpret a high-surrogate code point 
        or a low-surrogate code point as an abstract character.

I thought C1 meant to forbid matching an unpaired surrogate with, say,
the "." metacharacter, because the "." metacharacter means* an abstract
character, by which I understand a single code point.  I had not
realized that, as reserved code points, unpaired surrogates could still
be matched.  I had thought of them as non-characters, not as abstract
characters.
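To make the behaviour described above concrete, here is a minimal sketch in Python.  It is not the Java regex engine under discussion: the helper utf16_units_to_code_points is a hypothetical name introduced only for illustration, and Python's re module stands in for a code-point-based matcher.  The sketch assumes the conversion works as Masayoshi describes, with a well-formed surrogate pair becoming one supplementary code point and an unpaired surrogate passing through as a surrogate code point.

```python
import re
import unicodedata


def utf16_units_to_code_points(units):
    """Turn a list of 16-bit code units into a list of code points.

    A high surrogate followed by a low surrogate is combined into one
    supplementary code point; any other unit, including an unpaired
    surrogate, is passed through as the code point of the same value.
    (Hypothetical helper, written for this illustration only.)
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            code_points.append(unit)
            i += 1
    return code_points


# U+1F600 as a proper pair, then a lone high surrogate, then "A".
units = [0xD83D, 0xDE00, 0xD800, 0x0041]
code_points = utf16_units_to_code_points(units)
text = "".join(chr(cp) for cp in code_points)

# Each code point, paired or not, is consumed by exactly one dot.
dot_matches = re.findall(r".", text)

# The lone surrogate has general category Cs (surrogate).
lone_category = unicodedata.category(chr(0xD800))
```

Under these assumptions the four code units yield three code points, and dot matches three times, once of them against the lone surrogate U+D800.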

That said, I'm still trying to reconcile C1 to all this.  I think 

--tom

[*] Well, in this interpretation.  There are other interpretations
    in which dot would match something else.  For example, under tr18's
    2.2.1 Grapheme Cluster Mode, dot "behaves like \X; that is, matches 
    a full extended grapheme cluster going forward."  From:

	http://www.unicode.org/reports/tr18/#Default_Grapheme_Clusters

    In Perl5, you have to use \X to get \X. :)  However, in
    Perl6's grapheme mode, dot matches a language-independent
    grapheme.  That's the 3rd highest level, just short of
    matching language-dependent notions of "characters".

	http://perlcabal.org/syn/S05.html#Modifiers

	New modifiers specify Unicode level: 

	     m:bytes  / .**2 /       # match two bytes
	     m:codes  / .**2 /       # match two codepoints
	     m:graphs / .**2 /       # match two language-independent graphemes
	     m:chars  / .**2 /       # match two characters at current max level

        There are corresponding pragmas to default to these levels. Note that
        the :chars modifier is always redundant because dot always matches
        characters at the highest level allowed in scope. This highest level
        may be identical to one of the other three levels, or it may be more
        specific than :graphs when a particular language's character rules are
        in use. Note that you may not specify language-dependent character
        processing without specifying which language you're depending on.

        [Conjecture: the :chars modifier could take an argument
         specifying which language's rules to use for this match.]
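The difference between the byte, code point, and grapheme levels quoted above can be sketched in Python as well.  This is only an illustration: the grapheme count here is a crude approximation that counts code points with canonical combining class 0, which is an assumption of this sketch and not the extended grapheme cluster rules that \X follows.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived
# character built from two code points.
text = "e\u0301"

byte_count = len(text.encode("utf-8"))   # number of UTF-8 bytes
code_point_count = len(text)             # number of code points

# Crude stand-in for grapheme counting (an assumption of this sketch):
# count only code points whose canonical combining class is 0.
# Real extended grapheme clusters follow the rules in UAX #29.
approx_grapheme_count = sum(1 for ch in text if unicodedata.combining(ch) == 0)
```

For this string the three levels give three different counts, which is why a dot that matches "one character" needs to say which level it means.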

