<i18n dev> RL1.7 Code Points
Tom Christiansen
tchrist at perl.com
Wed Jan 26 12:04:00 PST 2011
On Monday, 24 January 2011 at 14:39:59 +0900,
Masayoshi Okutsu <masayoshi.okutsu at oracle.com> wrote:
>>> Are you talking about unpaired surrogates or something else?
>> Yes, I am talking about unpaired surrogates.
> I believe each code unit of UTF-16 gets converted to its code point. So,
> an unpaired surrogate gets converted to a surrogate code point. So, it's
> still processed based on code points?
Apparently so. I misunderstood what constituted proper handling of
unpaired surrogates within the regex engine. That's because I made
incorrect inferences when reading this from section 3.2
Conformance Requirements:
http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
C1 A process shall not interpret a high-surrogate code point
or a low-surrogate code point as an abstract character.
I thought C1 meant to forbid matching an unpaired surrogate with, say,
the "." metacharacter, because the "." metacharacter means* an abstract
character, by which I understand a single code point. I had
not realized that, as reserved code points, unpaired surrogates could still be
matched. I had thought of them as non-characters, not as abstract characters.
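
To make that concrete, here is a minimal sketch, assuming the engine under
discussion is java.util.regex. Nothing in this thread shows code, so the
class name UnpairedSurrogateDot and the helper dotMatchesWhole are
hypothetical. It checks two things: that a lone UTF-16 code unit in the
surrogate range reads back as a surrogate code point, and that "." is
willing to match it.

```java
import java.util.regex.Pattern;

final class UnpairedSurrogateDot {
    // Hypothetical helper: true if "." matches the entire input.
    static boolean dotMatchesWhole(String input) {
        return Pattern.compile(".").matcher(input).matches();
    }

    public static void main(String[] args) {
        // One UTF-16 code unit: a high surrogate with no low surrogate after it.
        String loneHigh = String.valueOf((char) 0xD800);
        // Two UTF-16 code units forming one supplementary code point (U+1F600).
        String pair = new String(Character.toChars(0x1F600));

        // The lone code unit is reported as the surrogate code point itself.
        System.out.println(Integer.toHexString(loneHigh.codePointAt(0)));
        System.out.println(Character.isSurrogate(loneHigh.charAt(0)));

        // Does "." accept the unpaired surrogate as one code point?
        System.out.println(dotMatchesWhole(loneHigh));

        // The well-formed pair is two code units but is consumed by one ".".
        System.out.println(pair.length());
        System.out.println(dotMatchesWhole(pair));
    }
}
```

If the engine really works on code points as described, both calls to
dotMatchesWhole should report a match, even though one input is a single
code unit and the other is two.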
That said, I'm still trying to reconcile C1 with all this. I think
--tom
[*] Well, in this interpretation. There are other interpretations
in which dot would match something else. For example, under tr18's
2.2.1 Grapheme Cluster Mode, dot "behaves like \X; that is, matches
a full extended grapheme cluster going forward." From:
http://www.unicode.org/reports/tr18/#Default_Grapheme_Clusters
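
The difference between the two readings of dot can be sketched as follows.
This assumes a java.util.regex that supports \X (Java 9 or later, as far as
I know); the class name DotVersusGraphemeCluster and its helper are
hypothetical. The input is "e" followed by U+0301 COMBINING ACUTE ACCENT,
which is two code points but one extended grapheme cluster.

```java
import java.util.regex.Pattern;

final class DotVersusGraphemeCluster {
    // Hypothetical helper: true if the regex matches the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // "e" + COMBINING ACUTE ACCENT: two code points, one grapheme cluster.
        String eAcute = "e" + (char) 0x0301;

        // Code point reading of dot: one "." covers only one code point.
        System.out.println(matchesWhole(".", eAcute));
        System.out.println(matchesWhole("..", eAcute));

        // Grapheme cluster reading: \X covers the base plus its mark.
        System.out.println(matchesWhole("\\X", eAcute));
    }
}
```

Under tr18's Grapheme Cluster Mode, the first pattern would behave like the
third one.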
In Perl5, you have to use \X to get \X. :) However, in
Perl6's grapheme mode, dot matches a language-independent
grapheme. That's the 3rd highest level, just short of
matching language-dependent notions of "characters".
http://perlcabal.org/syn/S05.html#Modifiers
New modifiers specify Unicode level:
    m:bytes  / .**2 /    # match two bytes
    m:codes  / .**2 /    # match two codepoints
    m:graphs / .**2 /    # match two language-independent graphemes
    m:chars  / .**2 /    # match two characters at current max level
There are corresponding pragmas to default to these levels. Note that
the :chars modifier is always redundant because dot always matches
characters at the highest level allowed in scope. This highest level
may be identical to one of the other three levels, or it may be more
specific than :graphs when a particular language's character rules are
in use. Note that you may not specify language-dependent character
processing without specifying which language you're depending on.
[Conjecture: the :chars modifier could take an argument
specifying which language's rules to use for this match.]
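
For comparison, here is a rough Java sketch of what the first three levels
count for one string. It is only an approximation of the Perl6 modifiers:
the class name UnicodeLevels and its methods are hypothetical, the byte
level assumes UTF-8 (S05 as quoted does not say which encoding), and the
grapheme level uses java.text.BreakIterator as a stand-in for
language-independent graphemes.

```java
import java.nio.charset.StandardCharsets;
import java.text.BreakIterator;
import java.util.Locale;

final class UnicodeLevels {
    // Roughly :bytes, assuming a UTF-8 encoding of the string.
    static int bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Roughly :codes, counting code points rather than UTF-16 code units.
    static int codes(String s) {
        return s.codePointCount(0, s.length());
    }

    // Roughly :graphs, counting user-perceived characters via BreakIterator.
    static int graphs(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "e" + COMBINING ACUTE ACCENT, followed by U+1F600.
        String s = "e" + (char) 0x0301 + new String(Character.toChars(0x1F600));
        System.out.println("bytes:  " + bytes(s));
        System.out.println("codes:  " + codes(s));
        System.out.println("graphs: " + graphs(s));
    }
}
```

The three counts differ for this input, which is why a bare ".**2" means
different things at each level.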