<i18n dev> Proposed update to UTS#18
Tom Christiansen
tchrist at perl.com
Fri Apr 15 13:46:18 PDT 2011
Andy Heninger <aheninger at google.com> wrote:
>>> I actually had to do this because I have a dataset that has things like
>>> "undeaðlich" and "smørrebrød", and I wanted to allow the user to
>>> head-match with "undead" and "smor", respectively. There is no
>>> decomposition of "ð" that includes "d", nor any of "ø" that includes "o".
>>> But the UCA primary strengths are the same. It worked very well.
>>
>>> It's a very useful feature, and I'm glad that tr18 includes mention of it.
>>> I just wish we could get it into our regex engines so I didn't have to
>>> do it all by hand. :)
> On Fri, Apr 15, 2011 at 8:01 AM, Mark Davis ☕ <mark at macchiato.com> wrote:
>> The biggest issue is that any transformation that changes the number of
>> characters, or rearranges them, is problematic, for the reasons outlined in
>> the PRI.
>>
>> An example might be /(a|b|c*(?=...)|...)(d|...|a)/, which for Danish (under
>> a collation transform, strength 2) should match any of {aa, aA,...å, Å,
>> Å,...}, as should /(a|b|c*(?=...)|...)(d|...|\x{308})/
>>
>> What *is* relatively straightforward to do is to construct a regex
>> targeted at a known transformation (like NFC), and then transform the input
>> text. There will be some difficulties in mapping between indexes for
>> grouping, however. Most regex engines can't handle in their API
>> discontiguous groups.
> I suspect a match where the fundamental atomic unit of matching was grapheme
> clusters, or combining sequences, would produce useful results.
> No discontinuous results. Results independent of normalization form, or
> lack of normalization, of the input. No ability of the match to look inside
> of, or partially match, combining sequences.
> I also think that we should avoid making recommendations that haven't been
> implemented and proved to be useful and practical.
I agree we should look at existing practice to see what people have come
up with, and what does, and what does not, work.
Last night I wrote, and then deleted, a great deal of text talking about this
and the solutions that I had found useful in practice. I decided it was too
long and threw it all away. All I really ended up saying is that UCA matches
at collation strengths 1 and 2 had proven useful for me.
There are two issues. One relates to decomposition, the other to UCA comparisons.
Consider two situations. The user will think in graphemes, so I will, too.
By grapheme, I mean a user-perceived "character".
* Case one has the user wanting to match any grapheme starting with an "a".
* Case two has the user wanting to match any grapheme starting with an "a"
but which also has a circumflex.
The first case appears to be reasonably easy. The second case probably does not.
But I believe both are harder than they look.
The obvious thing to do for case one, and the thing we've likely all done, is
to use canonical decomposition. That is "safe" because the number of code points
never changes when you take the NFD of a string.
NFD($string) =~ /(?=a)\X/
or for embedding, then /(?:(?=a)\X)/; that's a "loose" match of an "a"
that works no matter whether it is in NFC or NFD or something else.
In fact, with case folding (for case insensitive matching) it even works
for ANGSTROM SIGN, because that has an NFD that turns into a regular "A".
If you pre-NFD the string, the matching engine doesn't have to account
for NFD-matching. This breaks down in case two, though.
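The pre-NFD trick can be sketched like so (in Python with the stdlib
unicodedata module, rather than the Perl of the snippets above;
head_matches_a is a hypothetical helper name, standing in for a
case-insensitive /(?=a)\X/ on the NFD form):

```python
import unicodedata

def head_matches_a(s: str) -> bool:
    """Loose head-match of "a": take the NFD of the input, then test
    the first code point case-insensitively. A crude stand-in for
    NFD($string) =~ /(?=a)\\X/i."""
    nfd = unicodedata.normalize("NFD", s)
    return nfd[:1].casefold() == "a"

print(head_matches_a("\u00E2"))  # U+00E2 a-with-circumflex: NFD starts with 'a' -> True
print(head_matches_a("\u212B"))  # U+212B ANGSTROM SIGN: NFD is 'A' + ring above -> True
print(head_matches_a("b"))       # -> False
```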
However, even then it is not as easy as that, because there are
many user-perceived characters it does not work with. Some of these do
work with a compatibility decomposition, although others do not.
Even when NFKD "works", you now have the problem of one grapheme mapping to
multiple graphemes. Consider:
ẚ 1E9A GC=Ll LATIN SMALL LETTER A WITH RIGHT HALF RING
The NFD of that is the same, because there is no combining half ring.
Instead there is a modifier letter, which is a separate grapheme.
This is two graphemes:
LATIN SMALL LETTER A
MODIFIER LETTER RIGHT HALF RING
Now, you cannot blame the user for not knowing whether Unicode
happens to have an NFD for that which works, versus needing an NFKD.
What do you do about that "a" match again? If you pre-NFKD it,
things don't work at all. Look what happens with the same pattern:
NFKD($string) =~ /(?=a)\X/
Now you match only LATIN SMALL LETTER A, leaving MODIFIER LETTER RIGHT
HALF RING left over and unmatched. So you would have to do the decomposition,
even and especially an NFKD decomposition, in the matching engine itself,
not beforehand. That's because you need to be able to group as one logical unit
anything that is produced by the decomposition.
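The failure can be seen with stdlib unicodedata. The sketch below
approximates grapheme-cluster starts by canonical combining class, which
is a simplification (real cluster boundaries come from UAX #29), but it
is enough to show that NFKD splits U+1E9A into two separate graphemes:

```python
import unicodedata

s = "\u1E9A"  # LATIN SMALL LETTER A WITH RIGHT HALF RING
print(unicodedata.normalize("NFD", s) == s)  # True: no canonical decomposition
nfkd = unicodedata.normalize("NFKD", s)      # 'a' + U+02BE MODIFIER LETTER RIGHT HALF RING

# U+02BE is a modifier *letter* (combining class 0), not a combining
# mark, so each code point starts its own grapheme cluster; a pattern
# like /(?=a)\X/ would match only the 'a' and leave the ring behind.
starts = [c for c in nfkd if unicodedata.combining(c) == 0]
print(len(starts))  # 2
```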
NFKD decomposition does allow you to match these:
a FF41 GC=Ll FULLWIDTH LATIN SMALL LETTER A
ª 00AA GC=Ll FEMININE ORDINAL INDICATOR
And even stuff like this:
㏂ 33C2 GC=So SQUARE AM
Although again this makes you wonder what the whole match
would be. If you match a "user-visible character" that
starts with "a", shouldn't you get the rest of that, too?
But there is still stuff that the user will perceive but
which even NFKD won't do for you. That's stuff like these:
æ 00E6 GC=Ll LATIN SMALL LETTER AE
ꜳ A733 GC=Ll LATIN SMALL LETTER AA
ꜵ A735 GC=Ll LATIN SMALL LETTER AO
ꜷ A737 GC=Ll LATIN SMALL LETTER AU
The user probably wants to be able to have those count as "a"
code points. We have no decomposition that will get you there.
This is unlike code points like these, which all decompose
to something with two letters:
ij 0133 GC=Ll LATIN SMALL LIGATURE IJ
ʼn 0149 GC=Ll LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
DZ 01F1 GC=Lu LATIN CAPITAL LETTER DZ
Dz 01F2 GC=Lt LATIN CAPITAL LETTER D WITH SMALL LETTER Z
dz 01F3 GC=Ll LATIN SMALL LETTER DZ
Which reminds me, this one is different:
ʣ 02A3 GC=Ll LATIN SMALL LETTER DZ DIGRAPH
But it seems *very* unlikely that there should be a user-perceived
difference between 01F3, which decomposes to "dz", and 02A3,
which does not.
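The asymmetry is easy to check with stdlib unicodedata:

```python
import unicodedata

lig = unicodedata.normalize("NFKD", "\u01F3")  # LATIN SMALL LETTER DZ
dig = unicodedata.normalize("NFKD", "\u02A3")  # LATIN SMALL LETTER DZ DIGRAPH
print(lig)  # 'dz' -- the 01F3 ligature decomposes
print(dig == "\u02A3")  # True -- the 02A3 digraph has no decomposition at all
```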
Only when you use the UCA for matching does this get sorted out.
For example, these all produce the same UCA2 values as "dz" produces:
DZ 01F1 GC=Lu LATIN CAPITAL LETTER DZ
Dz 01F2 GC=Lt LATIN CAPITAL LETTER D WITH SMALL LETTER Z
dz 01F3 GC=Ll LATIN SMALL LETTER DZ
ʣ 02A3 GC=Ll LATIN SMALL LETTER DZ DIGRAPH
and you cannot get there using NFKD. At UCA1, you of course get
all these producing the same sort keys as "dz" produces:
DŽ 01C4 GC=Lu LATIN CAPITAL LETTER DZ WITH CARON
Dž 01C5 GC=Lt LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
dž 01C6 GC=Ll LATIN SMALL LETTER DZ WITH CARON
DZ 01F1 GC=Lu LATIN CAPITAL LETTER DZ
Dz 01F2 GC=Lt LATIN CAPITAL LETTER D WITH SMALL LETTER Z
dz 01F3 GC=Ll LATIN SMALL LETTER DZ
ʣ 02A3 GC=Ll LATIN SMALL LETTER DZ DIGRAPH
If the user wants to match "dz", any of those should match, even the
last, which you can't get at with NFKD, just with UCA1 and UCA2.
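To make the gap concrete, here is a deliberately crude, hypothetical
stand-in for a strength-1 key (fake_primary is my name, and the recipe
is just case-fold, NFKD, strip combining marks; it is emphatically NOT
real UCA, which uses DUCET weight tables). It reproduces every
equivalence that decomposition can reach, and fails on exactly the one
the text says only the UCA can give you:

```python
import unicodedata

def fake_primary(s: str) -> str:
    """Crude approximation of a strength-1 key: case-fold, NFKD,
    drop combining marks. NOT the real UCA (no DUCET data)."""
    d = unicodedata.normalize("NFKD", s.casefold())
    return "".join(c for c in d if not unicodedata.combining(c))

for cp in ("\u01C4", "\u01F1", "\u01F2", "\u01F3", "\u02A3"):
    # True for the first four (all reachable via case folding + NFKD);
    # False for U+02A3, which only real UCA weights can equate to "dz".
    print(f"U+{ord(cp):04X}", fake_primary(cp) == "dz")
```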
Which brings us around to case two where we'll encounter ordering problems.
* Case two has the user wanting to match any grapheme starting with an "a"
but which also has a circumflex.
You are going to have to use grapheme mode again, of course, but there
are troubles. You can't just say:
NFD($string) =~ /(?=a\N{COMBINING CIRCUMFLEX ACCENT})\X/
or, more readably:
NFD($string) =~ m{
(?= a
\N{COMBINING CIRCUMFLEX ACCENT}
)
\X
}x
because although that will work for super simple cases like
LATIN SMALL LETTER A WITH CIRCUMFLEX
LATIN SMALL LETTER A
COMBINING CIRCUMFLEX ACCENT
it doesn't work when we get something whose combining class causes
an interposition between the "a" and the circumflex. We're safe
with this:
ẫ 1EAB GC=Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
because that "fortunately" decomposes into
LATIN SMALL LETTER A
COMBINING CIRCUMFLEX ACCENT
COMBINING TILDE
but what if you were looking for an "a" with a tilde?
Also, there are "infinitely" many combinations like this
grapheme:
LATIN SMALL LETTER A WITH TILDE
COMBINING CIRCUMFLEX ACCENT
which decomposes into
LATIN SMALL LETTER A
COMBINING TILDE
COMBINING CIRCUMFLEX ACCENT
And now we have a problem with our pattern. (Because COMBINING TILDE
and COMBINING CIRCUMFLEX ACCENT are of the same combining class, their
ordering matters: e.g., "a\x{303}\x{302}" is "ã̂" but "a\x{302}\x{303}"
is "ẫ", which should look different. They are not canonically
equivalent.) In fact, this can happen
even if you start with a single code point, like
ậ 1EAD GC=Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
which will canonically decompose into
LATIN SMALL LETTER A
COMBINING DOT BELOW
COMBINING CIRCUMFLEX ACCENT
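Both behaviors can be verified with stdlib unicodedata: canonical
ordering interposes the lower-class mark, while marks of the same class
stay put:

```python
import unicodedata

# U+1EAD decomposes to a + circumflex + dot below, but canonical
# ordering then sorts the dot below (ccc 220) ahead of the circumflex
# (ccc 230), interposing it between the "a" and the circumflex.
nfd = unicodedata.normalize("NFD", "\u1EAD")
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0061', 'U+0323', 'U+0302']

# Marks of the *same* combining class (tilde and circumflex are both
# ccc 230) are never reordered, so these two stay distinct:
print(unicodedata.normalize("NFD", "a\u0303\u0302") ==
      unicodedata.normalize("NFD", "a\u0302\u0303"))  # False
```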
so for all those situations, you now need something rather fancier:
NFD($string) =~ m{
(?= a
\p{Grapheme_Extend} *
\N{COMBINING CIRCUMFLEX ACCENT}
)
\X
}x
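A rough Python rendering of that pattern, substituting the basic
combining-diacritics block [\u0300-\u036F] for \p{Grapheme_Extend}
(an assumed simplification; the real property covers far more code
points), shows the intervening-mark case now matching:

```python
import re
import unicodedata

# After NFD: an 'a', any run of (basic) combining marks, then U+0302.
pat = re.compile("a[\u0300-\u036F]*\u0302")

for cp in ("\u1EAB", "\u1EAD", "\u00E3"):
    # a+circumflex+tilde, a+circumflex+dot-below, a+tilde
    nfd = unicodedata.normalize("NFD", cp)
    print(f"U+{ord(cp):04X}", bool(pat.search(nfd)))  # True, True, False
```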
Yet even that isn't good enough. Consider that all of these
have the same UCA1 as "ae" has:
Æ 00C6 GC=Lu LATIN CAPITAL LETTER AE
æ 00E6 GC=Ll LATIN SMALL LETTER AE
Ǣ 01E2 GC=Lu LATIN CAPITAL LETTER AE WITH MACRON
ǣ 01E3 GC=Ll LATIN SMALL LETTER AE WITH MACRON
Ǽ 01FC GC=Lu LATIN CAPITAL LETTER AE WITH ACUTE
ǽ 01FD GC=Ll LATIN SMALL LETTER AE WITH ACUTE
ᴭ 1D2D GC=Lm MODIFIER LETTER CAPITAL AE
◌ᷔ 1DD4 GC=Mn COMBINING LATIN SMALL LETTER AE
But that's just in non-locale UCA. With the German Phonebook
locale, all these have the same UCA1 as "ae" has:
Ä 00C4 GC=Lu LATIN CAPITAL LETTER A WITH DIAERESIS
Æ 00C6 GC=Lu LATIN CAPITAL LETTER AE
ä 00E4 GC=Ll LATIN SMALL LETTER A WITH DIAERESIS
æ 00E6 GC=Ll LATIN SMALL LETTER AE
Ǟ 01DE GC=Lu LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
ǟ 01DF GC=Ll LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
Ǣ 01E2 GC=Lu LATIN CAPITAL LETTER AE WITH MACRON
ǣ 01E3 GC=Ll LATIN SMALL LETTER AE WITH MACRON
Ǽ 01FC GC=Lu LATIN CAPITAL LETTER AE WITH ACUTE
ǽ 01FD GC=Ll LATIN SMALL LETTER AE WITH ACUTE
ᴭ 1D2D GC=Lm MODIFIER LETTER CAPITAL AE
◌ᷔ 1DD4 GC=Mn COMBINING LATIN SMALL LETTER AE
So, now how do you match an "a" and a circumflex? You might have
this grapheme:
LATIN SMALL LETTER AE WITH MACRON
COMBINING CIRCUMFLEX ACCENT
Unlike
dz 01F3 GC=Ll LATIN SMALL LETTER DZ
there is no NFKD that gives you access to the contraction.
In this way, LATIN SMALL LETTER AE is like
ʣ 02A3 GC=Ll LATIN SMALL LETTER DZ DIGRAPH
which also has no decomposition (despite being UCA1/UCA2 equiv
to "dz"). So you have to go to the UCA. And you have to modify
your pattern to do something like this, provided you want the
"a" first:
m{
(?= a
\p{Grapheme_Base} *
\p{Grapheme_Extend} *
\N{COMBINING CIRCUMFLEX ACCENT}
)
\X
}x
or like this if you don't care where the a is:
m{
(?=
\p{Grapheme_Base} *
a
\p{Grapheme_Base} *
\p{Grapheme_Extend} *
\N{COMBINING CIRCUMFLEX ACCENT}
)
\X
}x
That presupposes that \X will be able to keep together, as a whole
grapheme cluster, anything that began as a single grapheme cluster
before the NFKD and/or UCA transformation was applied.
And all this is because we're trying to go at things in ways
that don't surprise the user. If we're doing UCA cleverness
with letters, then we should probably consider doing it with
more than that. Except that the UCA doesn't really consider
all these the same (no surprise). However, with at least
some of them, I'm sure the user might:
^ 005E GC=Sk CIRCUMFLEX ACCENT
ˆ 02C6 GC=Lm MODIFIER LETTER CIRCUMFLEX ACCENT
◌̂ 0302 GC=Mn COMBINING CIRCUMFLEX ACCENT
◌̭ 032D GC=Mn COMBINING CIRCUMFLEX ACCENT BELOW
◌᷍ 1DCD GC=Mn COMBINING DOUBLE CIRCUMFLEX ABOVE
^ FF3E GC=Sk FULLWIDTH CIRCUMFLEX ACCENT
I hope this shows what can be done already, what cannot, and what
is really rather difficult. I think this shows some of the challenges
to meeting users' (perfectly reasonable) expectations of perceived characters.
We really are going to have to do more in the regex engine, although I'm not
completely certain what yet. I really hate the idea of having to recalculate
decompositions, let alone take UCA keys, again and again.
I'm afraid that one may have no choice, though. Imagine you have
these three strings:
crème brûlée
boîte
château
And you want to be able to support not just a search for
COMBINING CIRCUMFLEX ACCENT
which would get all three, but also
LATIN SMALL LETTER A WITH CIRCUMFLEX
which would get only the last one. As things currently
stand, you have to run every match twice, once on NFD
and once on NFC. I find this troublesome.
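The troublesome two-pass approach can be sketched as (Python stdlib
unicodedata again; the helper names are my own):

```python
import unicodedata

strings = ["cr\u00E8me br\u00FBl\u00E9e", "bo\u00EEte", "ch\u00E2teau"]

def matches_circumflex(s: str) -> bool:
    # First pass: search the NFD form for COMBINING CIRCUMFLEX ACCENT.
    return "\u0302" in unicodedata.normalize("NFD", s)

def matches_a_circumflex(s: str) -> bool:
    # Second pass: search the NFC form for the precomposed U+00E2.
    return "\u00E2" in unicodedata.normalize("NFC", s)

print([matches_circumflex(s) for s in strings])    # [True, True, True]
print([matches_a_circumflex(s) for s in strings])  # [False, False, True]
```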
Hope this helps!
--tom