<i18n dev> Proposed update to UTS#18

Tom Christiansen tchrist at perl.com
Fri Apr 15 13:46:18 PDT 2011


Andy Heninger <aheninger at google.com> wrote:

>>> I actually had to do this because I have a dataset that has things like
>>> "undeaðlich" and "smørrebrød", and I wanted to allow the user to
>>> head-match with "undead" and "smor", respectively.  There is no
>>> decomposition of "ð" that includes "d", nor any of "ø" that includes "o".
>>> But the UCA primary strengths are the same.  It worked very well.
>>
>>> It's a very useful feature, and I'm glad that tr18 includes mention of it.
>>> I just wish we could get it into our regex engines so I didn't have to
>>> do it all by hand. :)

> On Fri, Apr 15, 2011 at 8:01 AM, Mark Davis ☕ <mark at macchiato.com> wrote:

>> The biggest issue is that any transformation that changes the number of
>> characters, or rearranges them, is problematic, for the reasons outlined in
>> the PRI.
>>
>> An example might be /(a|b|c*(?=...)|...)(d|...|a)/, which for Danish (under
>> a collation transform, strength 2) should match any of {aa, aA,...å, Å,
>> Å,...}, as should  /(a|b|c*(?=...)|...)(d|...|\x{308})/
>>
>> What *is* relatively straightforward to do is to construct a regex
>> targeted at a known transformation (like NFC), and then transform the input
>> text. There will be some difficulties in mapping between indexes for
>> grouping, however. Most regex engines can't handle discontiguous groups
>> in their API.

> I suspect a match where the fundamental atomic unit of matching was grapheme
> clusters, or combining sequences, would produce useful results.

> No discontinuous results.  Results independent of normalization form, or
> lack of normalization, of the input.  No ability of the match to look inside
> of, or partially match, combining sequences.

> I also think that we should avoid making recommendations that haven't been
> implemented and proved to be useful and practical.

I agree we should look at existing practice to see what people have come
up with, and what does, and what does not, work.

Last night I wrote, and then deleted, a great deal of text talking about this,
and the solutions that I had found useful in practice.  I decided it was too
long and threw it all away.  All I really ended up saying is that UCA matches
at collation strengths 1 and 2 had proven useful for me.
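
For concreteness, here is a minimal sketch of that kind of loose matching,
using the searching methods of the Unicode::Collate module at strength 1
(which ignores accent and case differences), with the smørrebrød example
from above:

    use Unicode::Collate;
    binmode STDOUT, ':encoding(UTF-8)';

    # Strength (level) 1 compares primary weights only, so accent and case
    # differences are ignored; level 2 would keep the accents significant.
    my $coll = Unicode::Collate->new(level => 1);

    my $text = "sm\x{F8}rrebr\x{F8}d";             # smørrebrød

    # index() searches under the collator's rules; in list context it
    # returns the character position and length of the matched substring.
    my ($pos, $len) = $coll->index($text, "smor");
    if (defined $pos) {
        print "matched <", substr($text, $pos, $len), "> in <$text>\n";
    }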

There are two issues.  One relates to decomposition, the other to UCA comparisons.

Consider two situations.  The user will think in graphemes, so I will, too.
By grapheme, I mean a user-perceived "character".  

  * Case one has the user wanting to match any grapheme starting with an "a".

  * Case two has the user wanting to match any grapheme starting with an "a"
    but which also has a circumflex.

The first case appears to be reasonably easy; the second probably is not.
But I believe both are harder than they look.

The obvious thing to do for case one, and the thing we've likely all done, is
to use canonical decomposition.  That is "safe" because taking the NFD of a
string never changes the number of graphemes it contains.

    NFD($string) =~ /(?=a)\X/

or for embedding, then /(?:(?=a)\X)/; that's a "loose" match of an "a"
that works no matter whether it is in NFC or NFD or something else.

In fact, with case folding (for case insensitive matching) it even works 
for ANGSTROM SIGN, because that has an NFD that turns into a regular "A".
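
Spelled out as a tiny but complete sketch, assuming Unicode::Normalize for
the NFD() call and /i for the case-insensitive part:

    use Unicode::Normalize qw(NFD);
    binmode STDOUT, ':encoding(UTF-8)';

    # Any grapheme whose NFD begins with an "a" (or "A", under /i),
    # regardless of whether the input was NFC, NFD, or not normalized at all.
    for my $str ("\x{E5}", "a\x{30A}", "\x{212B}", "b") {  # å, a + ring, ANGSTROM SIGN, b
        if (NFD($str) =~ /((?=a)\X)/i) {
            print "matched grapheme <$1>\n";
        }
        else {
            print "no match\n";
        }
    }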

If you pre-NFD the string, the matching engine doesn't have to account 
for NFD-matching.  This breaks down in case two, though.

However, even then it is not as easy as that, because there are
many user-perceived characters it does not work with.  Some of these do
work with a compatibility decomposition, although others do not.
Even when NFKD "works", you now have the problem of one grapheme mapping to
multiple graphemes.  Consider:

 ẚ  1E9A GC=Ll LATIN SMALL LETTER A WITH RIGHT HALF RING

The NFD of that is the same, because it has no canonical decomposition.
Its compatibility decomposition uses a modifier letter, which is a
separate grapheme, so the NFKD is two graphemes:

    LATIN SMALL LETTER A
    MODIFIER LETTER RIGHT HALF RING

Now, you cannot blame the user for not knowing whether Unicode
happens to have an NFD for that which works, versus needing an NFKD.
What do you do about that "a" match again?  If you pre-NFKD it,
things don't work at all.  Look what happens with the same pattern:

    NFKD($string) =~ /(?=a)\X/

Now you match only LATIN SMALL LETTER A, leaving MODIFIER LETTER RIGHT
HALF RING unmatched.  So you would have to do the decomposition,
even and especially an NFKD decomposition, in the matching engine itself,
not beforehand.  That's because you need to be able to group as one logical
unit anything that is produced by the decomposition.
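
Here is a small sketch of that splitting, again assuming Unicode::Normalize;
the naive pattern grabs just the "a" and strands the modifier letter:

    use Unicode::Normalize qw(NFD NFKD);
    binmode STDOUT, ':encoding(UTF-8)';

    my $str = "\x{1E9A}";    # LATIN SMALL LETTER A WITH RIGHT HALF RING

    my @nfd_graphemes  = NFD($str)  =~ /(\X)/g;   # one grapheme: U+1E9A itself
    my @nfkd_graphemes = NFKD($str) =~ /(\X)/g;   # two graphemes: "a", then U+02BE

    printf "NFD:  %d grapheme(s)\n", scalar @nfd_graphemes;
    printf "NFKD: %d grapheme(s)\n", scalar @nfkd_graphemes;

    if (NFKD($str) =~ /((?=a)\X)/) {
        print "the naive pattern matches only <$1>, stranding the half ring\n";
    }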

NFKD decomposition does allow you to match these:

 a FF41 GC=Ll FULLWIDTH LATIN SMALL LETTER A
 ª  00AA GC=Ll FEMININE ORDINAL INDICATOR

And even stuff like this:

 ㏂ 33C2 GC=So SQUARE AM

Although again this makes you wonder what the whole match
would be.  If you match a "user-visible character" that 
starts with "a", shouldn't you get the rest of that, too?

But there is still stuff that the user will perceive but 
which even NFKD won't do for you.  That's stuff like these:

 æ  00E6 GC=Ll LATIN SMALL LETTER AE
 ꜳ  A733 GC=Ll LATIN SMALL LETTER AA
 ꜵ  A735 GC=Ll LATIN SMALL LETTER AO
 ꜷ  A737 GC=Ll LATIN SMALL LETTER AU

The user probably wants those to count as graphemes that start with an "a".
We have no decomposition that will get you there.
This is unlike code points like these, which all decompose
to something with two letters:


 ij  0133 GC=Ll LATIN SMALL LIGATURE IJ
 ʼn  0149 GC=Ll LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
 DZ  01F1 GC=Lu LATIN CAPITAL LETTER DZ
 Dz  01F2 GC=Lt LATIN CAPITAL LETTER D WITH SMALL LETTER Z
 dz  01F3 GC=Ll LATIN SMALL LETTER DZ

Which reminds me, this one is different:

 ʣ  02A3 GC=Ll LATIN SMALL LETTER DZ DIGRAPH

But it seems *very* unlikely that there should be any user-perceived
difference between 01F3, which decomposes to "dz", and 02A3,
which does not.
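
To make that concrete, a quick sketch of what NFKD does and does not hand
you here:

    use charnames ();
    use Unicode::Normalize qw(NFKD);
    binmode STDOUT, ':encoding(UTF-8)';

    # 01F3 has a compatibility decomposition to "dz"; 02A3 and A733 have
    # no decomposition at all, so NFKD leaves them alone.
    for my $cp (0x01F3, 0x02A3, 0xA733) {
        my $chr  = chr $cp;
        my $nfkd = NFKD($chr);
        printf "U+%04X %s: NFKD is <%s>%s\n",
            $cp, charnames::viacode($cp), $nfkd,
            $nfkd eq $chr ? " (unchanged)" : "";
    }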

Only when you use the UCA for matching does this get sorted out.
For example, these all produce the same UCA2 values as "dz" produces:

 DZ  01F1 GC=Lu LATIN CAPITAL LETTER DZ
 Dz  01F2 GC=Lt LATIN CAPITAL LETTER D WITH SMALL LETTER Z
 dz  01F3 GC=Ll LATIN SMALL LETTER DZ
 ʣ  02A3 GC=Ll LATIN SMALL LETTER DZ DIGRAPH

and you cannot get there using NFKD.  At UCA1, you of course get 
all these producing the same sort keys as "dz" produces:

 DŽ  01C4 GC=Lu LATIN CAPITAL LETTER DZ WITH CARON
 Dž  01C5 GC=Lt LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
 dž  01C6 GC=Ll LATIN SMALL LETTER DZ WITH CARON
 DZ  01F1 GC=Lu LATIN CAPITAL LETTER DZ
 Dz  01F2 GC=Lt LATIN CAPITAL LETTER D WITH SMALL LETTER Z
 dz  01F3 GC=Ll LATIN SMALL LETTER DZ
 ʣ  02A3 GC=Ll LATIN SMALL LETTER DZ DIGRAPH

If the user wants to match "dz", any of those should match, even the 
last, which you can't get at with NFKD, just with UCA1 and UCA2.
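
A sketch of that comparison using Unicode::Collate's eq() method (the
collator's equality, not Perl's):

    use Unicode::Collate;
    binmode STDOUT, ':encoding(UTF-8)';

    # Per the lists above, all of these compare equal to "dz" at strength 1,
    # and all but the caron form do so at strength 2 as well.
    for my $level (1, 2) {
        my $coll = Unicode::Collate->new(level => $level);
        for my $cp (0x01F1, 0x01F3, 0x02A3, 0x01C6) {   # DZ, dz, dz digraph, dž with caron
            printf "level %d: U+%04X %s \"dz\"\n",
                $level, $cp, $coll->eq(chr $cp, "dz") ? "==" : "!=";
        }
    }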

Which brings us around to case two where we'll encounter ordering problems.

  * Case two has the user wanting to match any grapheme starting with an "a"
    but which also has a circumflex.

You are going to have to use grapheme mode again, of course, but there 
are troubles.  You can't just say:

    NFD($string) =~ /(?=a\N{COMBINING CIRCUMFLEX ACCENT})\X/ 

or, more readably:

    NFD($string) =~ m{
        (?= a 
            \N{COMBINING CIRCUMFLEX ACCENT}
        )
        \X
    }x 

because although that will work for super simple cases like

    LATIN SMALL LETTER A WITH CIRCUMFLEX

    LATIN SMALL LETTER A
    COMBINING CIRCUMFLEX ACCENT

it doesn't work when we get something whose combining class causes
an interposition between the "a" and the circumflex.    We're safe 
with this:

 ẫ  1EAB GC=Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE

because that "fortunately" decomposes into 

    LATIN SMALL LETTER A
    COMBINING CIRCUMFLEX ACCENT
    COMBINING TILDE

but what if you were looking for an "a" with a tilde?  
Also, there are "infinitely" many combinations like this
grapheme:

    LATIN SMALL LETTER A WITH TILDE
    COMBINING CIRCUMFLEX ACCENT

which decomposes into

    LATIN SMALL LETTER A
    COMBINING TILDE
    COMBINING CIRCUMFLEX ACCENT

And now we have a problem with our pattern.  (Because COMBINING TILDE
and COMBINING CIRCUMFLEX ACCENT are of the same combining class, their
ordering matters: e.g., "a\x{303}\x{302}" is "ã̂" but "a\x{302}\x{303}"
is "ẫ", which should look different.  They are not canonically
equivalent.)  In fact, this can happen
even if you start with a single code point, like

 ậ  1EAD GC=Ll LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW

which will canonically decompose into 

    LATIN SMALL LETTER A
    COMBINING DOT BELOW
    COMBINING CIRCUMFLEX ACCENT

so for all those situations, you now need something rather fancier:

    NFD($string) =~ m{
        (?= a 
            \p{Grapheme_Extend} *
            \N{COMBINING CIRCUMFLEX ACCENT}
        )
        \X
    }x 
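
As a quick sanity check, here is that pattern run over the problem cases
above; this is only a sketch, assuming Unicode::Normalize and charnames:

    use charnames ':full';
    use Unicode::Normalize qw(NFD);
    binmode STDOUT, ':encoding(UTF-8)';

    my $a_circumflex_grapheme = qr{
        (?= a
            \p{Grapheme_Extend} *
            \N{COMBINING CIRCUMFLEX ACCENT}
        )
        \X
    }x;

    # U+1EAB (circumflex before tilde), a + tilde + circumflex (tilde first),
    # U+1EAD (whose dot below reorders ahead of the circumflex under NFD),
    # and a plain a-acute as a negative control.
    for my $str ("\x{1EAB}", "a\x{303}\x{302}", "\x{1EAD}", "a\x{301}") {
        printf "<%s> %s\n", $str,
            NFD($str) =~ $a_circumflex_grapheme ? "matches" : "does not match";
    }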

Yet even that isn't good enough.   Consider that all of these
have the same UCA1 as "ae" has:

 Æ  00C6 GC=Lu LATIN CAPITAL LETTER AE
 æ  00E6 GC=Ll LATIN SMALL LETTER AE
 Ǣ  01E2 GC=Lu LATIN CAPITAL LETTER AE WITH MACRON
 ǣ  01E3 GC=Ll LATIN SMALL LETTER AE WITH MACRON
 Ǽ  01FC GC=Lu LATIN CAPITAL LETTER AE WITH ACUTE
 ǽ  01FD GC=Ll LATIN SMALL LETTER AE WITH ACUTE
 ᴭ  1D2D GC=Lm MODIFIER LETTER CAPITAL AE
 ◌ᷔ  1DD4 GC=Mn COMBINING LATIN SMALL LETTER AE

But that's just in non-locale UCA.  With the German Phonebook
locale, all these have the same UCA1 as "ae" has:

 Ä  00C4 GC=Lu LATIN CAPITAL LETTER A WITH DIAERESIS
 Æ  00C6 GC=Lu LATIN CAPITAL LETTER AE
 ä  00E4 GC=Ll LATIN SMALL LETTER A WITH DIAERESIS
 æ  00E6 GC=Ll LATIN SMALL LETTER AE
 Ǟ  01DE GC=Lu LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON
 ǟ  01DF GC=Ll LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
 Ǣ  01E2 GC=Lu LATIN CAPITAL LETTER AE WITH MACRON
 ǣ  01E3 GC=Ll LATIN SMALL LETTER AE WITH MACRON
 Ǽ  01FC GC=Lu LATIN CAPITAL LETTER AE WITH ACUTE
 ǽ  01FD GC=Ll LATIN SMALL LETTER AE WITH ACUTE
 ᴭ  1D2D GC=Lm MODIFIER LETTER CAPITAL AE
 ◌ᷔ  1DD4 GC=Mn COMBINING LATIN SMALL LETTER AE
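
A sketch of that tailored comparison, assuming the Unicode::Collate::Locale
module and its "de__phonebook" tailoring:

    use Unicode::Collate;
    use Unicode::Collate::Locale;
    binmode STDOUT, ':encoding(UTF-8)';

    my $ducet     = Unicode::Collate->new(level => 1);
    my $phonebook = Unicode::Collate::Locale->new(locale => 'de__phonebook',
                                                  level  => 1);

    # U+00E6 LATIN SMALL LETTER AE, then U+00E4 A WITH DIAERESIS: the first
    # is primary-equal to "ae" either way, the second only under the
    # phonebook tailoring.
    for my $cp (0x00E6, 0x00E4) {
        printf "U+%04X vs \"ae\": plain UCA %s, de__phonebook %s\n", $cp,
            $ducet->eq(chr $cp, "ae")     ? "equal" : "not equal",
            $phonebook->eq(chr $cp, "ae") ? "equal" : "not equal";
    }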

So, now how do you match an "a" and a circumflex?  You might have
this grapheme:

 LATIN SMALL LETTER AE WITH MACRON
 COMBINING CIRCUMFLEX ACCENT

Unlike 

 dz  01F3 GC=Ll LATIN SMALL LETTER DZ

there is no NFKD that gives you access to the letters it contains.
In this way, LATIN SMALL LETTER AE is like

 ʣ  02A3 GC=Ll LATIN SMALL LETTER DZ DIGRAPH

which also has no decomposition (despite being UCA1/UCA2 equiv
to "dz").  So you have to go to the UCA.  And you have to modify
your pattern to do something like this, provided you want the 
"a" first:

    m{
        (?= a 
            \p{Grapheme_Base}   *
            \p{Grapheme_Extend} *
            \N{COMBINING CIRCUMFLEX ACCENT}
        )
        \X
    }x 

or like this if you don't care where the a is:

    m{
        (?= 
            \p{Grapheme_Base}   *
            a
            \p{Grapheme_Base}   *
            \p{Grapheme_Extend} *
            \N{COMBINING CIRCUMFLEX ACCENT}
        )
        \X
    }x 

That presupposes that \X can keep together, as a single grapheme cluster,
anything that began as a single grapheme cluster before you applied the
NFKD and/or UCA transformation.

And all this is because we're trying to go at things in ways
that don't surprise the user.  If we're doing UCA cleverness
with letters, then we should probably consider doing it with 
more than that.  Except that the UCA doesn't really consider 
all these the same (no surprise).  However, with at least
some of them, I'm sure the user might:

 ^  005E GC=Sk CIRCUMFLEX ACCENT
 ˆ  02C6 GC=Lm MODIFIER LETTER CIRCUMFLEX ACCENT
 ◌̂  0302 GC=Mn COMBINING CIRCUMFLEX ACCENT
 ◌̭  032D GC=Mn COMBINING CIRCUMFLEX ACCENT BELOW
 ◌᷍  1DCD GC=Mn COMBINING DOUBLE CIRCUMFLEX ABOVE
 ^ FF3E GC=Sk FULLWIDTH CIRCUMFLEX ACCENT

I hope this shows what can be done already, what cannot, and what
is really rather difficult, and some of the challenges in meeting users'
(perfectly reasonable) expectations about perceived characters.

We really are going to do more in the regex engine, although I'm not
completely certain what yet.  I really hate the idea of having to recalculate
decompositions, let alone take UCA keys, again and again. 

I'm afraid that one may have no choice, though.  Imagine you have 
these three strings:

    crème brûlée
    boîte
    château

And you want to be able to support not just a search for

    COMBINING CIRCUMFLEX ACCENT

which would get all three, but also

    LATIN SMALL LETTER A WITH CIRCUMFLEX

which would get only the last one.  As things currently
stand, you have to run every match twice, once on NFD
and once on NFC.  I find this troublesome.
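
The brute-force workaround looks roughly like this sketch, with a
hypothetical match_either_form() helper just to show the doubled work:

    use charnames ':full';
    use Unicode::Normalize qw(NFC NFD);
    binmode STDOUT, ':encoding(UTF-8)';

    # Hypothetical helper: a pattern written with a combining mark can only
    # hit the NFD form, and one written with a precomposed character can
    # only hit the NFC form, so try the match against both.
    sub match_either_form {
        my ($string, $pattern) = @_;
        return NFC($string) =~ $pattern || NFD($string) =~ $pattern;
    }

    my @strings = ("cr\x{E8}me br\x{FB}l\x{E9}e", "bo\x{EE}te", "ch\x{E2}teau");

    for my $str (@strings) {
        printf "%s: circumflex %s, a-circumflex %s\n", $str,
            match_either_form($str, qr/\N{COMBINING CIRCUMFLEX ACCENT}/)
                ? "yes" : "no",
            match_either_form($str, qr/\N{LATIN SMALL LETTER A WITH CIRCUMFLEX}/)
                ? "yes" : "no";
    }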

Hope this helps!

--tom

