<i18n dev> Proposed update to UTS#18

Mark Davis ☕ mark at macchiato.com
Fri Apr 15 08:01:22 PDT 2011


The biggest issue is that any transformation that changes the number of
characters, or rearranges them, is problematic, for the reasons outlined in
the PRI.
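
As a small illustration of both effects (not taken from the PRI itself), the sketch below uses Python's standard `unicodedata` module with NFD as the transformation. The variable names are only illustrative.

```python
import unicodedata

# Length change: precomposed U+00E5 (a with ring above) decomposes under NFD
# into "a" followed by U+030A COMBINING RING ABOVE.
a_ring = "\u00e5"
a_ring_nfd = unicodedata.normalize("NFD", a_ring)
length_before, length_after = len(a_ring), len(a_ring_nfd)

# Reordering: dot above (combining class 230) typed before dot below
# (combining class 220). Canonical ordering swaps the two marks.
typed = "q\u0307\u0323"
reordered = unicodedata.normalize("NFD", typed)
marks_swapped = reordered == "q\u0323\u0307"
```

In the first case an index into the original text no longer lines up with the transformed text; in the second, the characters a pattern sees are in a different order from the ones the user typed.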

An example might be /(a|b|c*(?=...)|...)(d|...|a)/, which for Danish (under
a collation transform, strength 2) should match any of {aa, aA,...å, Å,
Å,...}, as should  /(a|b|c*(?=...)|...)(d|...|\x{308})/

What *is* relatively straightforward to do is to construct a regex
targeted at a known transformation (like NFC), and then transform the input
text. There will be some difficulties in mapping between indexes for
grouping, however. Most regex engines can't handle discontiguous groups in
their API.
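
A minimal sketch of that approach in Python, assuming the pattern contains only literal characters (so it can itself be passed through NFC) and assuming no composition happens across segment boundaries; the code checks the second assumption and gives up if it fails. The helper names `nfc_with_offsets` and `search_nfc` are hypothetical, not part of any regex API.

```python
import re
import unicodedata

def nfc_with_offsets(text):
    """Return (nfc_text, offsets); offsets[i] is the (start, end) span in
    `text` of the segment that produced nfc_text[i]."""
    out, offsets = [], []
    i = 0
    while i < len(text):
        j = i + 1
        # A segment is one character plus any following combining marks.
        while j < len(text) and unicodedata.combining(text[j]) != 0:
            j += 1
        piece = unicodedata.normalize("NFC", text[i:j])
        out.append(piece)
        offsets.extend([(i, j)] * len(piece))
        i = j
    nfc_text = "".join(out)
    # Simplifying assumption: nothing composes across segment boundaries
    # (Hangul jamo, for example, would violate this).
    if nfc_text != unicodedata.normalize("NFC", text):
        raise ValueError("segment-wise NFC differs from whole-string NFC")
    return nfc_text, offsets

def search_nfc(pattern, text):
    """Search NFC(text) and map the match back to a span in the original text."""
    nfc_text, offsets = nfc_with_offsets(text)
    m = re.search(unicodedata.normalize("NFC", pattern), nfc_text)
    if m is None or m.start() == m.end():
        return None
    # Widen the match to whole original segments.
    return offsets[m.start()][0], offsets[m.end() - 1][1]

original = "Ga\u030ard"               # "a" + combining ring, 5 code points
span = search_nfc("\u00e5r", original)  # pattern uses precomposed U+00E5
```

Here `span` is `(1, 4)`, covering "a", the combining ring, and "r" in the original. The sketch always reports one contiguous span widened to segment boundaries, so it sidesteps rather than solves the discontiguous-group problem.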

Mark

*— Il meglio è l’inimico del bene —*


On Thu, Apr 14, 2011 at 23:50, Tom Christiansen <tchrist at perl.com> wrote:

> Thanks, Mark.
>
> I've been trying to think about what to say to it.
>
> I'd like to know more about what is planned in the "canonical matching" area.
> I do understand why reordering makes exact matching impossible.  However,
> I should think one of several sorts of loose matching might still be done.
> Maybe that requires level 3, though.
>
> Mostly though I've been thinking about case insensitivity.  I feel that
> the current Unicode case mapping strategy is much weaker than what the
> spirit of the thing really calls for.  It's weak because it doesn't do as
> much as it could.
>
> I have played around with one approach that gives user-desirable results,
> and also addresses the canonical issue.  The synopsis is that I think RL3.4
> would cut the Gordian Knot of combining marks (at level 1 they're ignored)
> and do something genuinely useful by creating, at a level 1 comparison,
> much more of the sort of case insensitivity users want than anything
> currently available.
>
> That's what RL3.4 Tailored Loose Match is about:
>
>    To meet this requirement, an implementation shall provide for loose
>    matches based on a locale's collation order, with at least 3 levels.
>
> And tr10's section 8 on Searching and Matching and 8.1 Collation Folding
> also talk about these things.
>
>    Matching can be done by using the collation elements, directly, as
>    discussed above. However, because matching does not use any of the
>    ordering information, the same result can be achieved by a folding.
>    That is, two strings would fold to the same string if and only if they
>    would match according to the (tailored) collation. For example, a
>    folding for a Danish collation would map both "Gård" and "gaard" to
>    the same value. A folding for a primary-strength folding would map
>    "Resume" and "résumé" to the same value. That folded value is
>    typically a lowercase string, such as "resume".
>
> I actually had to do this because I have a dataset that has things like
> "undeaðlich" and "smørrebrød", and I wanted to allow the user to
> head-match with "undead" and "smor", respectively.  There is no
> decomposition of "ð" that includes "d", nor any of "ø" that includes "o".
> But the UCA primary strengths are the same.  It worked very well.
>
> It's a very useful feature, and I'm glad that tr18 includes mention of it.
> I just wish we could get it into our regex engines so I didn't have to
> do it all by hand. :)
>
> -tom
>
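
The head-matching described in the quoted message can be approximated with a folding, roughly as sketched below. This is not the code from the message and not a real UCA implementation: the table `PRIMARY_EQUIVALENTS` is a hypothetical stand-in that covers only the two letters mentioned there, where a real implementation would derive the folding from the (tailored) collation table. Tailorings such as Danish "aa" are not handled.

```python
import unicodedata

# Hypothetical stand-in for UCA primary weights: only the two letters from
# the quoted message are listed here.
PRIMARY_EQUIVALENTS = {"ð": "d", "ø": "o"}

def fold_primary(s):
    """Approximate primary-strength folding: ignore case and combining marks,
    then apply the small equivalence table above."""
    decomposed = unicodedata.normalize("NFD", s.casefold())
    kept = (c for c in decomposed if unicodedata.category(c) != "Mn")
    return "".join(PRIMARY_EQUIVALENTS.get(c, c) for c in kept)

def head_match(query, candidate):
    """True if the folded candidate starts with the folded query."""
    return fold_primary(candidate).startswith(fold_primary(query))

undead_ok = head_match("undead", "undeaðlich")
smor_ok = head_match("smor", "smørrebrød")
same_fold = fold_primary("Resume") == fold_primary("résumé")
```

With this table both head-matches succeed, and "Resume" and "résumé" fold to the same string, "resume".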