<i18n dev> Proposed update to UTS#18
Tom Christiansen
tchrist at perl.com
Fri Apr 15 16:46:18 PDT 2011
I hope you all know there is a lot of handwaving at the end of my last
posting. :) That's because it isn't actually implementable as things stand.
There's no current way to track what was a single grapheme before the regex
gets its hands on it if that regex engine is doing some sort of decomposition.
That means you can overshoot past the bounds of your \X.
You need to keep track of the indices/extents of things as they were before
you did whatever decomposition you needed to operate on. If something
requires you to run in NFKD mode, for example, you can't allow a decomposed
ij pair to suddenly start counting as two graphemes if it started out as one.
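To make the bookkeeping concrete, here is a rough sketch in Python of what
"keeping track of extents" could look like. Everything in it is my own
illustration, not something any engine does today: the helper names are made
up, and the cluster splitter is a crude "base plus following marks"
approximation rather than real UAX #29 segmentation.

```python
import unicodedata

def rough_clusters(text):
    """Return (start, end) extents: one non-mark code point plus trailing marks.

    ASSUMPTION: this is only a crude stand-in for real grapheme segmentation.
    """
    extents = []
    start = 0
    for i in range(1, len(text)):
        if not unicodedata.category(text[i]).startswith("M"):
            extents.append((start, i))
            start = i
    if text:
        extents.append((start, len(text)))
    return extents

def decompose_with_extents(text, form="NFKD"):
    """Normalize cluster by cluster, remembering where each result came from.

    Returns (decomposed, owners), where owners[k] is the original extent
    that produced decomposed[k].
    """
    pieces = []
    owners = []
    for start, end in rough_clusters(text):
        piece = unicodedata.normalize(form, text[start:end])
        pieces.append(piece)
        owners.extend([(start, end)] * len(piece))
    return "".join(pieces), owners

def match_respects_extents(owners, m_start, m_end):
    """True if [m_start, m_end) in the decomposed text starts and ends
    on a boundary between original clusters."""
    n = len(owners)
    starts_ok = m_start in (0, n) or owners[m_start] != owners[m_start - 1]
    ends_ok = m_end in (0, n) or owners[m_end] != owners[m_end - 1]
    return starts_ok and ends_ok

# U+0133 LATIN SMALL LIGATURE IJ followed by "s"
decomposed, owners = decompose_with_extents("\u0133s")
# A match covering only the "i" ends inside what was one original character.
split_match_ok = match_respects_extents(owners, 0, 1)
# A match covering "ij" lines up with the original extent.
whole_match_ok = match_respects_extents(owners, 0, 2)
```

With the ligature input, the match over just "i" is rejected and the match
over "ij" is accepted, which is the check a \X would need to make after
decomposition.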
That's one reason why NFD is more attractive. But even NFD and NFC can
reorder marks in ways that make the most natural approach fail. You can't
match out-of-order elements, or leave things hanging. A regex is a set of
sequential rules, which have to be applied sequentially to sequential text.
Logical ordering exceptions are very difficult to deal with.
I think that is part of what Mark was referring to.
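A small Python illustration of the reordering problem, as a sketch only. The
sample string and the function name are mine; the only library facts relied
on are that COMBINING DOT ABOVE (U+0307) has combining class 230 and
COMBINING DOT BELOW (U+0323) has class 220, so canonical ordering swaps them.

```python
import re
import unicodedata

def literal_found(needle, haystack):
    """Plain sequential match of needle's code points inside haystack."""
    return re.search(re.escape(needle), haystack) is not None

# q + dot above + dot below, in the order the user typed them
original = "q\u0307\u0323"
nfd = unicodedata.normalize("NFD", original)
nfc = unicodedata.normalize("NFC", original)

# The pattern asks for q immediately followed by dot above.
needle = "q\u0307"
found_in_original = literal_found(needle, original)
found_in_nfd = literal_found(needle, nfd)
found_in_nfc = literal_found(needle, nfc)
```

The literal is found in the original string but not in either normalized
form, because in both of them the dot below now sits between the q and the
dot above.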
All of which helps explain why "canonical matching" is not currently
implementable. I think that for UCA1 it isn't trouble because you don't
count Marks (or, I suspect, Grapheme_Extend in general). But above that
things get odder. I may be wrong, though.
Returning to Java, the CANON_EQ Pattern flag doesn't really do what people
expect it to. It doesn't solve any of these problems as far as I can see.
--tom