<i18n dev> Codereview Request: 7039066 j.u.regex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

Tom Christiansen tchrist at perl.com
Sun Apr 24 20:47:50 PDT 2011


Xueming, the docs look good.  

On the name of the flag, I have no strong feelings one way or the other.
Between UNICODE_PROPERTIES and UNICODE_CLASSES, though, I would perhaps
prefer the second.  The first makes me think of the regular properties like
\p{Script=Greek} from RL1.2, not the compat properties from RL1.2a like
\w and \s.  (I realize that the POSIX-compat stuff overlaps a bit.)
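
Just to make the distinction concrete, a quick sketch; it assumes the
\p{script=...} syntax from this changeset:

    import java.util.regex.Pattern;

    public class PropertyFlavors {
        public static void main(String[] args) {
            // RL1.2 "Properties": real Unicode properties queried by name and
            // value, e.g. Script=Greek (syntax assumed from this changeset).
            Pattern greek = Pattern.compile("\\p{script=Greek}+");
            System.out.println(greek.matcher("\u03B1\u03B2\u03B3").matches()); // true

            // RL1.2a "Compatibility Properties": the shorthand classes \w, \s,
            // \d, ... which are what the new flag would switch over to Unicode
            // semantics.
            Pattern word = Pattern.compile("\\w+");
            System.out.println(word.matcher("abc_123").matches());             // true
        }
    }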

One thing you might want to look at longer term is whether all 11
properties listed in RL1.2 will be accessible.  I don't know whether
you have methods in Character for #7 and #8 below; I seem to recall
seeing one but not the other.  (A rough sketch of #7 follows the
excerpt below, in case it's useful.)

    RL1.2       Properties
        To meet this requirement, an implementation shall provide
        at least a minimal list of properties, consisting of the
        following:

             yes   1  General_Category
             yes   2  Script
             yes   3  Alphabetic
             yes   4  Uppercase
             yes   5  Lowercase
             yes   6  White_Space
              ?    7  Noncharacter_Code_Point
              ?    8  Default_Ignorable_Code_Point
              ?    9  ANY
             yes  10  ASCII
              ?   11  ASSIGNED

    RL1.2a      Compatibility Properties
        To meet this requirement, an implementation shall provide the
        properties listed in Annex C. Compatibility Properties, with the
        property values as listed there. Such an implementation shall
        document whether it is using the Standard Recommendation or
        POSIX-compatible properties.
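
In case Character turns out not to have a direct query for #7, that
property is small enough to sketch by hand, since Unicode defines exactly
66 noncharacters.  #8, Default_Ignorable_Code_Point, is derived from
several other properties and really wants the UCD data behind it.

    public class Noncharacters {
        // Sketch of Noncharacter_Code_Point: U+FDD0..U+FDEF plus the last two
        // code points of every plane (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
        static boolean isNoncharacterCodePoint(int cp) {
            return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
        }

        public static void main(String[] args) {
            System.out.println(isNoncharacterCodePoint(0xFDD0));    // true
            System.out.println(isNoncharacterCodePoint(0x10FFFF));  // true
            System.out.println(isNoncharacterCodePoint(0x1D400));   // false
        }
    }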

Other things...

I've thought a bit about whether it's worth pointing out that \x{h..h}
is (currently?) the *only* way to get non-BMP code points into a
bracketed character class, for example [\x{1D400}-\x{1D419}] to
mean MATHEMATICAL BOLD CAPITAL A through MATHEMATICAL BOLD CAPITAL Z.

Reasons for not mentioning it include how rarely (I imagine) users come
across it, and also that this is one of those rare places in Java where
you can't treat UTF-16 code units separately and get the same results.
This matters for interpolation, because you can never build up a character
class by using [ ] to surround the two 16-bit char units that a non-BMP
code point turns into.  You always have to use the indirect \x{h..h}
instead.  This is rather non-obvious.
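
A tiny illustration, assuming the \x{h..h} escape from this changeset:

    import java.util.regex.Pattern;

    public class SupplementaryRange {
        public static void main(String[] args) {
            // The bracketed class spans MATHEMATICAL BOLD CAPITAL A..Z by
            // code point.
            Pattern bold = Pattern.compile("[\\x{1D400}-\\x{1D419}]");

            // U+1D400 is the surrogate pair \uD835\uDC00 as Java chars, but
            // the class still sees it as a single code point.
            String boldA = new StringBuilder().appendCodePoint(0x1D400).toString();
            System.out.println(bold.matcher(boldA).matches());   // true
        }
    }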

However, after consideration I think it is probably not worth risking
confusing people by talking about it.  I'm just very glad it can now
be done.  Having to figure out pieces of UTF-16 is no fun.

Once the current effort is done and you've had a well-deserved rest, 
I know you were thinking about \N{NAME} and \X for a future version 
of the JDK.  Both are important, although for quite different reasons.

\N{NAME} is important for helping make regex code more maintainable by
being self-documenting, since having to put raw "magic" numbers instead of
symbolic names in code is always bad.  You'll certainly want to somehow
make that available for Strings, too; not sure how to do that.  The regex
string escapes and the Java String escapes have already diverged, and I
don't know how that happened.  For example, the rules for octal escapes
differ, and the regex engine supports things that Java proper does not;
the "\cA" style comes to mind, but I think there are a few others, too.
And now there is "\x{h..h}" too; pity Strings don't know that one.
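
A couple of the regex-only escapes for illustration; the \N{NAME} line is
of course hypothetical:

    import java.util.regex.Pattern;

    public class EscapeDivergence {
        public static void main(String[] args) {
            // \cA (CONTROL-A) exists only on the regex side; Java string
            // literals have no \c escape at all.
            System.out.println(Pattern.matches("\\cA", "\u0001"));             // true

            // Likewise \x{h..h} from this changeset; the string literal still
            // has to spell out the surrogate pair by hand.
            System.out.println(Pattern.matches("\\x{1D400}", "\uD835\uDC00")); // true

            // A hypothetical \N{NAME} would let the pattern document itself:
            //   Pattern.compile("\\N{MATHEMATICAL BOLD CAPITAL A}")  // not today
        }
    }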

\X is important because you really do need access to graphemes.  Otherwise
it is very difficult to write the equivalent of (?:(?=[aeiou])\X), which,
assuming NFD, will match a grapheme that starts with one of those five
letters.  More importantly, you have to be able to chunk text by graphemes,
and you need to do this even if you never make a way to tie \b to the
fancier sense that ICU's (?w) UREGEX_UWORD flag provides.

Getting grapheme clusters right is harder than it might appear.  A
careful reading of UAX#29 is important.  There are two kinds of grapheme
clusters, legacy and extended.  The extended version is tricky to
get right, especially when you don't have access to all the syllable
type properties. One problem with the legacy version is that it breaks
up things that it shouldn't.  We switched to the extended version for
the 5.12 release of Perl, as this shows:

    $ perl5.10.0 -le 'print "\r\n" =~ /\A\X\z/ ? 1 : 0'
    0

    $ perl5.12.3 -le 'print "\r\n" =~ /\A\X\z/ ? 1 : 0'
    1

Which is the way it really needs to be.  For the legacy sense, you can
always still code that one up more explicitly:

    $ perl5.12.3 -le 'print "\r\n" =~ /\A\p{Grapheme_Base}\p{Grapheme_Extend}*\z/ ? 1 : 0'
    0

But I don't think people often want that version; extended is much better.
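
For what it's worth, the closest j.u.regex can come today is the usual
\P{M}\p{M}* approximation, which behaves like the legacy (perl5.10.0)
sense above on CRLF:

    import java.util.regex.Pattern;

    public class LegacyGraphemeApprox {
        public static void main(String[] args) {
            // Rough stand-in for a legacy grapheme cluster: one non-mark
            // followed by any number of marks.  Like the perl5.10.0 \X above,
            // it refuses to treat CRLF as a single unit.
            System.out.println(Pattern.matches("\\P{M}\\p{M}*", "\r\n"));     // false

            // It does keep a base character together with a combining mark.
            System.out.println(Pattern.matches("\\P{M}\\p{M}*", "e\u0301"));  // true
        }
    }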

Giving Java access to the properties needed for either version of grapheme
clusters may be a good moment to reconsider how you check whether a code
point has any particular Unicode property.  There are performance issues,
of course, but also the current mechanism does not seem easily extended to
support the full complement of Unicode properties (which is a new Level-2
RL).

So if you are going to widen those again so that the properties you need
to support grapheme clusters are at your fingertips, it might be worth
thinking about whether to refactor now.  Performance and the
size of tables will someday become an issue, if not now.  I do know the
Unicode docs mention this concern; I also know that we have people
rethinking how Perl grants access to properties, because even though we
do give you all of them, it could be done more tidily.

I haven't thought much about what the non-regex interface to graphemes and
such should look like.  Besides the ICU stuff, you might want to take a
look at these two classes, just to see the sorts of things others are
doing in the non-regex grapheme arena:

    http://search.cpan.org/perldoc?Unicode::GCString
    http://search.cpan.org/perldoc?Unicode::LineBreak

They're Spartan, but they'll give you an idea.  I couldn't (well, wouldn't
*want* to) do East Asian text segmentation without them, and that's
something I've done a bit of lately.  Like all the good 3rd-party Unicode
modules in Perl, those two come from Asia.  They're often the driving force
behind progress in Unicode support.  With all those complicated scripts,
you can certainly see why, too.
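
On the JDK side, the closest non-regex facility I know of is
java.text.BreakIterator's character instance; how faithfully it tracks
UAX#29 extended grapheme clusters is a separate question, but a minimal
chunking sketch looks like this:

    import java.text.BreakIterator;
    import java.util.ArrayList;
    import java.util.List;

    public class GraphemeChunks {
        public static void main(String[] args) {
            // "e" + COMBINING ACUTE ACCENT, then "x": two user-perceived
            // characters.
            String s = "e\u0301x";

            BreakIterator bi = BreakIterator.getCharacterInstance();
            bi.setText(s);

            List<String> chunks = new ArrayList<String>();
            int start = bi.first();
            for (int end = bi.next(); end != BreakIterator.DONE;
                     start = end, end = bi.next()) {
                chunks.add(s.substring(start, end));
            }
            System.out.println(chunks);   // expect [é, x] if marks stay attached
        }
    }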

Hope this helps!

--tom

