<i18n dev> Codereview Request: 7039066 j.u.regex does not match TR#18 RL1.4 Simple Word Boundaries and RL1.2 Properties

Xueming Shen xueming.shen at oracle.com
Sun Apr 24 22:59:49 PDT 2011


  Thanks Tom!

j.u.regex does not have its own direct access to PropList for now, so it
has to use the properties exposed by the j.l.Character class. To change that,
I would have to move the CharacterDataNN classes from the java.lang package
(package-private) to sun.lang or somewhere that both j.l.Character and
j.u.regex can access. Given that we are so late in the release, I'm trying
to make as few changes as possible. I will redo the property access part
in JDK 8 to (1) provide more useful properties, or all of them (? :-), maybe
with better performance), and (2) as you suggested, implement \X, which needs
access to properties that are probably not available via the j.l.Character
class.

Noncharacter_Code_Point is not difficult to support: those code points
can simply be hard-coded.
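
A minimal sketch of that hard-coding, assuming a standalone helper (the class
and method names are hypothetical, not existing JDK API). The noncharacters
are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes,
66 in total:

```java
class NoncharacterCheck {
    // Hypothetical helper, not part of j.l.Character or j.u.regex.
    static boolean isNoncharacter(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            return false; // outside U+0000..U+10FFFF
        }
        // U+FDD0..U+FDEF, or any code point whose low 16 bits are FFFE or FFFF
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    public static void main(String[] args) {
        int count = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isNoncharacter(cp)) {
                count++;
            }
        }
        System.out.println(count); // prints 66
    }
}
```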

-- "You'll certainly want to somehow make that available for Strings, too;"

Yes, I remember I "joked" with a language guy about the possibility of
adding a literal syntax for Unicode character names to String and "char",
for example '\u{LATIN_CAPITAL_LETTER_A}'. I got "haha, interesting" :-) It's
going to be a language spec change.

-Sherman


On 4/24/2011 8:47 PM, Tom Christiansen wrote:
> Xueming, the docs look good.
>
> On the name of the flag, I have no strong feelings one way or the other.
> Perhaps between UNICODE_PROPERTIES and UNICODE_CLASSES, I would prefer
> the second one.  The first makes me think of the regular properties like
> \p{Script=Greek} from RL1.2, not the compat properties from RL1.2a like
> \w and \s.  (I realize that the POSIX-compat stuff overlaps a bit.)
>
> One thing you might want to take a look at longer term is whether
> all 11 properties listed in RL1.2 will be accessible.  I don't
> know whether you have methods in Character for #7 and #8 below;
> I seem to recall seeing one but not the other.
>
>      RL1.2       Properties
>          To meet this requirement, an implementation shall provide
>          at least a minimal list of properties, consisting of the
>          following:
>
>               yes   1  General_Category
>               yes   2  Script
>               yes   3  Alphabetic
>               yes   4  Uppercase
>               yes   5  Lowercase
>               yes   6  White_Space
>                ?    7  Noncharacter_Code_Point
>                ?    8  Default_Ignorable_Code_Point
>                ?    9  ANY
>               yes  10  ASCII
>                ?   11  ASSIGNED
>
>      RL1.2a      Compatibility Properties
>          To meet this requirement, an implementation shall provide the
>          properties listed in Annex C. Compatibility Properties, with the
>          property values as listed there. Such an implementation shall
>          document whether it is using the Standard Recommendation or
>          POSIX-compatible properties.
>
> Other things...
>
> I've thought a bit about whether it's worth pointing out that \x{h..h}
> is the *only* way to (currently?) get non-BMP code points into a
> bracketed character class, like for example [\x{1D400}-\x{1D419}] to
> mean MATHEMATICAL BOLD CAPITAL A through MATHEMATICAL BOLD CAPITAL Z.
>
> Reasons for not mentioning it include how rarely (I imagine) users come
> across it, and also because this is one of those rare places in Java where
> you can't treat UTF-16 code units separately and get the same results.
> This matters for interpolation, because you can never build up a character
> class by using [ ] to surround the two different 16-bit char units that
> non-BMP codepoints turn into.  You always have to use the indirect \x{h..h}
> instead.  This is rather non-obvious.
>
> However, after consideration I think it is probably not worth risking
> confusing people by talking about it.  I'm just very glad it can now
> be done.  Having to figure out pieces of UTF-16 is no fun.
>
> Once the current effort is done and you've had a well-deserved rest,
> I know you were thinking about \N{NAME} and \X for a future version
> of the JDK.  Both are important, although for quite different reasons.
>
> \N{NAME} is important for helping make regex code more maintainable by
> being self-documenting, since having to put raw "magic" numbers instead of
> symbolic names in code is always bad.  You'll certainly want to somehow
> make that available for Strings, too; not sure how to do that.  The regex
> string escapes and the Java String escapes have already diverged, and I
> don't know how that happened.  For example, the rules for octal escapes
> differ, and the regex engine supports things that Java proper does not;
> the "\cA" style comes to mind, but I think there are a few others, too.
> And now there is "\x{h..h}" too; pity Strings don't know that one.
>
> \X is important because you really do need access to graphemes.  Otherwise
> it is very difficult to write the equivalent of (?:(?=[aeiou])\X), which
> assuming NFD, will match a grapheme that starts with one of those five
> letters.  More importantly, you have to be able to chunk text by graphemes,
> and you need to do this even if you don't someday make a way to tie \b to
> the fancier sense like the ICU (?w) UREGEX_UWORD flag provides.
>
> Getting grapheme clusters right is harder than it might appear.  A
> careful reading of UAX#29 is important.  There are two kinds of grapheme
> clusters, either legacy or extended.  The extended version is tricky to
> get right, especially when you don't have access to all the syllable
> type properties. One problem with the legacy version is that it breaks
> up things that it shouldn't.  We switched to the extended version for
> the 5.12 release of Perl, as this shows:
>
>      $ perl5.10.0 -le 'print "\r\n" =~ /\A\X\z/ ? 1 : 0'
>      0
>
>      $ perl5.12.3 -le 'print "\r\n" =~ /\A\X\z/ ? 1 : 0'
>      1
>
> Which is the way it really needs to be.  For the legacy sense, you can
> always still code that one up more explicitly:
>
>      $ perl5.12.3 -le 'print "\r\n" =~ /\A\p{Grapheme_Base}\p{Grapheme_Extend}*\z/ ? 1 : 0'
>      0
>
> But I don't think people often want that version; extended is much better.
>
> Giving Java access to the properties needed for either version of grapheme
> clusters may be a good time to reconsider whether you might wish to
> redesign how you check whether a code point has any particular Unicode
> property.  There are of course performance issues, but there is also the
> fact that the current mechanism does not seem easily extended to support
> the full complement of Unicode properties (which is a new Level-2 RL).
>
> So if you are going to widen those again so you can have the properties
> at your fingertips that you need to support grapheme clusters, it might
> be worth thinking whether to refactor now or not.  Performance and the
> size of tables will someday become an issue, if not now.  I do know the
> Unicode docs mention this concern; I also know that we have people
> rethinking how Perl grants access to properties, because even though we
> do give you all of them, it could be done more tidily.
>
> I haven't thought much about what the non-regex interface to graphemes and
> such should look like.  Besides the ICU stuff, something you might want to
> take a look at, just to see the sorts of things others are doing in the
> non-regex grapheme arena, is these two classes:
>
>      http://search.cpan.org/perldoc?Unicode::GCString
>      http://search.cpan.org/perldoc?Unicode::LineBreak
>
> It's Spartan, but it will give you an idea.  I couldn't (well, wouldn't
> *want* to) do East Asian text segmentation without those, which is
> something I've done a bit of lately.  Like all the good 3rd-party Unicode
> modules in Perl, those two come from Asia.  They're often the driving force
> behind progress in Unicode support.  With all those complicated scripts,
> you can certainly see why, too.
>
> Hope this helps!
>
> --tom
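
On the \x{h..h} point in the quoted message: a small sketch of a bracketed
class over non-BMP code points, using the MATHEMATICAL BOLD CAPITAL A..Z
range from Tom's example. The class and method names are only for
illustration; the \x{h..h} escape itself is what the pattern syntax provides.

```java
import java.util.regex.Pattern;

class SupplementaryClassDemo {
    // U+1D400..U+1D419 written with \x{h..h}, so no surrogate halves
    // ever appear in the pattern text.
    static final Pattern BOLD_CAPS = Pattern.compile("[\\x{1D400}-\\x{1D419}]+");

    // Illustrative helper: true if the whole string lies in that range.
    static boolean isBoldCaps(String s) {
        return BOLD_CAPS.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Three code points, each stored as two UTF-16 chars.
        String bold = new StringBuilder()
                .appendCodePoint(0x1D400)
                .appendCodePoint(0x1D401)
                .appendCodePoint(0x1D419)
                .toString();
        System.out.println(bold.length());     // 6 chars for 3 code points
        System.out.println(isBoldCaps(bold));  // true
        System.out.println(isBoldCaps("ABC")); // false
    }
}
```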

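On chunking text by graphemes outside the regex engine: until \X exists, the
closest thing already in the JDK is java.text.BreakIterator. The sketch below
is only an approximation of what Tom describes. The helper name is made up,
and the boundary rules are whatever the running JDK implements, which may not
coincide with UAX#29 extended grapheme clusters.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class GraphemeChunks {
    // Hypothetical helper: split text at the boundaries reported by
    // BreakIterator's character instance (user-perceived characters).
    static List<String> chunks(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // "e" followed by COMBINING ACUTE ACCENT (NFD form), then "x"
        for (String chunk : chunks("e\u0301x")) {
            System.out.println(chunk.length() + " char(s)");
        }
    }
}
```

The base letter and its combining mark come back as one chunk, which is the
behaviour you want before applying a test like "starts with one of [aeiou]".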

