<i18n dev> RL1.4 Simple Word Boundaries (actually, RL1.2 & RL1.2a)

Tom Christiansen tchrist at perl.com
Mon Jan 24 08:50:48 PST 2011


Sherman wrote:

> Thanks for the detailed and excellent "reality check". While I'm still
> going through all the details it appears that the fact the current
> Java Unicode property data does not include the properties defined in
> PropList.txt (current implementation reads the property data only from
> UnicodeData.txt, Scripts, Blocks and SpecialCasing.txt,) contributes
> to lots of issues raised, which means property data of
> Other_Alphabetic/Lowercase/Uppercase and White_Space are not available
> for j.u.regex and j.l.Character.

Ahah, *finally* I begin to understand!  

If you are reading property values from those four files alone, that
explains why most Unicode properties are missing in Java regexes.
*That's* why it always seems that Java supports only the UCD 3.0
properties, plus Blocks and now Scripts added since then.

I suspect no one has ever taken a *really* good look at tr18,
tr44, and the layout of the UCD since j.u.regex was first written.
Could this perhaps be possible?  It would explain much.  I'm sure
you looked at the script stuff, but there's a lot more happening
in the UCD these days than there was back when j.u.regex was written.

Unicode 6.0.0 has 112 properties but as far as I can tell, j.u.regex
supports only 3 of those: General_Category, Script, and Block.  Not all
112 are critical, or even generally useful, but at least four of them
*are* on tr18's RL1.2 list of required properties, plus a few more if
you count RL1.2a.  So it is very important that they be there.

PropList.txt governs the following properties, all of which are binary:

    ASCII_Hex_Digit          Join_Control                        Other_Uppercase
    Bidi_Control             Logical_Order_Exception             Pattern_Syntax
    Dash                     Noncharacter_Code_Point             Pattern_White_Space
    Deprecated               Other_Alphabetic                    Quotation_Mark
    Diacritic                Other_Default_Ignorable_Code_Point  Soft_Dotted
    Extender                 Other_Grapheme_Extend               STerm
    Hyphen                   Other_ID_Continue                   Terminal_Punctuation
    Ideographic              Other_ID_Start                      Unified_Ideograph
    IDS_Binary_Operator      Other_Lowercase                     Variation_Selector
    IDS_Trinary_Operator     Other_Math                          White_Space

Some of those properties listed above are then used to help
establish these from DerivedCoreProperties.txt:

    Alphabetic               Changes_When_Lowercased       Grapheme_Extend  Math
    Cased                    Changes_When_Titlecased       Grapheme_Link    Uppercase
    Case_Ignorable           Changes_When_Uppercased       ID_Continue      XID_Continue
    Changes_When_Casefolded  Default_Ignorable_Code_Point  ID_Start         XID_Start
    Changes_When_Casemapped  Grapheme_Base                 Lowercase

Not counting the sets of General_Category=XXX and Script=XXX properties,
those properties above probably include the most important ones--although
there are many more.  The PropertyAliases.txt file contains the list of
*all* top-level Unicode property names and their short-cut aliases.  There
are 112 official properties of Unicode 6.0.0, and many of these are
populated using files other than the 4 that you mention.  I include the
list of these in their longest aliases at the bottom of this message.
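
For what it's worth, the alias file is easy to consume. Here is a minimal sketch in Python, assuming the "short ; long ; other aliases" layout of PropertyAliases.txt and the loose matching that tr18 recommends (case, whitespace, hyphens, and underscores ignored). The sample lines are a small excerpt, and loose, parse_aliases, and resolve are names I made up for illustration.

```python
# A handful of lines in the PropertyAliases.txt layout (short ; long ; extras).
SAMPLE_ALIASES = """\
AHex   ; ASCII_Hex_Digit
Alpha  ; Alphabetic
gc     ; General_Category
sc     ; Script
WSpace ; White_Space ; space
"""


def loose(name):
    """Fold a property name for loose matching: ignore case, spaces, '_' and '-'."""
    return "".join(c for c in name.lower() if c not in " \t_-")


def parse_aliases(text):
    """Map every loosely folded alias to the long property name (second field)."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        names = [field.strip() for field in line.split(";")]
        if len(names) < 2:
            continue
        canonical = names[1]
        for alias in names:
            table[loose(alias)] = canonical
    return table


ALIASES = parse_aliases(SAMPLE_ALIASES)


def resolve(name):
    """Return the long property name for any alias, or None if unknown."""
    return ALIASES.get(loose(name))
```

With this, resolve("white space"), resolve("WSPACE"), and resolve("space") all come back as "White_Space".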

The reason Perl handles all official Unicode properties is that it
employs a very elaborate build system that generates not only all the
tables needed, but also documentation and test cases.  To set up its
property tables, at build time Perl processes all of the *.txt,
extracted/*.txt, and auxiliary/*.txt files from the Unicode Character
Database.  These are in the lib/unicore/ subdirectory of Perl's top-level
source directory if you're interested.  It does this using the mktables
script, also located in lib/unicore.  Perl's build ignores provisional-only
properties so people don't get used to something that may go away, but
handles all the rest of them.

The mktables program is large and fairly complex, although well structured
into a set of co-operating packages and classes (all in the same file!) and
very well documented.  I include at the end of this message an excerpt of
the internal documentation from mktables that explains its overall approach.

> j.u.regex is trying the "closest" possible set for the alphabetic,
> lower/uppercase,

I see now: you just don't have any better data available to you for
this. There is little you can do about that until the data becomes
available to/from Java.  Once it does, though, the rest should
follow pretty directly.  But it's not at all a small issue that's easily
patched up.  It will require some serious design and testing.  It would
be a good goal for JDK8, I think.

> I will file a RFE to trace this issue.

Thank you very very much.  I will answer the other half of your message,
the part about RL1.4, later on today.  Hope this helps!

--tom

Here are all 112 official Unicode 6.0.0 properties.  Some are intended to
be "internal only" because they are used to generate other higher-level
properties (like Other_Alphabetic, used to help generate Alphabetic), while
a few have been deprecated (like the legacy binary Hyphen property, replaced
by the more fine-grained Word_Break=XXX properties):

    Age                           General_Category         Other_Alphabetic
    Alphabetic                    Grapheme_Base            Other_Default_Ignorable_Code_Point
    ASCII_Hex_Digit               Grapheme_Cluster_Break   Other_Grapheme_Extend
    Bidi_Class                    Grapheme_Extend          Other_ID_Continue
    Bidi_Control                  Grapheme_Link            Other_ID_Start
    Bidi_Mirrored                 Hangul_Syllable_Type     Other_Lowercase
    Bidi_Mirroring_Glyph          Hex_Digit                Other_Math
    Block                         Hyphen                   Other_Uppercase
    Canonical_Combining_Class     ID_Continue              Pattern_Syntax
    Cased                         Ideographic              Pattern_White_Space
    Case_Folding                  IDS_Binary_Operator      Quotation_Mark
    Case_Ignorable                ID_Start                 Radical
    Changes_When_Casefolded       IDS_Trinary_Operator     Script
    Changes_When_Casemapped       ISO_Comment              Sentence_Break
    Changes_When_Lowercased       Jamo_Short_Name          Simple_Case_Folding
    Changes_When_NFKC_Casefolded  Join_Control             Simple_Lowercase_Mapping
    Changes_When_Titlecased       Joining_Group            Simple_Titlecase_Mapping
    Changes_When_Uppercased       Joining_Type             Simple_Uppercase_Mapping
    Composition_Exclusion         Line_Break               Soft_Dotted
    Dash                          Logical_Order_Exception  STerm
    Decomposition_Mapping         Lowercase                Terminal_Punctuation
    Decomposition_Type            Lowercase_Mapping        Titlecase_Mapping
    Default_Ignorable_Code_Point  Math                     Unicode_1_Name
    Deprecated                    Name                     Unicode_Radical_Stroke
    Diacritic                     Name_Alias               Unified_Ideograph
    East_Asian_Width              NFC_Quick_Check          Uppercase
    Expands_On_NFC                NFD_Quick_Check          Uppercase_Mapping
    Expands_On_NFD                NFKC_Casefold            Variation_Selector
    Expands_On_NFKC               NFKC_Quick_Check         White_Space
    Expands_On_NFKD               NFKD_Quick_Check         Word_Break
    Extender                      Noncharacter_Code_Point  XID_Continue
    FC_NFKC_Closure               Numeric_Type             XID_Start
    Full_Composition_Exclusion    Numeric_Value

The guts of the mktables program's algorithm are explained here:

    # This program works on all non-provisional properties as of 6.0, though the
    # files for some are suppressed from apparent lack of demand for them.  You
    # can change which are output by changing lists in this program.
    #
    # The old version of mktables emphasized the term "Fuzzy" to mean Unicode's
    # loose matchings rules (from Unicode TR18):
    #
    #    The recommended names for UCD properties and property values are in
    #    PropertyAliases.txt [Prop] and PropertyValueAliases.txt
    #    [PropValue]. There are both abbreviated names and longer, more
    #    descriptive names. It is strongly recommended that both names be
    #    recognized, and that loose matching of property names be used,
    #    whereby the case distinctions, whitespace, hyphens, and underbar
    #    are ignored.
    #
    # The program still allows Fuzzy to override its determination of if loose
    # matching should be used, but it isn't currently used, as it is no longer
    # needed; the calculations it makes are good enough.
    #
    # SUMMARY OF HOW IT WORKS:
    #   Each file on the list is processed in a loop, using the associated handler
    #   code for each:
    #        The PropertyAliases.txt and PropValueAliases.txt files are processed
    #            first.  These files name the properties and property values.
    #            Objects are created of all the property and property value names
    #            that the rest of the input should expect, including all synonyms.
    #        The other input files give mappings from properties to property
    #           values.  That is, they list code points and say what the mapping
    #           is under the given property.  Some files give the mappings for
    #           just one property; and some for many.  This program goes through
    #           each file and populates the properties from them.  Some properties
    #           are listed in more than one file, and Unicode has set up a
    #           precedence as to which has priority if there is a conflict.  Thus
    #           the order of processing matters, and this program handles the
    #           conflict possibility by processing the overriding input files
    #           last, so that if necessary they replace earlier values.
    #        After this is all done, the program creates the property mappings not
    #            furnished by Unicode, but derivable from what it does give.
    #        The tables of code points that match each property value in each
    #            property that is accessible by regular expressions are created.
    #        The Perl-defined properties are created and populated.  Many of these
    #            require data determined from the earlier steps
    #        Any Perl-defined synonyms are created, and name clashes between Perl
    #            and Unicode are reconciled and warned about.
    #        All the properties are written to files
    #        Any other files are written, and final warnings issued.
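
The ordering rule in that summary (overriding input files are processed last, so they replace earlier values) can be sketched in a few lines of Python. This is not how mktables itself is written, which is Perl and far more elaborate; the file names, the property name, and the values below are all hypothetical.

```python
def populate(inputs):
    """Build {property: {code_point: value}} from an ordered list of inputs.

    Each input is (filename, {property: {code_point: value}}).  Inputs are
    applied in list order, so a later file silently replaces an earlier value
    for the same property and code point.
    """
    props = {}
    for _filename, mappings in inputs:
        for prop, by_code_point in mappings.items():
            props.setdefault(prop, {}).update(by_code_point)
    return props


# Hypothetical inputs: the second file has priority, so it is listed last.
INPUTS = [
    ("base.txt", {"Example_Prop": {0x41: "X", 0x42: "X"}}),
    ("overrides.txt", {"Example_Prop": {0x42: "Y"}}),
]

TABLES = populate(INPUTS)
```

Here U+0042 ends up with the value from overrides.txt, while U+0041 keeps the one from base.txt.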

