<i18n dev> RL1.4 Simple Word Boundaries (actually, RL1.2 & RL1.2a)
Tom Christiansen
tchrist at perl.com
Mon Jan 24 08:50:48 PST 2011
Sherman wrote:
> Thanks for the detailed and excellent "reality check". While I'm still
> going through all the details it appears that the fact the current
> Java Unicode property data does not include the properties defined in
> PropList.txt (current implementation reads the property data only from
> UnicodeData.txt, Scripts, Blocks and SpecialCasing.txt) contributes
> to lots of issues raised, which means property data of
> Other_Alphabetic/Lowercase/Uppercase and White_Space are not available
> for j.u.regex and j.l.Character.
Ahah, *finally* I begin to understand!
If you are reading property values from nothing but those four files,
that explains why most Unicode properties are missing in Java
regexes. *That's* why it always seems that Java supports only the UCD
3.0 properties, plus Blocks and now Scripts added since then.
I suspect no one has ever taken a *really* good look at tr18,
tr44, and the layout of the UCD since j.u.regex was first written.
Could this perhaps be possible? It would explain much. I'm sure
you looked at the script stuff, but there's a lot more happening
in the UCD these days than there was back when j.u.regex was written.
Unicode 6.0.0 has 112 properties but as far as I can tell, j.u.regex
supports only 3 of those: General_Category, Script, and Block. Not all
112 are critical, or even generally useful, but at least four of them
*are* on tr18's RL1.2 list of required properties, plus a few more if
you count RL1.2a. So it is very important that they be there.
PropList.txt governs the following properties, not all of which are binary:
    ASCII_Hex_Digit       Join_Control                        Other_Uppercase
    Bidi_Control          Logical_Order_Exception             Pattern_Syntax
    Dash                  Noncharacter_Code_Point             Pattern_White_Space
    Deprecated            Other_Alphabetic                    Quotation_Mark
    Diacritic             Other_Default_Ignorable_Code_Point  Soft_Dotted
    Extender              Other_Grapheme_Extend               STerm
    Hyphen                Other_ID_Continue                   Terminal_Punctuation
    Ideographic           Other_ID_Start                      Unified_Ideograph
    IDS_Binary_Operator   Other_Lowercase                     Variation_Selector
    IDS_Trinary_Operator  Other_Math                          White_Space
Some of the properties listed above are then used to help establish
these derived properties from DerivedCoreProperties.txt:
    Alphabetic               Changes_When_Lowercased       Grapheme_Extend  Math
    Cased                    Changes_When_Titlecased       Grapheme_Link    Uppercase
    Case_Ignorable           Changes_When_Uppercased       ID_Continue      XID_Continue
    Changes_When_Casefolded  Default_Ignorable_Code_Point  ID_Start         XID_Start
    Changes_When_Casemapped  Grapheme_Base                 Lowercase
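To make the relationship between the two files concrete, here is a minimal
Python sketch of one derivation. It assumes the rule that the derived
Uppercase property is General_Category=Lu plus Other_Uppercase. The two
sample lines are written in the PropList.txt layout and are illustrative
only, and the function names are hypothetical, not anything from Perl or
the JDK.

```python
import re
import unicodedata

# Sample lines in the PropList.txt layout: "start[..end] ; Property # comment".
# Illustrative only; a real build would read the whole file from the UCD.
SAMPLE_PROPLIST = """\
2160..216F    ; Other_Uppercase # Nl  ROMAN NUMERAL ONE..ROMAN NUMERAL ONE THOUSAND
24B6..24CF    ; Other_Uppercase # So  CIRCLED LATIN CAPITAL LETTER A..Z
"""

LINE = re.compile(r"^([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)")


def parse_binary_properties(text):
    """Return {property_name: set_of_code_points} from PropList-style text."""
    props = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        m = LINE.match(line)
        if m is None:
            continue
        start = int(m.group(1), 16)
        end = int(m.group(2) or m.group(1), 16)  # single code point if no range
        props.setdefault(m.group(3), set()).update(range(start, end + 1))
    return props


def is_uppercase(cp, props):
    """Derived Uppercase, assumed here to be Lu plus Other_Uppercase."""
    return (unicodedata.category(chr(cp)) == "Lu"
            or cp in props.get("Other_Uppercase", set()))


PROPS = parse_binary_properties(SAMPLE_PROPLIST)
```

With only General_Category to go on, U+2160 ROMAN NUMERAL ONE looks like a
letter number (Nl) and nothing more. It counts as Uppercase only once the
Other_Uppercase data is loaded.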
Not counting the sets of General_Category=XXX and Script=XXX properties,
those properties above probably include the most important ones--although
there are many more. The PropertyAliases.txt file contains the list of
*all* top-level Unicode property names and their short-cut aliases. There
are 112 official properties of Unicode 6.0.0, and many of these are
populated using files other than the 4 that you mention. I include the
list of these, under their longest aliases, at the bottom of this message.
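As a rough illustration of how an alias table built from
PropertyAliases.txt might be consulted, here is a small Python sketch. The
four sample lines follow that file's "short ; long" layout. The folding
rule (ignore case, whitespace, hyphens, and underscores) is a simplified
reading of the loose matching that tr18 recommends, and the function names
are hypothetical.

```python
import re

# Sample lines in the PropertyAliases.txt layout: "short ; long [; more...]".
SAMPLE_ALIASES = """\
AHex      ; ASCII_Hex_Digit
Alpha     ; Alphabetic
gc        ; General_Category
WSpace    ; White_Space
"""


def loose(name):
    """Fold a property name: ignore case, whitespace, hyphens, underscores."""
    return re.sub(r"[\s_-]+", "", name).lower()


def build_alias_table(text):
    """Map every loosely folded alias to the long name (the second field)."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop any trailing comment
        if not line:
            continue
        names = [field.strip() for field in line.split(";")]
        if len(names) < 2:
            continue
        canonical = names[1]
        for name in names:
            table[loose(name)] = canonical
    return table


ALIASES = build_alias_table(SAMPLE_ALIASES)


def resolve(name):
    """Return the long property name, or None if the name is unknown."""
    return ALIASES.get(loose(name))
```

Under these assumptions, resolve("wspace"), resolve("White Space"), and
resolve("white-space") all come back as White_Space.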
Perl handles all official Unicode properties because it
employs a very elaborate build system that generates not only all the
tables needed, but also documentation and test cases. To set up its
property tables, at build time Perl processes all of the *.txt,
extracted/*.txt, and auxiliary/*.txt files from the Unicode Character
Database. These are in the lib/unicore/ subdirectory of Perl's top-level
source directory if you're interested. It does this using the mktables
script, also located in lib/unicore. Perl's build ignores provisional-only
properties so people don't get used to something that may go away, but
handles all the rest of them.
The mktables program is large and fairly complex, although well structured
into a set of co-operating packages and classes (all in the same file!) and
very well documented. I include at the end of this message an excerpt of
the internal documentation from mktables that explains its overall approach.
> j.u.regex is trying the "closest" possible set for the alphabetic,
> lower/uppercase,
I see now: you just don't have any better data available to you for
this. There is little you can do about that until that data
becomes available to/from Java. Once it does, though, the rest should
follow pretty directly. But it's not at all a small issue that's easily
patched up. It will require some serious design and testing. It would
be a good goal for JDK8, I think.
> I will file a RFE to trace this issue.
Thank you very very much. I will answer the other half of your message,
the part about RL1.4, later on today. Hope this helps!
--tom
Here are all 112 official Unicode 6.0.0 properties. Some are intended to
be "internal only" because they are used to generate other higher-level
properties (like Other_Alphabetic, used to help generate Alphabetic), while
a few have been deprecated (like the legacy binary Hyphen property, replaced
by the more fine-grained Word_Break=XXX properties):
    Age                           General_Category         Other_Alphabetic
    Alphabetic                    Grapheme_Base            Other_Default_Ignorable_Code_Point
    ASCII_Hex_Digit               Grapheme_Cluster_Break   Other_Grapheme_Extend
    Bidi_Class                    Grapheme_Extend          Other_ID_Continue
    Bidi_Control                  Grapheme_Link            Other_ID_Start
    Bidi_Mirrored                 Hangul_Syllable_Type     Other_Lowercase
    Bidi_Mirroring_Glyph          Hex_Digit                Other_Math
    Block                         Hyphen                   Other_Uppercase
    Canonical_Combining_Class     ID_Continue              Pattern_Syntax
    Cased                         Ideographic              Pattern_White_Space
    Case_Folding                  IDS_Binary_Operator      Quotation_Mark
    Case_Ignorable                ID_Start                 Radical
    Changes_When_Casefolded       IDS_Trinary_Operator     Script
    Changes_When_Casemapped       ISO_Comment              Sentence_Break
    Changes_When_Lowercased       Jamo_Short_Name          Simple_Case_Folding
    Changes_When_NFKC_Casefolded  Join_Control             Simple_Lowercase_Mapping
    Changes_When_Titlecased       Joining_Group            Simple_Titlecase_Mapping
    Changes_When_Uppercased       Joining_Type             Simple_Uppercase_Mapping
    Composition_Exclusion         Line_Break               Soft_Dotted
    Dash                          Logical_Order_Exception  STerm
    Decomposition_Mapping         Lowercase                Terminal_Punctuation
    Decomposition_Type            Lowercase_Mapping        Titlecase_Mapping
    Default_Ignorable_Code_Point  Math                     Unicode_1_Name
    Deprecated                    Name                     Unicode_Radical_Stroke
    Diacritic                     Name_Alias               Unified_Ideograph
    East_Asian_Width              NFC_Quick_Check          Uppercase
    Expands_On_NFC                NFD_Quick_Check          Uppercase_Mapping
    Expands_On_NFD                NFKC_Casefold            Variation_Selector
    Expands_On_NFKC               NFKC_Quick_Check         White_Space
    Expands_On_NFKD               NFKD_Quick_Check         Word_Break
    Extender                      Noncharacter_Code_Point  XID_Continue
    FC_NFKC_Closure               Numeric_Type             XID_Start
    Full_Composition_Exclusion    Numeric_Value
The guts of the mktables program's algorithm are explained here:
# This program works on all non-provisional properties as of 6.0, though the
# files for some are suppressed from apparent lack of demand for them. You
# can change which are output by changing lists in this program.
#
# The old version of mktables emphasized the term "Fuzzy" to mean Unicode's
# loose matchings rules (from Unicode TR18):
#
#     The recommended names for UCD properties and property values are in
#     PropertyAliases.txt [Prop] and PropertyValueAliases.txt
#     [PropValue]. There are both abbreviated names and longer, more
#     descriptive names. It is strongly recommended that both names be
#     recognized, and that loose matching of property names be used,
#     whereby the case distinctions, whitespace, hyphens, and underbar
#     are ignored.
#
# The program still allows Fuzzy to override its determination of if loose
# matching should be used, but it isn't currently used, as it is no longer
# needed; the calculations it makes are good enough.
#
# SUMMARY OF HOW IT WORKS:
#   Each file on the list is processed in a loop, using the associated handler
#   code for each:
#       The PropertyAliases.txt and PropValueAliases.txt files are processed
#           first. These files name the properties and property values.
#           Objects are created of all the property and property value names
#           that the rest of the input should expect, including all synonyms.
#       The other input files give mappings from properties to property
#           values. That is, they list code points and say what the mapping
#           is under the given property. Some files give the mappings for
#           just one property; and some for many. This program goes through
#           each file and populates the properties from them. Some properties
#           are listed in more than one file, and Unicode has set up a
#           precedence as to which has priority if there is a conflict. Thus
#           the order of processing matters, and this program handles the
#           conflict possibility by processing the overriding input files
#           last, so that if necessary they replace earlier values.
#   After this is all done, the program creates the property mappings not
#   furnished by Unicode, but derivable from what it does give.
#   The tables of code points that match each property value in each
#   property that is accessible by regular expressions are created.
#   The Perl-defined properties are created and populated. Many of these
#   require data determined from the earlier steps
#   Any Perl-defined synonyms are created, and name clashes between Perl
#   and Unicode are reconciled and warned about.
#   All the properties are written to files
#   Any other files are written, and final warnings issued.
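The "overriding files last" idea in that summary can be sketched in a few
lines of Python. This is not mktables itself, just a toy model of the
processing loop. The file names, the handlers, and Example_Property are all
hypothetical stand-ins for real parsed UCD input.

```python
# Hypothetical stand-ins for parsed UCD files: each handler yields
# (code_point, property, value) triples. All names and values are made up.
def read_base_file():
    yield 0x0041, "Example_Property", "from_base"
    yield 0x0042, "Example_Property", "from_base"


def read_override_file():
    yield 0x0041, "Example_Property", "from_override"


# Order matters: overriding inputs go last so that their values
# replace anything set by an earlier file.
INPUT_FILES = [
    ("base.txt", read_base_file),
    ("override.txt", read_override_file),
]


def populate(input_files):
    """Run each file's handler in order; later assignments win."""
    table = {}  # property -> {code_point: value}
    for _filename, handler in input_files:
        for cp, prop, value in handler():
            table.setdefault(prop, {})[cp] = value
    return table


def invert(mapping):
    """Build value -> sorted code points, the shape a matcher would query."""
    by_value = {}
    for cp, value in mapping.items():
        by_value.setdefault(value, []).append(cp)
    return {value: sorted(cps) for value, cps in by_value.items()}


TABLE = populate(INPUT_FILES)
MATCH_TABLES = invert(TABLE["Example_Property"])
```

Here U+0041 ends up with the value from the overriding file and U+0042
keeps the base value. The match tables are built only after every input
has been read, mirroring the order of steps in the summary.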