<i18n dev> Summary of tr18 Level 1 compliance findings
Tom Christiansen
tchrist at perl.com
Sun Jan 23 12:52:24 PST 2011
Here is a summary of my findings:
    Compliance  Req Num  Description
    ???         RL1.1    Hex Notation
    no          RL1.2    Properties
    no          RL1.2a   Compatibility Properties
    yes         RL1.3    Subtraction and Intersection
    no          RL1.4    Simple Word Boundaries
    yes         RL1.5    Simple Loose Matches
    yes         RL1.6    Line Boundaries
    ???         RL1.7    Code Points
Because there is at least one unmet requirement for Level 1 Unicode Support
in regular expressions, Java is not currently Level 1 compliant and so does
not provide even the most basic level of functionality needed for working
with regexes according to version 6.0 of the Unicode Standard.
I have not assessed the work required to allow it to become so.
Notes:
    RL1.1   This is marked questionable because I am of the opinion
            that the requirement of being able to specify a code point
            using hex notation without regard to its internal or external
            serialized representation is not met, but Sherman is of the
            opinion that it is. However, it is low priority and easily
            remedied through the addition of \x{XXXX}.
    RL1.2   This has many different sorts of problems.
    RL1.2a  This has several problems.
    RL1.4   This does not meet the requirements.
    RL1.7   This is very close, save for the problem of ill-formed UTF-16.
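
To make the RL1.1 note concrete, here is a minimal sketch of one way to
name a code point by number without writing out its UTF-16 surrogates by
hand. It builds the literal from the code point and quotes it, which is a
workaround and not the \x{XXXX} syntax asked for above. The class name
CodePointLiteral and the helper forCodePoint are hypothetical, chosen only
for this illustration.

```java
import java.util.regex.Pattern;

public class CodePointLiteral {
    // Hypothetical helper: builds a pattern that matches exactly the given
    // code point, whatever its UTF-16 serialization turns out to be.
    static Pattern forCodePoint(int codePoint) {
        String literal = new String(Character.toChars(codePoint));
        return Pattern.compile(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        // MATHEMATICAL SCRIPT CAPITAL A, a code point outside the BMP.
        int scriptA = 0x1D49C;
        String text = "x" + new String(Character.toChars(scriptA)) + "y";

        // The caller only ever mentions the code point number.
        System.out.println(forCodePoint(scriptA).matcher(text).find());

        // Three code points, although the string holds four UTF-16 units.
        System.out.println(text.codePointCount(0, text.length()));
    }
}
```

The point of the sketch is only that the pattern author states 0x1D49C
once; whether a given Java release also accepts a hex escape of that form
directly in the pattern is exactly the open question in the note.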
Furthermore, tr18 has exactly two strong recommendations,
both of which Java fails to follow.
Strong Recommendation #1:
The recommended names for UCD properties and property values are in
PropertyAliases.txt [Prop] and PropertyValueAliases.txt [PropValue].
There are both abbreviated names and longer, more descriptive names. It
is strongly recommended that both names be recognized, and that loose
matching of property names be used, whereby the case distinctions,
whitespace, hyphens, and underbar are ignored.
Java fails to meet this recommendation in many ways:
    SR1.0: Java does not allow for the loose matching of property names.
    SR1.1: Java does not use the recommended names for UCD
           properties and values.
    SR1.2: Java omits most of those recommended names.
    SR1.3: Java uses some recommended names contrary to their
           required definitions.
    SR1.4: Java does not allow both the abbreviated names
           like \p{Nl} and the longer \p{Letter_Number} version.
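
The SR1.0 and SR1.4 points can be checked on any given Java release with a
small probe. The following is a minimal sketch, assuming only the standard
java.util.regex API; the class name PropertyNameProbe and the helper
compiles are hypothetical. It prints what the running JVM accepts and does
not assume the outcome for the long or loosely spelled names, since that
may differ between releases.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class PropertyNameProbe {
    // Hypothetical helper: true if the regex engine accepts the pattern,
    // false if it rejects it with a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Short name, long name, and two loose spellings of the long name.
        String[] spellings = {
            "\\p{Nl}",
            "\\p{Letter_Number}",
            "\\p{letter number}",
            "\\p{LETTER-NUMBER}"
        };
        for (String s : spellings) {
            System.out.println(s + " accepted: " + compiles(s));
        }
    }
}
```

Under the strong recommendation quoted above, all four spellings would be
accepted and would denote the same set of characters.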
The other strong recommendation is this one.
Strong Recommendation #2:
It is strongly recommended that there be a regular expression
meta-character, such as "\R", for matching all line ending
characters and sequences listed above (e.g. in #1). It would
thus be shorthand for:
( \u000D\u000A | [\u000A\u000B\u000C\u000D\u0085\u2028\u2029] )
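
Until such a metacharacter is available, the shorthand can be spelled out
by hand. Here is a minimal sketch that compiles the expression quoted above
as a Java pattern and uses it to split text into lines; the class name
LineBreaks and the helper lines are hypothetical names for this example.

```java
import java.util.regex.Pattern;

public class LineBreaks {
    // The alternation from the recommendation: CR LF is tried first so the
    // pair counts as one line ending, then any single line-ending character.
    static final Pattern LINEBREAK = Pattern.compile(
        "(?:\\u000D\\u000A"
        + "|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])");

    // Hypothetical helper: splits text at every line ending and keeps
    // trailing empty lines.
    static String[] lines(String text) {
        return LINEBREAK.split(text, -1);
    }

    public static void main(String[] args) {
        // Mixes CR LF, LF, LINE SEPARATOR and NEXT LINE as line endings.
        String text = "a\r\nb\nc" + (char) 0x2028 + "d" + (char) 0x85 + "e";
        for (String line : lines(text)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

Putting the two-character sequence ahead of the character class matters:
with the order reversed, a CR LF pair would be counted as two line endings.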
For Java to legitimately claim Level 1 compliance with Unicode 6.0
according to tr18, it must at a bare minimum correct all the "no"
compliance categories to "yes". Without that, the claim is false.
For Java to be *useful* for processing Unicode text, it should go
beyond these barest of minima. A good starting point in that
direction would be to finally satisfy SR#1 and SR#2 above.
I would also like to see the two "???" matters cleared up, because
I believe the intention and pre-existing belief is that they *do*
work. Or at least, that they should--bugs notwithstanding.
Java also claims it meets RL2.1 on Canonical Compatibility. This
is another area where I believe the intention and pre-existing belief
are that it meets that requirement, but where edge-case bugs get
in the way of doing so.
I hope this finally answers your question about why I don't believe
Java's regexes meet Level 1 requirements, the minimal functionality
needed for handling Unicode text in regular expressions per tr18.
To end on a positive note, I am very much looking forward to \X
working for grapheme clusters, and very preferably for extended
grapheme clusters, not merely legacy ones.
--tom