<i18n dev> Summary of tr18 Level 1 compliance findings

Sun Jan 23 12:52:24 PST 2011

Here is a summary of my findings:

    Compliance      Req Num     Description

       ???           RL1.1      Hex Notation
        no           RL1.2      Properties
        no           RL1.2a     Compatibility Properties
       yes           RL1.3      Subtraction and Intersection
        no           RL1.4      Simple Word Boundaries
       yes           RL1.5      Simple Loose Matches
       yes           RL1.6      Line Boundaries
       ???           RL1.7      Code Points

Because there is at least one unmet requirement for Level 1 Unicode Support
in regular expressions, Java is not currently Level 1 compliant and so does
not provide even the most basic level of functionality needed for working
with regexes according to version 6.0 of the Unicode Standard.

I have not assessed the work required to allow it to become so.

Notes:

    RL1.1   This is marked questionable because I am of the opinion
            that the requirement of being able to specify a code point
            using hex notation without regard to its internal or external
            serialized representation is not met, but Sherman is of the
            opinion that it is.  However, it is low priority and easily
            remedied through the addition of \x{XXXX}.

    RL1.2   This has many different sorts of problems.

    RL1.2a  This has several problems.

    RL1.4   This does not meet the requirements.

    RL1.7   This is very close, save for the problem of ill-formed UTF-16.
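
To illustrate the RL1.1 note: the point of hex notation is to name a
code point by its scalar value, not by its UTF-16 code units.  Here is a
minimal sketch of a workaround that does this indirectly.  The class and
the helper forCodePoint are hypothetical names of my own, not part of
java.util.regex; only Character.toChars, Pattern.quote, and
Pattern.compile are standard library calls.

```java
import java.util.regex.Pattern;

class CodePointPattern {
    // Hypothetical helper: build a Pattern that matches exactly one code
    // point, given its scalar value, so the caller never has to spell out
    // a surrogate pair by hand.
    static Pattern forCodePoint(int codePoint) {
        // Character.toChars produces one or two UTF-16 code units and
        // throws IllegalArgumentException for an invalid code point.
        String literal = new String(Character.toChars(codePoint));
        return Pattern.compile(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        int doubleStruckA = 0x1D538; // MATHEMATICAL DOUBLE-STRUCK CAPITAL A
        String text = new String(Character.toChars(doubleStruckA));
        // One code point, but two UTF-16 code units in the String.
        System.out.println(text.length());                                    // 2
        System.out.println(forCodePoint(doubleStruckA).matcher(text).matches()); // true
    }
}
```

This only sidesteps the notation problem from program code; it does not
help anyone who has to write the pattern itself as text.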

Furthermore, tr18 has exactly two strong recommendations,
both of which Java fails to follow.

    Strong Recommendation #1:

    The recommended names for UCD properties and property values are in
    PropertyAliases.txt [Prop] and PropertyValueAliases.txt [PropValue].
    There are both abbreviated names and longer, more descriptive names. It
    is strongly recommended that both names be recognized, and that loose
    matching of property names be used, whereby the case distinctions,
    whitespace, hyphens, and underbar are ignored.

Java fails to meet this recommendation in many ways:

    SR1.0: Java does not allow for the loose matching of property names.

    SR1.1: Java does not use the recommended names for UCD
           properties and values.

    SR1.2: Java omits most of those recommended names.

    SR1.3: Java uses some recommended names contrary to their
           required definitions.

    SR1.4: Java does not allow both the abbreviated names
           like \p{Nl} and the longer \p{Letter_Number} version.
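
The loose matching described in SR#1 can be sketched in a few lines.
This is only an illustration under stated assumptions: the class, the
helper names looseKey and resolve, and the two-entry alias table are
hypothetical, and the folding implements just the rule quoted above
(ignore case, whitespace, hyphens, and underbar), not every detail of
the Unicode loose-matching rules.  A real table would be loaded from
PropertyValueAliases.txt.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

class LoosePropertyName {
    // Hypothetical alias table keyed by the folded name.  Only the pair
    // mentioned in SR1.4 is listed here.
    private static final Map<String, String> ALIASES = new HashMap<String, String>();
    static {
        ALIASES.put(looseKey("Nl"), "Nl");
        ALIASES.put(looseKey("Letter_Number"), "Nl");
    }

    // Fold a name for comparison: drop whitespace, hyphens, and
    // underscores, and lowercase everything else.
    static String looseKey(String name) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isWhitespace(c) || c == '-' || c == '_') {
                continue;
            }
            key.append(Character.toLowerCase(c));
        }
        return key.toString();
    }

    // Map any spelling in the table to the short name, or return null.
    static String resolve(String name) {
        return ALIASES.get(looseKey(name));
    }

    public static void main(String[] args) {
        String shortName = resolve("letter-number");
        Pattern p = Pattern.compile("\\p{" + shortName + "}");
        // U+2160 ROMAN NUMERAL ONE has general category Nl.
        String romanOne = new String(Character.toChars(0x2160));
        System.out.println(shortName + " " + p.matcher(romanOne).matches());
    }
}
```

A preprocessing step like this could rewrite \p{...} names before the
pattern is compiled, but it treats the symptom; the recommendation is
for the regex engine itself to accept both spellings.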

The other strong recommendation is this one.

    Strong Recommendation #2:

    It is strongly recommended that there be a regular expression
    meta-character, such as "\R", for matching all line ending
    characters and sequences listed above (e.g. in #1). It would
    thus be shorthand for:

        ( \u000D\u000A | [\u000A\u000B\u000C\u000D\u0085\u2028\u2029] )
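
Until such a meta-character exists, the shorthand above can be spelled
out by hand.  The following is a minimal sketch; the class name and the
names LINE_ENDING and splitLines are hypothetical.  Note that the
backslashes are doubled so that the regex engine, rather than the Java
compiler, interprets the escapes.

```java
import java.util.regex.Pattern;

class LineEndings {
    // Hand-written equivalent of the recommended shorthand.  The CRLF
    // alternative comes first so the pair is consumed as one line ending.
    static final Pattern LINE_ENDING = Pattern.compile(
        "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])");

    // Split text on any of the listed line endings, keeping empty lines.
    static String[] splitLines(String text) {
        return LINE_ENDING.split(text, -1);
    }

    public static void main(String[] args) {
        char lineSeparator = (char) 0x2028;
        String text = "a\r\nb" + lineSeparator + "c\nd";
        for (String line : splitLines(text)) {
            System.out.println("[" + line + "]");
        }
    }
}
```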

For Java to legitimately claim Level 1 compliance with Unicode 6.0
according to tr18, it must at a bare minimum correct all the "no"
compliance categories to "yes".  Without that, the claim is false.

For Java to be *useful* for processing Unicode text, it should go
beyond these barest of minima.  A good starting point in that
direction would be to finally satisfy SR#1 and SR#2 above.

I would also like to see the two "???" matters cleared up, because
I believe the intention and pre-existing belief is that they *do*
work.  Or at least, that they should--bugs notwithstanding.

Java also claims it meets RL2.1 on Canonical Equivalents.  This
is another area where I believe the intention and pre-existing belief
are that it meets that requirement, but where edge-case bugs get
in the way of doing so.

I hope this finally answers your question about why I don't believe
Java's regexes meet Level 1 requirements, the minimal functionality
needed for handling Unicode text in regular expressions per tr18.

To end on a positive note, I am very much looking forward to \X
working for grapheme clusters, preferably for extended grapheme
clusters and not merely legacy ones.

--tom