<i18n dev> RL1.2 Compatibility Properties

Tom Christiansen tchrist at perl.com
Sun Jan 23 09:48:34 PST 2011


    RL1.2a 	Compatibility Properties

    To meet this requirement, an implementation shall provide the
    properties listed in Annex C. Compatibility Properties, with the
    property values as listed there. Such an implementation shall
    document whether it is using the Standard Recommendation or
    POSIX-compatible properties.

As previously discussed, Java's regexes fail to meet this
requirement because they use the exact names given by

    http://www.unicode.org/reports/tr18/#Compatibility_Properties

in ways quite unlike either of the only two allowable options,
the Standard Recommendation or POSIX-compatible properties.

If Java were to implement the actual Unicode definitions for things like
Whitespace, Alphabetic, Uppercase, and Lowercase, then it would be trivial
to use those in a way restricted to ASCII only.  But the reverse is not
true: starting with the ASCII-only versions, there is no way to augment
them to work for Unicode.

For example, ASCII-only versions could be crafted in this fashion:

    (?:(?=\p{ASCII})\p{Alphabetic})
    (?:(?=\p{ASCII})\p{Whitespace})
    (?:(?=\p{ASCII})\p{Uppercase})
    (?:(?=\p{ASCII})\p{Lowercase})
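
As a rough sketch of the lookahead trick, the following assumes a Java
release (7 or later) whose java.util.regex.Pattern accepts the binary
property spelling \p{IsAlphabetic}; that spelling is not the bare
\p{Alphabetic} used above. The class and method names are made up for
illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: narrows a full Unicode property to ASCII.
class AsciiRestrictedProperties {
    // Full Unicode Alphabetic property (assumes Java 7+ "Is" prefix syntax).
    static final Pattern UNICODE_ALPHA =
        Pattern.compile("\\p{IsAlphabetic}");

    // Same property, restricted to ASCII by a lookahead, as in the text.
    static final Pattern ASCII_ALPHA =
        Pattern.compile("(?:(?=\\p{ASCII})\\p{IsAlphabetic})");

    static boolean isUnicodeAlpha(String s) {
        return UNICODE_ALPHA.matcher(s).matches();
    }

    static boolean isAsciiAlpha(String s) {
        return ASCII_ALPHA.matcher(s).matches();
    }

    public static void main(String[] args) {
        // a, e-acute, Greek alpha, digit one
        String[] samples = {"a", "\u00e9", "\u03b1", "1"};
        for (String s : samples) {
            System.out.printf("U+%04X unicode=%b ascii=%b%n",
                s.codePointAt(0), isUnicodeAlpha(s), isAsciiAlpha(s));
        }
    }
}
```

The same lookahead prefix should work for the other three properties,
given whatever spelling the regex engine at hand uses for them.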

The problem is that there is no way to go the other way.  One cannot
similarly derive the non-ASCII sets starting with the ASCII-only sets.  

But the ASCII-only sets are all that Java gives us: it uses the same
names as Unicode does, yet means something completely different by them.

I would also note that Annex C covers the \w or Word property, which
is what first took me down this road.  For this there is no
POSIX Compatibility option given, only a Standard Recommendation,
which matches the set:

    \p{alpha}
    \p{gc=Mark}
    \p{digit}
    \p{gc=Connector_Punctuation}

As previously discussed, and unlike Java's version, the Unicode Alpha
property is strictly defined to match far more than Letter alone.
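
For illustration, the four-part set above could be spelled out by hand
as a character class. This is only a sketch: it assumes a Java release
(7 or later) that accepts \p{IsAlphabetic}, and it takes \p{digit} in
its Standard Recommendation sense of gc=Decimal_Number, written here
as \p{Nd}. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: a hand-built word class next to Java's default \w.
class StandardWordClass {
    // alpha + gc=Mark + gc=Decimal_Number + gc=Connector_Punctuation
    static final Pattern STANDARD_WORD =
        Pattern.compile("[\\p{IsAlphabetic}\\p{M}\\p{Nd}\\p{Pc}]");

    // Java's \w with no flags set, which covers only [a-zA-Z_0-9].
    static final Pattern DEFAULT_WORD = Pattern.compile("\\w");

    static boolean isStandardWord(String s) {
        return STANDARD_WORD.matcher(s).matches();
    }

    static boolean isDefaultWord(String s) {
        return DEFAULT_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        // e-acute, combining acute accent, Arabic-Indic digit three,
        // underscore, hyphen-minus
        String[] samples = {"\u00e9", "\u0301", "\u0663", "_", "-"};
        for (String s : samples) {
            System.out.printf("U+%04X standard=%b default=%b%n",
                s.codePointAt(0), isStandardWord(s), isDefaultWord(s));
        }
    }
}
```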

Java's ASCII-only version of \w is unusable for deriving the Standard
Recommendation for \w, yet if the Standard Recommendation were implemented,
the reverse would be trivial:

    (?:(?=\p{ASCII})\w)
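
A minimal sketch of that reverse direction, assuming a Java release (7
or later) that offers the Pattern.UNICODE_CHARACTER_CLASS flag, under
which \w follows the Unicode definition rather than the ASCII-only one.
The class and method names here are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: Unicode-aware \w cut back down to ASCII.
class AsciiWordFromUnicode {
    // \w under the Unicode-aware flag (assumes Java 7+).
    static final Pattern UNICODE_WORD =
        Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS);

    // The same \w, narrowed to ASCII by the lookahead from the text.
    static final Pattern ASCII_WORD =
        Pattern.compile("(?:(?=\\p{ASCII})\\w)",
                        Pattern.UNICODE_CHARACTER_CLASS);

    static boolean isUnicodeWord(String s) {
        return UNICODE_WORD.matcher(s).matches();
    }

    static boolean isAsciiWord(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        // a, underscore, digit seven, e-acute
        String[] samples = {"a", "_", "7", "\u00e9"};
        for (String s : samples) {
            System.out.printf("U+%04X unicode=%b ascii=%b%n",
                s.codePointAt(0), isUnicodeWord(s), isAsciiWord(s));
        }
    }
}
```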

This is (part of) why Unicode has selected the sorts of definitions
it has for its basic properties, and why a Level 1 conforming
platform must support the most basic set spelled out by RL1.2.
Once you have the definitions that Unicode requires, you can use
them as building blocks with which to craft higher-level
abstractions as need demands.

Without them, you're dead in the water, stuck in a "You can't get
there from here" situation.

--tom

