<i18n dev> RL1.2 Compatibility Properties
Tom Christiansen
tchrist at perl.com
Sun Jan 23 09:48:34 PST 2011
RL1.2a Compatibility Properties
To meet this requirement, an implementation shall provide the
properties listed in Annex C. Compatibility Properties, with the
property values as listed there. Such an implementation shall
document whether it is using the Standard Recommendation or
POSIX-compatible properties.
As previously discussed, Java's regexes fail to meet this
requirement because they use the exact names given by
http://www.unicode.org/reports/tr18/#Compatibility_Properties
in ways quite unlike either of the only two allowable options,
the Standard Recommendation or POSIX-compatible properties.
If Java were to implement the actual Unicode definitions for things like
Whitespace, Alphabetic, Uppercase, and Lowercase, then it would be trivial
to use those in a way restricted to ASCII only. But the reverse is not
true: starting with the ASCII-only versions, there is no way to augment
them to work for Unicode.
For example, ASCII-only versions could be crafted in this fashion:
(?:(?=\p{ASCII})\p{Alphabetic})
(?:(?=\p{ASCII})\p{Whitespace})
(?:(?=\p{ASCII})\p{Uppercase})
(?:(?=\p{ASCII})\p{Lowercase})
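To make the lookahead trick concrete, here is a minimal sketch. It
assumes a Java release (7 or later) that accepts the Is-prefixed
spelling \p{IsAlphabetic} for the Unicode binary property; that
spelling and the class name AsciiRestriction are assumptions made for
illustration, not something the Java discussed in this message provides.

```java
import java.util.regex.Pattern;

public class AsciiRestriction {
    // Full Unicode Alphabetic property (assumed Java 7+ spelling).
    static final Pattern UNICODE_ALPHA = Pattern.compile("\\p{IsAlphabetic}");

    // The same property narrowed to ASCII by a lookahead, as in the text.
    static final Pattern ASCII_ALPHA =
        Pattern.compile("(?:(?=\\p{ASCII})\\p{IsAlphabetic})");

    // True if the whole string matches the pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(matches(UNICODE_ALPHA, "a"));    // ASCII letter
        System.out.println(matches(UNICODE_ALPHA, eAcute)); // non-ASCII letter
        System.out.println(matches(ASCII_ALPHA, "a"));
        System.out.println(matches(ASCII_ALPHA, eAcute));   // blocked by lookahead
    }
}
```

The lookahead acts as a set intersection: a character has to satisfy
both \p{ASCII} and the Unicode property before it is consumed.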
The problem is that there is no way to go the other way. One cannot
similarly derive the non-ASCII sets starting with the ASCII-only sets.
But the ASCII-only sets are all that Java gives us, even though it uses the
same names as Unicode does while meaning something completely different.
I would also note that the \w or Word property is discussed here,
something that first took me down this road. For this there is no
POSIX Compatibility option given, only a Standard Recommendation,
which matches the set:
\p{alpha}
\p{gc=Mark}
\p{digit}
\p{gc=Connector_Punctuation}
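As a rough illustration of that set, the four pieces can be combined
into one character class. This is a sketch under stated assumptions: it
uses Java 7+ spellings (\p{IsAlphabetic} for alpha, \p{M} for gc=Mark,
\p{Nd} for digit, \p{Pc} for gc=Connector_Punctuation), and the class
name WordSet is hypothetical.

```java
import java.util.regex.Pattern;

public class WordSet {
    // Hand-built approximation of the Standard Recommendation for \w:
    // Alphabetic + Mark + Decimal_Number + Connector_Punctuation.
    static final Pattern WORD =
        Pattern.compile("[\\p{IsAlphabetic}\\p{M}\\p{Nd}\\p{Pc}]");

    // True if the single-character string is in the set.
    static boolean isWord(String s) {
        return WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWord("a"));      // ASCII letter
        System.out.println(isWord("\u00e9")); // non-ASCII letter
        System.out.println(isWord("\u0301")); // COMBINING ACUTE ACCENT, a Mark
        System.out.println(isWord("\u0663")); // ARABIC-INDIC DIGIT THREE
        System.out.println(isWord("_"));      // Connector_Punctuation
        System.out.println(isWord("-"));      // hyphen-minus, not in the set
    }
}
```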
As previously discussed, and unlike in Java, the Alpha property is
strictly defined by Unicode to match far more than Letter alone.
Java's ASCII-only version of \w is unusable for deriving the Standard
Recommendation for \w, yet if the Standard Recommendation were implemented,
the reverse would be trivial:
(?:(?=\p{ASCII})\w)
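A short sketch of that asymmetry follows. It assumes a later Java
(7 or newer) whose Pattern.UNICODE_CHARACTER_CLASS flag makes \w
Unicode-aware; that flag is an assumption beyond what this message
describes, and the class name WordRestriction is hypothetical.

```java
import java.util.regex.Pattern;

public class WordRestriction {
    // Default Java \w: ASCII letters, digits, and underscore only.
    static final Pattern DEFAULT_W = Pattern.compile("\\w");

    // Unicode-aware \w (assumes the Java 7+ flag exists).
    static final Pattern UNICODE_W =
        Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS);

    // Unicode-aware \w cut back down to ASCII with the lookahead.
    static final Pattern ASCII_W =
        Pattern.compile("(?:(?=\\p{ASCII})\\w)", Pattern.UNICODE_CHARACTER_CLASS);

    // True if the whole string matches the pattern.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(matches(DEFAULT_W, eAcute)); // ASCII-only \w
        System.out.println(matches(UNICODE_W, eAcute)); // Unicode \w
        System.out.println(matches(ASCII_W, eAcute));   // restricted again
        System.out.println(matches(ASCII_W, "a"));
    }
}
```

Going from the Unicode definition down to ASCII takes one lookahead;
nothing comparable turns DEFAULT_W into UNICODE_W.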
This is (part of) why Unicode has selected the sorts of definitions
it has for its basic properties, and why to be a Level 1 conforming
platform you must support the most basic set spelled out by RL1.2,
since once you have the definitions that Unicode requires that you
have, you can use these as building blocks with which to craft
higher-level abstractions as need demands.
Without them, you're dead in the water, stuck in a "You can't get
there from here" situation.
--tom