<i18n dev> RL1.2 Properties (part 2 of 2)
Tom Christiansen
tchrist at perl.com
Sun Jan 23 00:14:45 PST 2011
This message explains precisely how Java fails to provide
any way to access these four required properties from RL1.2:
Alphabetic
Lowercase
Uppercase
Whitespace
Since Java does not provide them *by any name*, and RL1.2 specifically
includes those four in its "To meet this requirement, an implementation
shall provide at least..." list, Java does not conform to RL1.2. Since
Java does not meet RL1.2, it therefore cannot be Level 1 conformant per
tr18, and so this claim from j.u.r.Pattern's javadoc is incorrect:
This class is in conformance with Level 1 of Unicode Technical
Standard #18: Unicode Regular Expression Guidelines, plus RL2.1
Canonical Equivalents.
============================================================
Sherman wrote:
> As regarding the POSIX properties. In Java RegEx Unicode
> Alphabetic, Lowercase or Whitespace properties are supported by
> using \p{javaLetter}, \p{javaLowerCase}, \p{javaUpperCase} or
> \p{javaWhitespace}.
That has certainly not been my experience. All the things you
say are the same are things that I believe are *not* the same.
============
Alphabetic
============
The Unicode Alphabetic property is not the same as its Letter
property. All \p{Letter} code points are \p{Alphabetic}, but not
all \p{Alphabetic} code points are \p{Letter}. According to
http://download.oracle.com/javase/7/docs/api/java/lang/Character.html#isLetter(int)
j.l.Character.isLetter() -- and therefore \p{javaLetter} -- is:
A character is considered to be a letter if its general
category type, provided by getType(codePoint), is any of
the following:
UPPERCASE_LETTER
LOWERCASE_LETTER
TITLECASE_LETTER
MODIFIER_LETTER
OTHER_LETTER
That is the same as \pL, the Unicode GC=Letter property. But
Unicode Alphabetic is *not* the same as Unicode Letter; rather
the Unicode Alphabetic property is defined by tr44 to be:
Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic
The last two are where javaLetter fails. It does not detect Letter_Number
code points (Nl), nor does it consider all the many Other_Alphabetic code
points. Other_Alphabetic is one of those contributory properties from the
UCD PropList.txt file that exists only to derive the Alphabetic
property. It includes various code points of general category Mn and Mc,
but there are many other Mn and Mc code points which are *not*
Other_Alphabetic.
As of Unicode 6.0, there are 811 code points in the Basic Multilingual
Plane (plane 0) which are \p{Alphabetic} but not \pL, by which I mean that
they have the Alphabetic property but lack the Letter property. There are
also 195 such code points up in the so-called "astral" planes (planes 1-16).
Consider this code point:
<Ⅰ> U+2160 ROMAN NUMERAL ONE
In Java, you will find that the string "\u2160" (which is a Nl or
Letter_Number code point) fails to match the pattern \pL, which is correct,
but also fails to match the property \p{javaLetter}. If javaLetter were
truly the Unicode Alphabetic property, it would succeed, but since it fails,
those are not the same. Therefore you cannot say that the Unicode
Alphabetic property is the same as the javaLetter property; they are
different things.
To demonstrate how it should work using U+2160, ROMAN NUMERAL ONE:
$ perl -le 'print chr(0x2160) =~ /\pL/ || 0'
0
$ perl -le 'print chr(0x2160) =~ /\p{Alphabetic}/ || 0'
1
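The Java side of the same check can be sketched as below. This is only an
illustration: the class name RomanNumeralCheck and its matches() helper are
made up for this message, and it uses nothing beyond j.u.r.Pattern and
j.l.Character. Both patterns should report false for U+2160, because
isLetter() consults only the five letter categories.

```java
import java.util.regex.Pattern;

final class RomanNumeralCheck {
    // U+2160 ROMAN NUMERAL ONE, general category Nl (Letter_Number).
    static final int ROMAN_ONE = 0x2160;

    // Hypothetical helper: does the whole one-code-point string match the regex?
    static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        System.out.println("type is LETTER_NUMBER: "
            + (Character.getType(ROMAN_ONE) == Character.LETTER_NUMBER));
        System.out.println("\\pL matches: " + matches("\\pL", ROMAN_ONE));
        System.out.println("\\p{javaLetter} matches: "
            + matches("\\p{javaLetter}", ROMAN_ONE));
    }
}
```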
=========================
Uppercase and Lowercase
=========================
The Unicode Lowercase property is not the same as its Ll (Lowercase_Letter)
property. Again, although all \p{Ll} code points are \p{Lowercase}, not
all \p{Lowercase} code points are also \p{Ll}. As with Alphabetic, we have
others to consider:
Lowercase = Lowercase_Letter + Other_Lowercase
Uppercase = Uppercase_Letter + Other_Uppercase
Specifically, there are 159 \p{Lowercase} code points in the BMP which
are not also \p{Ll}. The same situation occurs with Lu (Uppercase_Letter)
versus Uppercase: there are 42 BMP code points which are \p{Uppercase} but
which are not \p{Lu}.
Testing in Java with "\u2160", the \p{javaUpperCase} property
fails to match, but ought to if it is meant to represent the
Unicode Uppercase property. Therefore they are not the same.
Again demonstrating with U+2160, ROMAN NUMERAL ONE:
$ perl -le 'print chr(0x2160) =~ /\p{Lu}/ || 0'
0
$ perl -le 'print chr(0x2160) =~ /\p{Uppercase}/ || 0'
1
These things *do* matter. tr18 requires that Lowercase and Uppercase must
be supported for Level 1 conformance. Moreover, tr44 specifically tells us
that these are *not* to be considered second-class citizens among Unicode
character properties, according to:
http://www.unicode.org/reports/tr44/tr44-4.html#Properties
Derived character properties are not considered second-class citizens
among Unicode character properties. They are defined to make
implementation of important algorithms easier to state. Included among
the first-class derived properties important for such implementations
are: Uppercase, Lowercase, XID_Start, XID_Continue, Math, and
Default_Ignorable_Code_Point, all defined in DerivedCoreProperties.txt,
as well as derived properties for the optimization of normalization,
defined in DerivedNormalizationProps.txt.
BTW, I don't believe that Java supports the Unicode casing
aliases, which would let you use \p{LC} or \p{L&} as a convenient
shorthand for [\p{Lu}\p{Lt}\p{Ll}]. I don't know why it doesn't,
but it would be nice to see them supported.
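Until such an alias exists, the union can be spelled out by hand. The
following is a minimal sketch, assuming only the general-category names
Lu, Ll, and Lt that j.u.r.Pattern already accepts; the class CasedLetter
and its isCasedLetter() method are hypothetical names for illustration.

```java
import java.util.regex.Pattern;

final class CasedLetter {
    // Explicit union of Lu, Ll and Lt, standing in for the LC / L& alias.
    static final Pattern CASED = Pattern.compile("[\\p{Lu}\\p{Ll}\\p{Lt}]");

    static boolean isCasedLetter(int codePoint) {
        return CASED.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // U+01C5 is a titlecase letter (Lt); U+2160 is a letter number (Nl).
        int[] samples = {'A', 'a', 0x01C5, 0x2160, '1'};
        for (int cp : samples) {
            System.out.printf("U+%04X cased letter: %b%n", cp, isCasedLetter(cp));
        }
    }
}
```

Note that this covers only the general categories, so it still excludes
code points such as U+2160 that are \p{Uppercase} without being \p{Lu}.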
============
Whitespace
============
The Unicode Whitespace property is not the same as Java's
\p{javaWhitespace} property, contrary to your assertion. Unicode defines 25 code
points as having the Whitespace property. Of these 25, \p{javaWhitespace}
fails to correctly match code points U+85, U+A0, U+2007, and U+202F.
Therefore, one cannot use \p{javaWhitespace} to detect Unicode whitespace.
It does not *matter* that it is documented not to match those in Java.
Because Unicode documents its Whitespace property to indeed match those,
javaWhitespace and Unicode Whitespace are not the same thing.
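The four misses are easy to confirm. Here is a small sketch that runs each
of the code points named above through \p{javaWhitespace}; the class and
method names are invented for this example, and every line of output
should read false.

```java
import java.util.regex.Pattern;

final class JavaWhitespaceCheck {
    // The four code points named above: NEL, NBSP, FIGURE SPACE, NNBSP.
    static final int[] NAMED_ABOVE = {0x85, 0xA0, 0x2007, 0x202F};

    static final Pattern JAVA_WS = Pattern.compile("\\p{javaWhitespace}");

    static boolean matchesJavaWhitespace(int codePoint) {
        return JAVA_WS.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        for (int cp : NAMED_ABOVE) {
            System.out.printf("U+%04X javaWhitespace: %b%n",
                cp, matchesJavaWhitespace(cp));
        }
    }
}
```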
This came up at work. We had Java code that thought to use the Java
whitespace property when tokenizing Unicode plain text. The corpus, the
PubMed Central Open Access set, is just *full* of U+A0, NON-BREAK SPACE.
This needs to be treated as the whitespace that Unicode says it is.
We were getting wrong answers until we threw out the Java whitespace
definition and used the Unicode one.
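To make the failure concrete, here is a minimal sketch of that kind of
tokenizer. It is not the code we ran: the class NbspTokenizer, the sample
string, and the WIDENED separator are all assumptions made for
illustration. WIDENED simply adds the four code points listed earlier to
javaWhitespace, which is a stopgap and not an exact rendering of the
Unicode Whitespace property.

```java
final class NbspTokenizer {
    // Separator built on Java's own notion of whitespace.
    static final String JAVA_WS = "\\p{javaWhitespace}+";

    // Stopgap assumption: javaWhitespace plus the four code points it misses.
    static final String WIDENED =
        "[\\p{javaWhitespace}\\u0085\\u00A0\\u2007\\u202F]+";

    static String[] tokenize(String text, String separatorRegex) {
        return text.split(separatorRegex);
    }

    public static void main(String[] args) {
        // "alpha" and "beta" are separated by a NO-BREAK SPACE.
        String text = "alpha\u00A0beta gamma";
        System.out.println(tokenize(text, JAVA_WS).length);  // alpha+beta stay glued
        System.out.println(tokenize(text, WIDENED).length);  // split into three
    }
}
```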
======================
Namespace Collisions
======================
Sherman wrote:
> The \p{Lower/Upper/ASCII/Alpha...}, as noticed, are clearly
> specified by the Java RegEx specification[1] that are for
> US_ASCII only
Although I don't mind that ASCII should work only on ASCII,
for the others, there is a big problem. If you look in the
list of official property aliases:
http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt
You will see that those all have defined meanings in Unicode:
Alpha ; Alphabetic
Lower ; Lowercase
Upper ; Uppercase
Those are completely official names. The Unicode Alpha property is by
definition identical to its Alphabetic property, its Lower the same as
its Lowercase, and its Upper the same as its Uppercase. As already explained,
these are in turn different from \pL, \p{Ll}, and \p{Lu}.
It is highly regrettable that you have used ASCII-only definitions
for those, but that is not what they are. And you cannot even get
away with claiming that you are using POSIX compatible versions
detailed in
http://www.unicode.org/reports/tr18/#Compatibility_Properties
The following are recommended assignments for compatibility property
names, for use in Regular Expressions. There are two alternatives: the
Standard Recommendation and the POSIX Compatible versions. Applications
should use the former wherever possible. The latter is modified to meet
the formal requirements of [POSIX], and also to maintain (as much as
possible) compatibility with the POSIX usage in practice.
You cannot fall back on the POSIX Compatible versions here, because for
these names they offer no non-Unicode alternative.
While we're on it, that says that the Unicode Space property must be
equivalent to the Unicode Whitespace property, which we have shown is
different from the javaWhitespace property.
So Java is using names that Unicode defines, in ways that are
completely different from what Unicode says those names
must all mean. I find that particularly wicked.
================
Loose Matching
================
By the way, I find it very counterintuitive that I cannot use
javaWhiteSpace for javaWhitespace but must use javaLowerCase not
javaLowercase. Both sets should be allowed according to the
"strongly recommended" practice of loose matching of property names.
tr18 section 1.2 gives this "strong recommendation":
The recommended names for UCD properties and property values are in
PropertyAliases.txt [Prop] and PropertyValueAliases.txt [PropValue].
There are both abbreviated names and longer, more descriptive names.
It is strongly recommended that both names be recognized, and that loose
matching of property names be used, whereby the case distinctions,
whitespace, hyphens, and underbar are ignored.
And under RL1.2 it again reads:
Note: Because it is recommended that the property syntax be lenient
as to spaces, casing, hyphens and underbars, any of the
following should be equivalent: \p{Lu}, \p{lu}, \p{uppercase
letter}, \p{Uppercase Letter}, \p{Uppercase_Letter}, and
\p{uppercaseletter}
This is explained in more detail in 5.7 Matching Rules from
tr44, which reads in part...
http://www.unicode.org/reports/tr44/tr44-4.html#Matching_Rules
When matching Unicode character property names and values, it is strongly
recommended that all Property and Property Value Aliases be recognized. For
best results in matching, rather than using exact binary comparisons, the
following loose matching rules should be observed.
[...]
Property aliases and property value aliases are symbolic values. When
comparing them, use loose matching rule UAX44-LM3.
UAX44-LM3. Ignore case, whitespace, underscore ('_'), and hyphens.
* "linebreak" is equivalent to "Line_Break" or "Line-break"
* "lb=BA" is equivalent to "lb=ba" or "LB=BA"
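Rule UAX44-LM3 as quoted above is cheap to implement. The following is a
rough sketch, assuming only the four things the quoted rule says to ignore
(case, whitespace, underscore, hyphen); the class LooseMatch and its
methods are hypothetical and are not part of any existing Java API.

```java
import java.util.Locale;

final class LooseMatch {
    // Reduce a property name or value to its loose-matching key:
    // drop whitespace, '_' and '-', then fold case.
    static String normalize(String name) {
        StringBuilder key = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '_' || c == '-' || Character.isWhitespace(c)) {
                continue;
            }
            key.append(c);
        }
        return key.toString().toLowerCase(Locale.ROOT);
    }

    static boolean looseEquals(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        System.out.println(looseEquals("linebreak", "Line_Break"));
        System.out.println(looseEquals("Line-break", "Line_Break"));
        System.out.println(looseEquals("lb=BA", "lb=ba"));
        System.out.println(looseEquals("Uppercase Letter", "uppercaseletter"));
    }
}
```

A property lookup table keyed on normalize(name) would then accept both
javaWhiteSpace and javaWhitespace, or both javaLowerCase and javaLowercase.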
I'm sorry this is so long, but I didn't want to break it up since
it is all closely related.
Thank you for all your hard work!
--tom