<i18n dev> RL1.2 Properties (part 2 of 2)

Sun Jan 23 00:14:45 PST 2011

This message explains precisely how Java fails to provide
any way to access these four required properties from RL1.2:

    Alphabetic
    Lowercase
    Uppercase
    Whitespace

Since Java does not provide them *by any name*, and RL1.2 specifically
includes those four in its "To meet this requirement, an implementation
shall provide at least..." list, Java does not conform to RL1.2.  Since
Java does not meet RL1.2, it therefore cannot be Level 1 conformant per
tr18, and so this claim from j.u.r.Pattern's javadoc is incorrect:

    This class is in conformance with Level 1 of Unicode Technical
    Standard #18: Unicode Regular Expression Guidelines, plus RL2.1
    Canonical Equivalents.

============================================================

Sherman wrote:

> As regarding the POSIX properties. In Java RegEx Unicode
> Alphabetic, Lowercase or Whitespace properties are supported by
> using \p{javaLetter},  \p{javaLowerCase}, \p{javaUpperCase} or
> \p{javaWhitespace}.

That has certainly not been my experience.  All the things you
say are the same are things that I believe are *not* the same.

============
 Alphabetic
============

The Unicode Alphabetic property is not the same as its Letter
property. All \p{Letter} code points are \p{Alphabetic}, but not
all \p{Alphabetic} code points are \p{Letter}.  According to

    http://download.oracle.com/javase/7/docs/api/java/lang/Character.html#isLetter(int)

j.l.Character.isLetter() -- and therefore \p{javaLetter} -- is:

    A character is considered to be a letter if its general
    category type, provided by getType(codePoint), is any of
    the following:

        UPPERCASE_LETTER
        LOWERCASE_LETTER
        TITLECASE_LETTER
        MODIFIER_LETTER
        OTHER_LETTER

That is the same as \pL, the Unicode GC=Letter property.  But
Unicode Alphabetic is *not* the same as Unicode Letter; rather
the Unicode Alphabetic property is defined by tr44 to be:

    Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic

The last two are where javaLetter fails.  It does not detect Letter_Number
code points (Nl), nor does it consider any of the many Other_Alphabetic
code points.  Other_Alphabetic is one of those contributory Unicode
properties from the UCD PropList.txt file that exist solely to generate the
Alphabetic property.  It includes various code points of general category
Mn and Mc, but there are many other Mn and Mc code points which are *not*
Other_Alphabetic.

As of Unicode 6.0, there are 811 code points in the Basic Multilingual
Plane (plane 0) which are \p{Alphabetic} but not \pL, by which I mean that
they have the Alphabetic property but lack the Letter property.  There are
also 195 such code points up in the so-called "astral" planes (planes 1-16).

Consider this code point:

    <Ⅰ>  U+2160  ROMAN NUMERAL ONE

In Java, you will find that the string "\u2160" (which is a Nl or
Letter_Number code point) fails to match the pattern \pL, which is correct,
but also fails to match the property \p{javaLetter}.  If javaLetter were
truly the Unicode Alphabetic property, it would succeed, but since it fails,
those are not the same.  Therefore you cannot say that the Unicode
Alphabetic property is the same as the javaLetter property; they are
different things.

To demonstrate how it should work using U+2160, ROMAN NUMERAL ONE:

    $ perl -le 'print chr(0x2160) =~ /\pL/ || 0'
    0

    $ perl -le 'print chr(0x2160) =~ /\p{Alphabetic}/ || 0'
    1
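
For comparison, here is a minimal Java sketch of the same probe.  It is
only an illustration: the class name and the isLetterOrLetterNumber helper
are hypothetical, not part of the JDK, and the helper covers only the
general-category part of Alphabetic (Lu + Ll + Lt + Lm + Lo + Nl), not
Other_Alphabetic.

```java
import java.util.regex.Pattern;

final class RomanNumeralOneDemo {
    // U+2160 ROMAN NUMERAL ONE, general category Nl (Letter_Number).
    static final int ROMAN_ONE = 0x2160;

    // True if the single code point matches the given property pattern.
    static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    // Hypothetical helper (not a JDK method): the general-category part of
    // Alphabetic only, Lu + Ll + Lt + Lm + Lo + Nl.  Other_Alphabetic is
    // NOT covered, so this is still narrower than the Alphabetic property.
    static boolean isLetterOrLetterNumber(int codePoint) {
        return Character.isLetter(codePoint)
            || Character.getType(codePoint) == Character.LETTER_NUMBER;
    }

    public static void main(String[] args) {
        System.out.println("\\pL            -> " + matches("\\pL", ROMAN_ONE));
        System.out.println("\\p{javaLetter} -> " + matches("\\p{javaLetter}", ROMAN_ONE));
        System.out.println("\\p{Nl}         -> " + matches("\\p{Nl}", ROMAN_ONE));
        System.out.println("helper         -> " + isLetterOrLetterNumber(ROMAN_ONE));
    }
}
```

The first two lines should report false and the last two true, since
U+2160 is a Letter_Number and not a Letter.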

=========================
 Uppercase and Lowercase
=========================

The Unicode Lowercase property is not the same as its Ll (Lowercase_Letter)
property.  Again, although all \p{Ll} code points are \p{Lowercase}, not
all \p{Lowercase} code points are also \p{Ll}.  As with Alphabetic, we have
others to consider:

    Lowercase = Lowercase_Letter + Other_Lowercase
    Uppercase = Uppercase_Letter + Other_Uppercase

Specifically, there are 159 \p{Lowercase} code points in the BMP which
are not also \p{Ll}.  The same situation occurs with Lu (Uppercase_Letter)
versus Uppercase: there are 42 BMP code points which are \p{Uppercase} but
which are not \p{Lu}.

Testing in Java with "\u2160", the \p{javaUpperCase} property
fails to match, but ought to if it is meant to represent the
Unicode Uppercase property.  Therefore they are not the same.

Again demonstrating with U+2160, ROMAN NUMERAL ONE:

    $ perl -le 'print chr(0x2160) =~ /\p{Lu}/ || 0'
    0

    $ perl -le 'print chr(0x2160) =~ /\p{Uppercase}/ || 0'
    1
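
A small Java probe for the same code point might look like the sketch
below.  The class name is hypothetical.  The \p{Lu} result follows from
U+2160 being Nl; the \p{javaUpperCase} result is only printed, not assumed,
because it depends on how the running JDK defines Character.isUpperCase.

```java
import java.util.regex.Pattern;

final class UppercaseProbe {
    // U+2160 ROMAN NUMERAL ONE as a one-code-point string.
    static final String ROMAN_ONE = new String(Character.toChars(0x2160));

    // True if U+2160 matches the given property pattern.
    static boolean matches(String regex) {
        return Pattern.matches(regex, ROMAN_ONE);
    }

    public static void main(String[] args) {
        // General category of U+2160: LETTER_NUMBER (Nl), not UPPERCASE_LETTER.
        System.out.println("is Nl              -> "
            + (Character.getType(0x2160) == Character.LETTER_NUMBER));
        System.out.println("\\p{Lu}             -> " + matches("\\p{Lu}"));
        // Result depends on the JDK in use; compare it with the Unicode
        // Uppercase answer for this code point, which is true.
        System.out.println("\\p{javaUpperCase}  -> " + matches("\\p{javaUpperCase}"));
    }
}
```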

These things *do* matter.  tr18 requires that Lowercase and Uppercase must
be supported for Level 1 conformance.  Moreover, tr44 specifically tells us
that these are *not* to be considered second-class citizens among Unicode
character properties, according to:

    http://www.unicode.org/reports/tr44/tr44-4.html#Properties

    Derived character properties are not considered second-class citizens
    among Unicode character properties. They are defined to make
    implementation of important algorithms easier to state. Included among
    the first-class derived properties important for such implementations
    are: Uppercase, Lowercase, XID_Start, XID_Continue, Math, and
    Default_Ignorable_Code_Point, all defined in DerivedCoreProperties.txt,
    as well as derived properties for the optimization of normalization,
    defined in DerivedNormalizationProps.txt.

BTW, I don't believe that Java supports the Unicode casing
aliases, which would let you use \p{LC} or \p{L&} as a convenient
shorthand for [\p{Lu}\p{Ll}\p{Lt}].  I don't know why it doesn't,
but it would be nice to see them supported.
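
Where the alias is unavailable, the cased-letter group can be spelled out
by hand.  This is just a sketch with hypothetical names (CasedLetterClass,
isCasedLetter); it uses only the Lu, Ll, and Lt general categories.

```java
import java.util.regex.Pattern;

final class CasedLetterClass {
    // Explicit spelling of the cased-letter group: Lu + Ll + Lt.
    static final Pattern CASED_LETTER = Pattern.compile("[\\p{Lu}\\p{Ll}\\p{Lt}]");

    // True if the code point is an uppercase, lowercase, or titlecase letter.
    static boolean isCasedLetter(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return CASED_LETTER.matcher(s).matches();
    }

    public static void main(String[] args) {
        int[] samples = {
            'A',     // Lu
            'a',     // Ll
            0x01C5,  // Lt: LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
            0x02B0,  // Lm: MODIFIER LETTER SMALL H
            0x2160   // Nl: ROMAN NUMERAL ONE
        };
        for (int cp : samples) {
            System.out.println(String.format("U+%04X -> %b", cp, isCasedLetter(cp)));
        }
    }
}
```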

============
 Whitespace
============

The Unicode Whitespace property is not the same as Java's
\p{javaWhitespace} property, contrary to your assertion.  Unicode defines 25 code
points as having the Whitespace property.  Of these 25, \p{javaWhitespace}
fails to correctly match code points U+85, U+A0, U+2007, and U+202F.
Therefore, one cannot use \p{javaWhitespace} to detect Unicode whitespace.

It does not *matter* that it is documented not to match those in Java.
Because Unicode documents its Whitespace property to indeed match those,
javaWhitespace and Unicode Whitespace are not the same thing.
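
The four misses are easy to check from Java.  In the sketch below, the
isUnicodeWhiteSpace helper is hypothetical: it is a hand-transcribed list
of 25 White_Space code points from a recent PropList.txt, which is an
assumption you should verify against the Unicode version you target.

```java
final class UnicodeWhiteSpace {
    // The four code points named above.
    static final int[] MISSED = {0x0085, 0x00A0, 0x2007, 0x202F};

    // Hypothetical helper: White_Space transcribed by hand from PropList.txt
    // (25 code points).  Check it against your Unicode version.
    static boolean isUnicodeWhiteSpace(int cp) {
        return (cp >= 0x0009 && cp <= 0x000D)
            || cp == 0x0020 || cp == 0x0085 || cp == 0x00A0
            || cp == 0x1680
            || (cp >= 0x2000 && cp <= 0x200A)
            || cp == 0x2028 || cp == 0x2029
            || cp == 0x202F || cp == 0x205F || cp == 0x3000;
    }

    public static void main(String[] args) {
        for (int cp : MISSED) {
            System.out.println(String.format(
                "U+%04X  Character.isWhitespace=%b  isUnicodeWhiteSpace=%b",
                cp, Character.isWhitespace(cp), isUnicodeWhiteSpace(cp)));
        }
    }
}
```

For each of the four, Character.isWhitespace reports false while the
hand-written helper reports true.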

This came up at work.  We had Java code that thought to use the Java
whitespace property when tokenizing Unicode plain text.  The corpus, the
PubMed Central Open Access set, is just *full* of U+A0, NON-BREAK SPACE.
This needs to be treated as the whitespace that Unicode says it is.
We were getting wrong answers until we threw out the Java whitespace
definition and used the Unicode one.

======================
 Namespace Collisions
======================

Sherman wrote:

> The \p{Lower/Upper/ASCII/Alpha...}, as noticed, are clearly
> specified by the Java RegEx specification[1] that are for
> US_ASCII only

Although I don't mind that ASCII should work only on ASCII,
for the others, there is a big problem.  If you look in the
list of official property aliases:

  http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt

You will see that those all have defined meanings in Unicode:

    Alpha     ; Alphabetic
    Lower     ; Lowercase
    Upper     ; Uppercase

Those are completely official names.  The Unicode Alpha property is by
definition identical to its Alphabetic property, its Lower the same as
its Lowercase, and its Upper the same as its Uppercase.  As already
explained, these are in turn different from \pL, \p{Ll}, and \p{Lu}.
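
A quick way to see the collision is to try those names on non-ASCII
letters.  The following is only a sketch (the class name is hypothetical),
and it assumes a Pattern compiled with default flags, where the POSIX-style
names are documented as US-ASCII only.

```java
import java.util.regex.Pattern;

final class PosixNameProbe {
    // True if the single code point matches the given property pattern,
    // compiled with default flags.
    static boolean matches(String regex, int codePoint) {
        return Pattern.matches(regex, new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        int lowerEAcute = 0x00E9;  // LATIN SMALL LETTER E WITH ACUTE (Ll)
        int upperEAcute = 0x00C9;  // LATIN CAPITAL LETTER E WITH ACUTE (Lu)

        System.out.println("\\p{Lower} on U+00E9 -> " + matches("\\p{Lower}", lowerEAcute));
        System.out.println("\\p{Ll}    on U+00E9 -> " + matches("\\p{Ll}", lowerEAcute));
        System.out.println("\\p{Upper} on U+00C9 -> " + matches("\\p{Upper}", upperEAcute));
        System.out.println("\\p{Lu}    on U+00C9 -> " + matches("\\p{Lu}", upperEAcute));
        System.out.println("\\p{Alpha} on U+00E9 -> " + matches("\\p{Alpha}", lowerEAcute));
        System.out.println("\\pL       on U+00E9 -> " + matches("\\pL", lowerEAcute));
    }
}
```

With default flags the Lower, Upper, and Alpha lines report false, while
the general-category lines report true for the same code points.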

It is highly regrettable that you have used ASCII-only definitions
for those, but that is not what they are.  And you cannot even get
away with claiming that you are using POSIX compatible versions
detailed in

    http://www.unicode.org/reports/tr18/#Compatibility_Properties

    The following are recommended assignments for compatibility property
    names, for use in Regular Expressions. There are two alternatives: the
    Standard Recommendation and the POSIX Compatible versions. Applications
    should use the former wherever possible. The latter is modified to meet
    the formal requirements of [POSIX], and also to maintain (as much as
    possible) compatibility with the POSIX usage in practice.

You cannot fall back on the POSIX Compatible versions here, because for
these properties there is no non-Unicode alternative to fall back on.

While we're on it, that says that the Unicode Space property must be
equivalent to the Unicode Whitespace property, which we have shown is
different from the javaWhitespace property.

So Java is using names that Unicode defines in ways that are
completely different from what Unicode says those names
must all mean.  I find that particularly wicked.

================
 Loose Matching
================

By the way, I find it very counterintuitive that I cannot use
javaWhiteSpace for javaWhitespace but must use javaLowerCase not
javaLowercase.  Both sets should be allowed according to the
"strongly recommended" practice of loose matching of property names.
tr18 section 1.2 gives this "strong recommendation":

    The recommended names for UCD properties and property values are in
    PropertyAliases.txt [Prop] and PropertyValueAliases.txt [PropValue].
    There are both abbreviated names and longer, more descriptive names.
    It is strongly recommended that both names be recognized, and that loose
    matching of property names be used, whereby the case distinctions,
    whitespace, hyphens, and underbar are ignored.

And under RL1.2 it again reads:

    Note: Because it is recommended that the property syntax be lenient
          as to spaces, casing, hyphens and underbars, any of the
          following should be equivalent: \p{Lu}, \p{lu}, \p{uppercase
          letter}, \p{Uppercase Letter}, \p{Uppercase_Letter}, and
          \p{uppercaseletter}

This is explained in more detail in 5.7 Matching Rules from
tr44, which reads in part...

     http://www.unicode.org/reports/tr44/tr44-4.html#Matching_Rules

    When matching Unicode character property names and values, it is strongly
    recommended that all Property and Property Value Aliases be recognized. For
    best results in matching, rather than using exact binary comparisons, the
    following loose matching rules should be observed.

    [...]

    Property aliases and property value aliases are symbolic values. When
    comparing them, use loose matching rule UAX44-LM3.

    UAX44-LM3. Ignore case, whitespace, underscore ('_'), and hyphens.

        * "linebreak" is equivalent to "Line_Break" or "Line-break"
        * "lb=BA" is equivalent to "lb=ba" or "LB=BA"
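
Rule UAX44-LM3 as quoted above is small enough to sketch directly.  The
names below (LooseMatch, looseKey, looselyEqual) are hypothetical, and the
sketch implements only what the quoted rule says: ignore case, whitespace,
underscore, and hyphens.

```java
import java.util.Locale;

final class LooseMatch {
    // Reduce a property name or value to its loose-matching key by dropping
    // whitespace, '_' and '-', then lowercasing.
    static String looseKey(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '_' || c == '-' || Character.isWhitespace(c)) {
                continue;
            }
            sb.append(c);
        }
        // Locale.ROOT keeps the lowercasing independent of the default locale.
        return sb.toString().toLowerCase(Locale.ROOT);
    }

    // True if the two names are equivalent under the quoted rule.
    static boolean looselyEqual(String a, String b) {
        return looseKey(a).equals(looseKey(b));
    }

    public static void main(String[] args) {
        System.out.println(looselyEqual("linebreak", "Line_Break"));          // true
        System.out.println(looselyEqual("lb=BA", "LB=BA"));                   // true
        System.out.println(looselyEqual("javaWhiteSpace", "javaWhitespace")); // true
        System.out.println(looselyEqual("Lu", "Ll"));                         // false
    }
}
```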

I'm sorry this is so long, but I didn't want to break it up since
it is all closely related.

Thank you for all your hard work!

--tom