<i18n dev> RL1.2 Properties (part 1 of 2)

Tom Christiansen tchrist at perl.com
Sat Jan 22 22:22:55 PST 2011


Java does not meet the requirements of RL1.2.  It provides only 3 of the 11
required properties; 4 it omits altogether, while 4 others it implements in
a fashion contrary to the standard.  Java also neglects the strongly
recommended aspects of this section, which is quite a pity.

From tr18:

    RL1.2       Properties

    To meet this requirement, an implementation shall provide at 
    least a minimal list of properties, consisting of the following:

        General_Category
        Script
        Alphabetic
        Uppercase
        Lowercase
        White_Space
        Noncharacter_Code_Point
        Default_Ignorable_Code_Point
        ANY
        ASCII
        ASSIGNED

Of those listed above as *shall provide*, Java indeed provides
these three required properties from that minimum set:

    + The ASCII property.

    + The General_Categories like \p{Lu}, although only in their
      short forms; it does not provide the long forms.

    + The Script categories like \p{Greek}, a very *VERY* 
      welcome addition for Unicode 6.0.
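
A quick way to check which spellings a given JDK build accepts is to
compile each candidate pattern and see whether it is rejected.  The
sketch below assumes that an unknown property name is reported as a
PatternSyntaxException at compile time; the class name PropertyProbe
and the particular spellings probed (including \p{IsGreek} for the
script) are illustrative assumptions, and the results will vary with
the JDK version you run it on.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class PropertyProbe {
    // Returns true if java.util.regex accepts the pattern at compile time.
    static boolean accepts(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Spellings to probe; which ones succeed depends on the JDK in use.
        String[] probes = {
            "\\p{ASCII}",
            "\\p{Lu}",
            "\\p{Uppercase_Letter}",   // long form of Lu
            "\\p{Greek}",
            "\\p{IsGreek}"
        };
        for (String p : probes) {
            System.out.println(p + " -> " + (accepts(p) ? "accepted" : "rejected"));
        }
    }
}
```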

Java does not provide these four required properties:

    - Noncharacter_Code_Point
    - Default_Ignorable_Code_Point
    - ANY
    - ASSIGNED
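
Three of these four can at least be approximated in plain Java while
the regex engine lacks them.  The following is only a sketch, with the
class and method names (MissingProperties, isNoncharacter, and so on)
invented for illustration.  It treats ASSIGNED as "General_Category is
not Cn", which is how tr18 describes it, and relies on whatever Unicode
version the running JDK's Character tables carry.
Default_Ignorable_Code_Point is left out because it has no simple
formula; it has to come from DerivedCoreProperties.txt in the UCD.

```java
class MissingProperties {
    // Noncharacter_Code_Point: U+FDD0..U+FDEF, plus the last two
    // code points of each of the 17 planes (U+xxFFFE and U+xxFFFF).
    static boolean isNoncharacter(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            return false;
        }
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // ASSIGNED: every code point whose General_Category is not Cn,
    // according to the Unicode data of the running JDK.
    static boolean isAssigned(int cp) {
        return Character.isValidCodePoint(cp)
            && Character.getType(cp) != Character.UNASSIGNED;
    }

    // ANY: every code point from U+0000 through U+10FFFF.
    static boolean isAny(int cp) {
        return Character.isValidCodePoint(cp);
    }
}
```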

The worst part is that Java gives non-Unicode meanings to 
these four Unicode properties (I'll give details on these
lapses in a separate message):

    * Alphabetic
    * Uppercase
    * Lowercase
    * White_Space

I would like to see everything given above addressed, and I do
not understand how you can claim Level 1 conformance without
doing so.

There are also "strongly recommended" things that you do not
implement, like loose matching of property names.  That would
not cost you much, I feel.
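
To show how little is involved, here is a minimal sketch of loose
matching as I read rule UAX44-LM3 in tr44: ignore case, whitespace,
underscores, hyphens, and any initial "is" prefix.  The class and
method names are made up for the example, and this is not how any
particular regex engine does it.  The rule must be applied to both the
name the user typed and the names in the property table, so that a
name which itself begins with "is" is still compared consistently.

```java
import java.util.Locale;

class LooseMatch {
    // Reduce a property name to its loose-matching key.
    static String normalize(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            // Drop underscores, hyphens, and whitespace.
            if (c == '_' || c == '-' || Character.isWhitespace(c)) {
                continue;
            }
            sb.append(c);
        }
        String key = sb.toString().toLowerCase(Locale.ROOT);
        // Drop one initial "is" prefix, if present.
        return key.startsWith("is") ? key.substring(2) : key;
    }

    // Two spellings name the same property if their keys are equal.
    static boolean sameProperty(String a, String b) {
        return normalize(a).equals(normalize(b));
    }
}
```

With that in place, "White_Space", "whitespace", and "Is White-Space"
all reduce to the same key, so a table lookup on the key accepts any
of them.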

tr18's section 1.2 also lists several "recommended" properties,
not all of which are binary.  

Properties that are not absolutely required for compliance with
RL1.2, but which I find especially useful, include these binary
properties:

    \p{Dash}
    \p{Quotation_Mark}
    \p{Diacritic}
    \p{Math}

If you are going to do \X for extended grapheme clusters instead
of legacy grapheme clusters, then you will need access to Hangul
Syllable Types, which is not a binary property.
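
For illustration, Hangul_Syllable_Type takes one of six values rather
than true or false.  The sketch below hard-codes the ranges as I
understand them from HangulSyllableType.txt in the Unicode 6.0 UCD and
derives LV versus LVT for precomposed syllables arithmetically (there
are 28 trailing-consonant slots per LV pair, the first being "no
trailing consonant").  The class and enum names are invented here, and
the ranges should be checked against the data file before being relied
upon.

```java
class HangulSyllableType {
    enum Hst { L, V, T, LV, LVT, NA }

    static Hst of(int cp) {
        // Leading consonants (choseong), including Jamo Extended-A.
        if ((cp >= 0x1100 && cp <= 0x115F) || (cp >= 0xA960 && cp <= 0xA97C)) {
            return Hst.L;
        }
        // Vowels (jungseong), including Jamo Extended-B.
        if ((cp >= 0x1160 && cp <= 0x11A7) || (cp >= 0xD7B0 && cp <= 0xD7C6)) {
            return Hst.V;
        }
        // Trailing consonants (jongseong), including Jamo Extended-B.
        if ((cp >= 0x11A8 && cp <= 0x11FF) || (cp >= 0xD7CB && cp <= 0xD7FB)) {
            return Hst.T;
        }
        // Precomposed syllables: every 28th one has no trailing consonant.
        if (cp >= 0xAC00 && cp <= 0xD7A3) {
            return ((cp - 0xAC00) % 28 == 0) ? Hst.LV : Hst.LVT;
        }
        // Everything else is Not_Applicable.
        return Hst.NA;
    }
}
```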

The best place to read up on the full set of UCD properties is at

    http://www.unicode.org/reports/tr44/tr44-4.html#Properties

There are several tables of properties there; at the top of the
file, though, it says:

    1 Introduction

    The Unicode Standard is far more than a simple encoding of characters.
    The standard also associates a rich set of semantics with each encoded
    character--properties that are required for interoperability and
    correct behavior in implementations, as well as for Unicode
    conformance. These semantics are cataloged in the Unicode Character
    Database (UCD), a collection of data files which contain the Unicode
    character code points and character names. The data files define the
    Unicode character properties and mappings between Unicode characters
    (such as case mappings).

That shows how important properties are. The conformance document 
also includes this statement:

    http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

    Interpretation of characters is a more complex issue for the Unicode
    Standard. It includes the core issue of interpreting code points
    used as characters according to the names and representative glyphs
    shown in the code charts, of course. However, the Unicode Standard
    also specifies character properties, behavior, and interactions
    between characters. Such information about characters is considered
    an integral part of the "character semantics established by this
    standard."

    Information about the properties, behavior, and interactions between
    Unicode characters is provided in the Unicode Character Database and
    in the Unicode Standard Annexes.

That again stresses the importance of properties and interactions between
characters.  That Java gives properties the same names that Unicode does
but gives them behaviours that are something else entirely is particularly
vexing.  I cannot see how that is conformant, either.  You have to do
what they say you have to do with the property names they give you.  If
you want your own behaviours, you can choose different property names.
But theirs are reserved to behave as they define them to behave.

I will therefore address the errors I believe Java makes in the
Alphabetic, Uppercase, Lowercase, and White_Space properties in
my next message, part 2 of RL1.2 Properties.

--tom

