<i18n dev> RL1.2 Properties (part 1 of 2)
Xueming Shen
xueming.shen at oracle.com
Sun Jan 23 00:07:53 PST 2011
Tom,
The Unicode/Java versions of the lowercase, uppercase, whitespace and letter
character classes are provided via \p{javaXYZ}, while
\p{Lower/Upper/Alpha/Space} are specified and implemented as the POSIX
versions, which is clearly documented in the API document. I would not use
"worst" for this. I don't think "conformance" requires an implementation to
use exactly the names specified in the standard.
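To make the distinction concrete, here is a small sketch of how the two
families differ on a non-ASCII lowercase letter. The class and method names
(LowerDemo, posixLower, javaLower) are made up for illustration, and it
assumes the pattern is compiled with default flags:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only Pattern.matches is real API.
class LowerDemo {
    // POSIX class: with default flags this is ASCII a-z only.
    static boolean posixLower(String s) {
        return Pattern.matches("\\p{Lower}", s);
    }

    // java class: behaves like Character.isLowerCase.
    static boolean javaLower(String s) {
        return Pattern.matches("\\p{javaLowerCase}", s);
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(posixLower("a"));    // true
        System.out.println(posixLower(eAcute)); // false
        System.out.println(javaLower(eAcute));  // true
    }
}
```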
The following classes/properties are actually supported and implemented.
Only \p{javaLowerCase}, \p{javaUpperCase}, \p{javaWhitespace} and
\p{javaMirrored} are explicitly documented in the Pattern API; the rest are
covered by the note "Categories that behave like the java.lang.Character
boolean ismethodname methods are available through the same \p{prop}
syntax...":
\p{javaLowerCase}
\p{javaUpperCase}
\p{javaTitleCase}
\p{javaDigit}
\p{javaDefined}
\p{javaLetter}
\p{javaLetterOrDigit}
\p{javaJavaIdentifierStart}
\p{javaJavaIdentifierPart}
\p{javaUnicodeIdentifierStart}
\p{javaUnicodeIdentifierPart}
\p{javaIdentifierIgnorable}
\p{javaSpaceChar}
\p{javaWhitespace}
\p{javaISOControl}
\p{javaMirrored}
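A quick way to check the "behave like the java.lang.Character boolean
ismethodname methods" note is to compare a \p{javaXYZ} class against the
matching Character method on a few code points. This is only a sketch: the
class JavaPropertyCheck and the helper agrees are hypothetical names, and
the sample code points are arbitrary.

```java
import java.util.function.IntPredicate;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class JavaPropertyCheck {
    // True when the regex property and the Character method give the
    // same answer for the code point cp.
    static boolean agrees(String property, IntPredicate method, int cp) {
        String s = new String(Character.toChars(cp));
        boolean viaRegex = Pattern.matches("\\p{" + property + "}", s);
        return viaRegex == method.test(cp);
    }

    public static void main(String[] args) {
        // 'A', '1', space, GREEK SMALL LETTER ALPHA, U+01C5 (a titlecase
        // letter), and a CJK ideograph.
        int[] samples = { 'A', '1', ' ', 0x03B1, 0x01C5, 0x4E2D };
        for (int cp : samples) {
            System.out.printf("U+%04X javaLetter agrees=%b javaTitleCase agrees=%b%n",
                cp,
                agrees("javaLetter", Character::isLetter, cp),
                agrees("javaTitleCase", Character::isTitleCase, cp));
        }
    }
}
```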
It appears that "noncharacter_cp" and "default_ignorable_cp" are missing
from the list. I will take a look later, but I guess these two are not
really that "significant".
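Until something like that exists in Pattern, the noncharacter test at least
is easy to do by hand, since the set is fixed by the Unicode Standard:
U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. The
sketch below is not part of any Java API; Noncharacters and isNoncharacter
are hypothetical names. Default_Ignorable_Code_Point is a derived property
with a much longer list, so it is not attempted here.

```java
// Hypothetical helper, not a java.lang or java.util.regex API.
class Noncharacters {
    // True for the 66 noncharacter code points: U+FDD0..U+FDEF and every
    // code point whose low 16 bits are FFFE or FFFF.
    static boolean isNoncharacter(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            return false; // outside the Unicode code space
        }
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Counts noncharacters across the whole code space.
    static int count() {
        int n = 0;
        for (int cp = 0; cp <= 0x10FFFF; cp++) {
            if (isNoncharacter(cp)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count()); // prints 66
    }
}
```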
-Sherman
On 1.22.2011 10:22, Tom Christiansen wrote:
> Java does not meet the requirement of RL1.2. It provides only 3 of the 11
> required properties; 4 it omits altogether, while 4 others it implements in
> a fashion contrary to the standard. Java also neglects the strongly
> recommended aspects of this section, which is quite a pity.
>
> From tr18:
>
> RL1.2 Properties
>
> To meet this requirement, an implementation shall provide at
> least a minimal list of properties, consisting of the following:
>
> General_Category
> Script
> Alphabetic
> Uppercase
> Lowercase
> White_Space
> Noncharacter_Code_Point
> Default_Ignorable_Code_Point
> ANY
> ASCII
> ASSIGNED
>
> Of those listed above as *shall provide*, Java indeed provides
> these three required properties from that minimum set:
>
> + The ASCII property.
>
> + The General_Categories like \p{Lu}, although only in their
> short forms; it does not provide the long forms.
>
> + The Script categories like \p{Greek}, a very *VERY*
> welcome addition for Unicode 6.0.
>
> Java does not provide these four required properties:
>
> - Noncharacter_Code_Point
> - Default_Ignorable_Code_Point
> - ANY
> - ASSIGNED
>
> The worst part is that Java gives non-Unicode meanings to
> these four Unicode properties (I'll give details on these
> lapses in a separate message):
>
> * Alphabetic
> * Uppercase
> * Lowercase
> * White_Space
>
> I would like to see everything given above addressed,
> and I do not understand how you can claim Level 1 conformance
> without doing so.
>
> There are also "strongly recommended" things that you do not
> implement, like loose matching of property names. That would
> not cost you much, I feel.
>
> tr18's section 1.2 also lists several "recommended" properties,
> not all of which are binary.
>
> Properties that are not absolutely required for compliance of
> RL1.2, but which I find especially useful, include these binary
> properties:
>
> \p{Dash}
> \p{Quotation_Mark}
> \p{Diacritic}
> \p{Math}
>
> If you are going to do \X for extended grapheme clusters instead
> of legacy grapheme clusters, then you will need access to Hangul
> Syllable Types, which is not a binary property.
>
> The best place to read up on the full set of UCD properties is at
>
> http://www.unicode.org/reports/tr44/tr44-4.html#Properties
>
> There are several tables of properties there; at the top of the
> file, though, it says:
>
> 1 Introduction
>
> The Unicode Standard is far more than a simple encoding of characters.
> The standard also associates a rich set of semantics with each encoded
> character--properties that are required for interoperability and
> correct behavior in implementations, as well as for Unicode
> conformance. These semantics are cataloged in the Unicode Character
> Database (UCD), a collection of data files which contain the Unicode
> character code points and character names. The data files define the
> Unicode character properties and mappings between Unicode characters
> (such as case mappings).
>
> That shows how important properties are. The conformance document
> also includes this statement:
>
> http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf
>
> Interpretation of characters is a more complex issue for the Unicode
> Standard. It includes the core issue of interpreting code points
> used as characters according to the names and representative glyphs
> shown in the code charts, of course. However, the Unicode Standard
> also specifies character properties, behavior, and interactions
> between characters. Such information about characters is considered
> an integral part of the "character semantics established by this
> standard."
>
> Information about the properties, behavior, and interactions between
> Unicode characters is provided in the Unicode Character Database and
> in the Unicode Standard Annexes.
>
> That again stresses the importance of properties and interactions between
> characters. That Java gives properties the same names that Unicode does but
> gives them behaviours that are something else entirely is particularly
> vexing. I cannot see how that is conformant, either. You have to do
> what they say you have to do with the property names they give you. If
> you want your own behaviours, you can choose different property names.
> But theirs are reserved to behave as they define them to behave.
>
> I will therefore address the errors I believe Java makes in the
> Alphabetic, Uppercase, Lowercase, and White_Space properties in
> my next message, part 2 of RL1.2 Properties.
>
> --tom