<i18n dev> java.lang.Character lacuna #1 of 2

Tom Christiansen tchrist at perl.com
Fri Apr 15 08:23:56 PDT 2011


>I have filed CR/RFE 7036910: 
>j.l.Character.toLowerCaseCharArray/toTitleCaseCharArray for this request.

Thanks very much.

> The j.l.Character.toLowerCase/toUpperCase() suggests to use
> String.toLower/UpperCase() for case mapping, if you want 1:M mapping
> taken care. And if you trust the API:-), which you should in this
> case, you will find that String.toLowerCase/toUpperCase() do handle
> 1:M correctly.

> Yes, we don't have a toLowerCaseCharArray() in j.l.c, however, as you
> noticed that there is ONLY one 1:M case mapping for toLowerCase, at
> least for now, and our String.toLowerCase() implementation
> "hardcodeds" that u+0130 as the special case.

Ahah good.  I had a feeling I should have looked at the String source.

> That said, I yet to dig out the history of toUpperCaseCharArray... and
> I agree, from API design point of view, it would be more nature to
> have the pair.

Well, the thing that seems to me to be more missing is toTitleCaseCharArray,
since it would be more apt to come up.  Right now you can't get at the
full casemapping for titlecase from Java, and you do sometimes need it.
It's harder to come up with reasonable demos in Latin than in Greek, since
mostly in Latin we have the ff/fi/ffl/ffi ligatures, whereas in Greek there
are lots of examples where you need full titlecasing, not simple.  Here's one:
  
    lower: ᾲ στο διάολο
    lower: \x{1FB2} \x{3C3}\x{3C4}\x{3BF} \x{3B4}\x{3B9}\x{3AC}\x{3BF}\x{3BB}\x{3BF}

    title: Ὰͅ Στο Διάολο
    title: \x{1FBA}\x{345} \x{3A3}\x{3C4}\x{3BF} \x{394}\x{3B9}\x{3AC}\x{3BF}\x{3BB}\x{3BF}

    upper: ᾺΙ ΣΤΟ ΔΙΆΟΛΟ
    upper: \x{1FBA}\x{399} \x{3A3}\x{3A4}\x{39F} \x{394}\x{399}\x{386}\x{39F}\x{39B}\x{39F}

That's because U+1FB2 goes to U+1FBA U+0399 for uppercase, but
it goes to U+1FBA U+0345 in titlecase.  

The lowercase 
    "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}" 
becomes this two-codepoint sequence in uppercase:
    "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{GREEK CAPITAL LETTER IOTA}"
but becomes this two-codepoint sequence in titlecase:
    "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI}"

That's why the U+0345 COMBINING GREEK YPOGEGRAMMENI is a \p{Lowercase} code
point, despite its being \p{GC=Mn}.  
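
Here is a minimal Java sketch of the gap described above.  It contrasts
String.toUpperCase(), which the quoted text says handles 1:M mappings, with
Character.toTitleCase(int), which can return only a single code point.  The
class and method names (FullCaseMappingDemo, fullUpper, simpleTitle, hex) are
made up for illustration, and Locale.ROOT is my assumption for getting
locale-neutral behavior.

```java
import java.util.Locale;

class FullCaseMappingDemo {
    // Render a string as a space-separated list of U+XXXX code points.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", s.codePointAt(i)));
        }
        return sb.toString();
    }

    // Full (possibly 1:M) uppercase mapping via the String API.
    static String fullUpper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    // Simple (1:1) titlecase mapping; a single int cannot hold a 1:M result.
    static int simpleTitle(int cp) {
        return Character.toTitleCase(cp);
    }

    public static void main(String[] args) {
        String alpha = "\u1FB2"; // GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI
        System.out.println("full upper:   " + hex(fullUpper(alpha)));
        System.out.println("simple title: "
                + hex(new String(Character.toChars(simpleTitle(alpha.codePointAt(0))))));
        // LATIN SMALL LETTER SHARP S is the familiar Latin case of a 1:M mapping.
        System.out.println("sharp s:      " + hex(fullUpper("\u00DF")));
    }
}
```

There is no String-level titlecase call to put in the second line, which is
the whole point of asking for toTitleCaseCharArray.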

It's this kind of thing that set me to fixing the j.l.Character documentation,
because isLowerCase and isUpperCase and such were misstating what they did.
All they do is test for \p{GC=Ll} and \p{GC=Lu} respectively; they do not
actually test for \p{Lowercase} and \p{Uppercase}, which are binary properties
that work on more than just letters. 
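
A rough Java sketch of the distinction follows.  It assumes a JDK whose
java.util.regex accepts the binary property syntax \p{IsLowercase} (JDK 7 or
later).  Because Character.isLowerCase may itself honor Other_Lowercase on
such a JDK, the general-category test is done with Character.getType instead.
The names LowercaseTests, isLowercaseLetter and hasLowercaseProperty are
hypothetical.

```java
import java.util.regex.Pattern;

class LowercaseTests {
    // Binary Unicode property Lowercase (assumed supported: JDK 7 or later).
    private static final Pattern LOWERCASE = Pattern.compile("\\p{IsLowercase}");

    // General category test only: is the code point GC=Ll?
    static boolean isLowercaseLetter(int cp) {
        return Character.getType(cp) == Character.LOWERCASE_LETTER;
    }

    // Binary property test: does the code point have the Lowercase property?
    static boolean hasLowercaseProperty(int cp) {
        return LOWERCASE.matcher(new String(Character.toChars(cp))).matches();
    }

    public static void main(String[] args) {
        int[] samples = { 'a', 'A', 0x0345 }; // 0x0345: COMBINING GREEK YPOGEGRAMMENI
        for (int cp : samples) {
            System.out.printf("U+%04X  GC=Ll: %-5b  Lowercase: %b%n",
                    cp, isLowercaseLetter(cp), hasLowercaseProperty(cp));
        }
    }
}
```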

I don't know what you can do with the API.  If one had it to do over again, 
it would be clearly preferable to distinguish

    isLowerCaseLetter vs isLowerCase 
    isUpperCaseLetter vs isUpperCase 

where the latter is the full test and the former is only for letters. But
you of course can't do that now, so you're stuck with the existing name.

I've tried to think of an alternate name for \p{Lowercase} that allows you to
stick with the existing naming (which of course is absolutely mandatory).
The problem is that it fails the Huffman encoding principle of making the
shorter thing the more commonly used variant, but I can't see a way around
that, since these names are already taken:

    isLowerCase                 \p{GC=Ll}   \p{Lowercase_Letter}
    isUpperCase                 \p{GC=Lu}   \p{Uppercase_Letter}

I think people will go nuts if they have to type this:

    isLowerCaseCodePoint        \p{Lowercase}  \p{Lower}
    isUpperCaseCodePoint        \p{Uppercase}  \p{Upper}

I suppose you might be able to do this:

    isLower                     \p{Lowercase}  \p{Lower}
    isUpper                     \p{Uppercase}  \p{Upper}

Note that I'm using the official Unicode property names there, 
because PropertyAliases.txt defines

    Lower     ; Lowercase
    Upper     ; Uppercase

And of course Uppercase and Lowercase are the properties that
work for all code points, not just Letters.  That is, they're
the non-GC versions from:

    http://www.unicode.org/reports/tr44/#Property_Index

The \p{upper} and \p{lower} style is also what tr18's RL1.2a 
uses for compatibility properties: those are in lines 2 and 3
of the compat table.

    http://unicode.org/reports/tr18/#Compatibility_Properties

> Yes, we do have a RFE 6423415: (str) Add String.toTitleCase()

> But given the nature of "title case", the String#toTitleCase() might
> not be what you would like it to be. It would be strange if
> String#toTitleCase() does the similar thing the
> String.toLower/UpperCase() do, in which it title-case-maps all
> characters inside the String, most people probably would expect it
> only title-case-map the first character of the "title string". RFE
> 6423415 has very low priority for now.

> It might be more reasonable to have j.l.Character.toTitleCaseCharArray() 
> instead of j.l.String.toTitleCase().

Yes, I think you're right.

In fact, Perl does not provide a function that will titlecase *all* of a
string (although you can always write a loop).  We only have a function to
titlecase the string's first code point, called for compatibility reasons
ucfirst() and available as the "\u" string escape.  That is, "\u$a"
compiles to ucfirst($a).  For the whole string, we use uc() (or \U) which
uppercases, not titlecases.  And of course lc (or \L) lowercases the whole
string, although lcfirst (\l) just does the first character.

That means to generate the strings above, I wrote 

    s/(\w+)/\u\L$1/g;

which as a code expression instead of string interpolation would be

    s/(\w+)/ucfirst(lc($1))/ge;

That's a bit cavalier, of course, since \w grabs more than just
things that change case.  However, you can't write:

    s/(\pL+)/\u\L$1/g;

because that misses the nonletters.  The \p{Alphabetic} property
should I believe work for this.

    s/(\p{alpha}+)/\u\L$1/g;

That works because Perl uses the entry from PropertyAliases.txt:

    Alpha     ; Alphabetic

which is also the RL1.2a guideline, even in POSIX compat mode.  
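
For comparison, here is a hedged sketch of the same per-word operation in
Java.  It assumes \p{IsAlphabetic} is available in java.util.regex (JDK 7 or
later) and uses Locale.ROOT for locale-neutral lowercasing; the class and
method names are hypothetical.  Since Character.toTitleCase(int) is a simple
1:1 mapping, this cannot produce the full titlecase of something like U+1FB2.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitlecaseWords {
    // Runs of Alphabetic code points, standing in for the Perl \p{alpha}+ above.
    private static final Pattern WORD = Pattern.compile("\\p{IsAlphabetic}+");

    // Lowercase each word, then titlecase only its first code point.
    static String titlecaseWords(String text) {
        StringBuilder out = new StringBuilder();
        Matcher m = WORD.matcher(text);
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());            // copy the gap before the word
            String lower = m.group().toLowerCase(Locale.ROOT);
            int first = lower.codePointAt(0);
            out.appendCodePoint(Character.toTitleCase(first)); // simple mapping only
            out.append(lower, Character.charCount(first), lower.length());
            last = m.end();
        }
        out.append(text, last, text.length());            // copy the tail
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(titlecaseWords("hello wORLD"));
    }
}
```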

But because we have access to all Unicode properties, there are
arguably more appropriate ones, like \p{Cased} -- except that doesn't
guarantee that the thing will change (in case you care).  These however do
(from PropertyAliases.txt):

    CWCF      ; Changes_When_Casefolded
    CWCM      ; Changes_When_Casemapped
    CWKCF     ; Changes_When_NFKC_Casefolded
    CWL       ; Changes_When_Lowercased
    CWT       ; Changes_When_Titlecased
    CWU       ; Changes_When_Uppercased

So you could do any of a bunch of things:

    s/(\p{Cased}+)/\u\L$1/g;
    s/(\p{CWT}\p{CWL}+)/\u\L$1/g;
    s/(\p{CWT})(\p{CWL}+)/\u$1\L$2/g;

In practice some \b boundaries might be a good idea there.

You really have quite a lot of flexibility when you have all 
the Unicode properties available to you.
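
Without those properties in the regex engine, about the best you can do from
Java is approximate them in code.  The sketch below is only a stand-in for
CWL and CWU, not the Unicode definition itself: it normalizes a code point to
NFD and checks whether String.toLowerCase or String.toUpperCase changes the
result.  The names are hypothetical and Locale.ROOT is my assumption.  There
is no String-level titlecase mapping to build a CWT analogue from.

```java
import java.text.Normalizer;
import java.util.Locale;

class ChangesWhenCased {
    // NFD form of a single code point, as a String.
    static String nfd(int cp) {
        return Normalizer.normalize(new String(Character.toChars(cp)), Normalizer.Form.NFD);
    }

    // Approximation of Changes_When_Lowercased.
    static boolean changesWhenLowercased(int cp) {
        String d = nfd(cp);
        return !d.toLowerCase(Locale.ROOT).equals(d);
    }

    // Approximation of Changes_When_Uppercased.
    static boolean changesWhenUppercased(int cp) {
        String d = nfd(cp);
        return !d.toUpperCase(Locale.ROOT).equals(d);
    }

    public static void main(String[] args) {
        int[] samples = { 'A', 'a', '1', 0x00DF };
        for (int cp : samples) {
            System.out.printf("U+%04X  CWL~%b  CWU~%b%n",
                    cp, changesWhenLowercased(cp), changesWhenUppercased(cp));
        }
    }
}
```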

I don't know how you're going to get the properties into Java.
You have a problem already at Level 1, which doesn't require very 
many.  What you'll do when you get to the rest, I don't quite know, but
I think you will have to choose some sort of prefix for the properties
whose names you have already defined in a way that conflicts with the 
Unicode definition. Maybe a leading "U"?  Since underscores don't (well,
aren't *supposed* to) count, that could just be:

    \p{U_Space}
    \p{U_Alpha}
    \p{U_Lower}

etc.   There is a proposed revision to tr18 that outlines
this path toward compliance as a perfectly valid one.

    http://unicode.org/reports/tr18/proposed.html#Full_Properties

    RL2.7	Full Properties

        To meet this requirement, an implementation shall support all of
        the properties listed below that are in the supported version of
        Unicode, with values that match the Unicode definitions for that
        version.

    As in RL1.2 Properties, in order to meet requirement RL2.7, the
    implementation has to satisfy the Unicode definition of the properties
    for the supported version of Unicode, not other possible definitions.
    However, the names used for the properties might need to be different
    for compatibility. For example, if a regex engine already has
    "Alphabetic", for compatibility it may need a different name, such as
    "Unicode_Alphabetic" for the Unicode property.

    The list excludes contributed properties, obsolete and deprecated
    properties, and the Unicode 1 Name and Unicode Radical Stroke
    properties. The properties in gray are covered by RL1.2 Properties.

It seems to me that you might be going to need this for RL1.2, also, since
you have definitions for the POSIX properties that don't match what RL1.2a
says they should.

In Perl, we split off the [:upper:] things from the \p{upper} things so that
we could be strictly POSIXy on the former but fully compliant with tr18 on 
the latter.  In Java you don't have the former syntax available, and your
version of the latter syntax is "wrong". 
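
A small sketch of what "wrong" means here: by default Java's \p{Lower} is the
ASCII-only POSIX class, so it rejects a lowercase letter outside a-z.  The
second method assumes the Pattern.UNICODE_CHARACTER_CLASS flag, which this
message does not mention; it is available in JDK 7 and later and switches the
POSIX-style names to the Unicode definitions.  Class and method names are
hypothetical.

```java
import java.util.regex.Pattern;

class PosixVsUnicodeLower {
    // Default behavior: \p{Lower} means ASCII a-z only.
    private static final Pattern POSIX_LOWER = Pattern.compile("\\p{Lower}+");

    // Assumed flag (JDK 7 or later): \p{Lower} follows the Unicode Lowercase property.
    private static final Pattern UNICODE_LOWER =
            Pattern.compile("\\p{Lower}+", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean posixLower(String s) {
        return POSIX_LOWER.matcher(s).matches();
    }

    static boolean unicodeLower(String s) {
        return UNICODE_LOWER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println("POSIX-style:   " + posixLower(eAcute));
        System.out.println("Unicode-style: " + unicodeLower(eAcute));
    }
}
```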

This is just part of why my fixing up the j.l.Pattern docs will take longer.
Mostly I want to fix the things it says about Perl that are wrong.
Some of those are wrong because they're outdated, and some are wrong
because they were never true.

Do you think I should use 5.12 as the version of Perl compared against, 
or should I use 5.14 (which is in late RC0) because it is the one that
uses Unicode 6.0 and so would match JDK7?

--tom

