<i18n dev> Review request: 7037261: j.l.Character.isLowerCase/isUpperCase need to match the Unicode Standard definition

Tue Apr 19 17:22:24 PDT 2011

  Hi

Tom Christiansen recently contributed an API doc update [1] for
j.l.Character, as a follow-up to the discussion of Unicode support in
j.l.Character/j.u.regex that we had back in January. In his doc patch,
Tom recommended "downgrading" the doc for the
j.l.Character.isLowerCase/isUpperCase(char/int) methods from "character"
to "letter", to describe accurately what the j.l.Character class
currently does: the current API spec and implementation of these methods
are in fact only about letters. They decide whether a character is
lowercase or uppercase based solely on whether its general category type
is LOWERCASE_LETTER or UPPERCASE_LETTER. The Unicode Standard, however,
clearly defines a lowercase/uppercase character as GC=Ll/Lu +
Other_Lowercase/Other_Uppercase in ch04/4.2 Case [2]. As a result of
this difference, the j.l.Character.isLowerCase/isUpperCase() methods
don't work correctly for any of the Unicode
Other_Lowercase/Other_Uppercase characters (201 of them, as of
Unicode 6.0).
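
To make the difference concrete, here is a minimal sketch of the
category-only check described above, printed next to whatever the running
JDK's Character.isLowerCase/isUpperCase return. The class and helper
names (CaseCheckSketch, isLowerCaseByCategoryOnly,
isUpperCaseByCategoryOnly) are made up for illustration and are not part
of the webrev. The sample code points are U+2170 SMALL ROMAN NUMERAL ONE
and U+2160 ROMAN NUMERAL ONE (both GC=Nl, with Other_Lowercase and
Other_Uppercase respectively), plus their circled-letter counterparts
U+24D0 and U+24B6 (GC=So).

```java
public class CaseCheckSketch {

    // Hypothetical helper: decides "lowercase" from the general category
    // alone, i.e. the letter-only behavior described in this mail.
    public static boolean isLowerCaseByCategoryOnly(int codePoint) {
        return Character.getType(codePoint) == Character.LOWERCASE_LETTER;
    }

    // Hypothetical helper: the same category-only check for "uppercase".
    public static boolean isUpperCaseByCategoryOnly(int codePoint) {
        return Character.getType(codePoint) == Character.UPPERCASE_LETTER;
    }

    public static void main(String[] args) {
        int[] samples = { 'a', 'A', 0x2170, 0x2160, 0x24D0, 0x24B6 };
        for (int cp : samples) {
            // The last two columns depend on the JDK in use: they only
            // differ from the category-only columns if the JDK follows
            // the Unicode Standard definition.
            System.out.printf(
                "U+%04X lowerByCategory=%b upperByCategory=%b "
                    + "isLowerCase=%b isUpperCase=%b%n",
                cp,
                isLowerCaseByCategoryOnly(cp),
                isUpperCaseByCategoryOnly(cp),
                Character.isLowerCase(cp),
                Character.isUpperCase(cp));
        }
    }
}
```

The category-only helpers report false for U+2170 and U+2160, which is
exactly the gap that the Other_Lowercase/Other_Uppercase properties are
meant to close.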

I totally agree with Tom on this point. But instead of updating the
j.l.Character documentation in JDK 7 to describe the difference between
the Java spec/implementation and the Unicode Standard definition, and
leaving the real fix to JDK 8 (given that we all agree this is something
we need to address in a future release if we don't address it now), I
personally prefer to address the issue in one step: update both the spec
and the implementation of these methods to match the Unicode Standard
definition in JDK 7, if we can manage to squeeze this in at this very
late stage of the release. It appears Tom prefers this approach as well,
if it is achievable.

So here is the webrev

http://cr.openjdk.java.net/~sherman/7037261/webrev

Besides these 4 isLowerCase/isUpperCase() methods, we also propose to
add two new methods, isAlphabetic and isIdeographic, to support two more
important Unicode character properties. These are specified in Unicode
Standard ch04/4.2 Case/4.11 [2] and defined in tr44 [3][4].
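
As a rough illustration of what the two proposed methods report, the
sketch below assumes they are available as Character.isAlphabetic(int)
and Character.isIdeographic(int), which is how they appear in JDK
releases that include them; it will not compile on an older JDK. The
class name NewPropertiesSketch and the describe helper are made up for
this example.

```java
public class NewPropertiesSketch {

    // Hypothetical helper: summarizes three properties of one code point
    // so that isLetter and the two new properties can be compared.
    public static String describe(int codePoint) {
        return String.format(
            "U+%04X letter=%b alphabetic=%b ideographic=%b",
            codePoint,
            Character.isLetter(codePoint),
            Character.isAlphabetic(codePoint),
            Character.isIdeographic(codePoint));
    }

    public static void main(String[] args) {
        // 'A'    : an ordinary letter
        // U+2170 : SMALL ROMAN NUMERAL ONE, GC=Nl (Alphabetic, not a letter)
        // U+4E2D : a CJK unified ideograph
        // '1'    : a decimal digit
        int[] samples = { 'A', 0x2170, 0x4E2D, '1' };
        for (int cp : samples) {
            System.out.println(describe(cp));
        }
    }
}
```

U+2170 shows why Alphabetic is a separate property: it is not a letter by
general category, yet the Alphabetic property includes GC=Nl characters.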

Given the "incompatible" nature of the request (these 4 methods change
behavior for those 201 code points), this proposal is under CCC review.
Whether or not this request makes it into JDK 7 depends on the CCC
review result and on whether the review can be finished before the final
JDK 7 cutoff.

Thanks,
Sherman

[1] http://mail.openjdk.java.net/pipermail/i18n-dev/2011-April/000358.html
[2] http://www.unicode.org/versions/Unicode6.0.0/ch04.pdf
[3] http://www.unicode.org/reports/tr44/#Alphabetic
[4] http://www.unicode.org/reports/tr44/#Ideographic