<i18n dev> Errors in Java casing

Tom Christiansen tchrist at perl.com
Fri Aug 12 12:07:18 PDT 2011


Just a quick followup.  This bug, with equalsIgnoreCase working only
within the BMP, went undetected all the way up through Unicode 3.1.
That's when the Deseret script was introduced, which is a case-changing
script outside the BMP.  That was more than 10 years ago now.  Obviously
no one is screaming about it, but we never know what will happen later,
and there is no reason for Java to misbehave on cased code points that
are someday added outside the BMP.  Best to future-proof it.

Apparently there was never any organized code inspection of the core
Java libraries to find everything that processes Strings in a char-wise
fashion and change it to work by code point, unless it really and truly
made no difference, which in this case it does.  That surprises me.
This would also have been caught by an extensive test suite that tried
all code points for various things.
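
To make the char-wise versus code-point-wise distinction concrete, here
is a minimal sketch of a comparison that walks both strings by code
point.  The class and method names are hypothetical, not JDK API, and it
uses only the simple one-to-one case mappings that Character already
provides.  It is an illustration under those assumptions, not a proposed
patch.

```java
final class CodePointCompare {
    // Hypothetical helper, not a JDK method: case-insensitive equality that
    // walks both strings by code point instead of by UTF-16 char.
    static boolean equalsIgnoreCaseByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            i += Character.charCount(ca);
            j += Character.charCount(cb);
            if (ca == cb) {
                continue;
            }
            // Simple (one-to-one) case mappings only; no full casefold here.
            int ua = Character.toUpperCase(ca);
            int ub = Character.toUpperCase(cb);
            if (ua != ub && Character.toLowerCase(ua) != Character.toLowerCase(ub)) {
                return false;
            }
        }
        // Equal only if both strings were consumed completely.
        return i == a.length() && j == b.length();
    }

    public static void main(String[] args) {
        // U+10400 and U+10428 are the Deseret capital and small LONG I.
        String upper = new String(Character.toChars(0x10400));
        String lower = new String(Character.toChars(0x10428));
        System.out.println(equalsIgnoreCaseByCodePoint(upper, lower));
        // The built-in result may vary by JDK release.
        System.out.println(upper.equalsIgnoreCase(lower));
    }
}
```

A test suite of the kind mentioned above could loop over every code
point and check that each one compares equal to its own upper- and
lowercase mapping.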

Even processing strings by code point doesn't give the best results.  I'd
rather like to see a way to disregard lengths and compare the two
strings' full casefolds instead.  However, I recognize that this has
performance impacts at the very least and perhaps compatibility ones as
well, so arguably a new and different method might be a more appropriate
solution if that route were deemed sufficiently desirable.

The problem is that this is unreasonably hard to implement on one's own
without a method that produces a string's casefold.  Because of this, I
believe Java needs a String method that returns the full casefold of that
string, and perhaps, for performance reasons, also a Character method
that takes a code point and returns its simple casefold only.
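
As a rough illustration of what such methods might look like from the
caller's side, here is a sketch that approximates a casefold by
uppercasing and then lowercasing.  All names here are hypothetical, and
this round trip is only an approximation: it does not consult Unicode's
CaseFolding.txt and will not agree with a true casefold for every
character.

```java
import java.util.Locale;

final class CaseFoldSketch {
    // Hypothetical stand-in for a String casefold method.  Uppercasing first
    // applies one-to-many mappings such as sharp s to "SS"; lowercasing then
    // brings the result to a common form.  An approximation, not a real fold.
    static String approxFullFold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // Hypothetical stand-in for a Character simple-casefold method.  It uses
    // only one-to-one mappings, so it never changes the number of code points.
    static int approxSimpleFold(int codePoint) {
        return Character.toLowerCase(Character.toUpperCase(codePoint));
    }

    // Compares the approximate folds, so the original lengths do not matter.
    static boolean equalsFolded(String a, String b) {
        return approxFullFold(a).equals(approxFullFold(b));
    }

    public static void main(String[] args) {
        // "Stra\u00DFe" is "Strasse" spelled with a sharp s (6 chars vs 7).
        System.out.println(equalsFolded("Stra\u00DFe", "STRASSE"));
        System.out.println("Stra\u00DFe".equalsIgnoreCase("STRASSE"));
    }
}
```

The length check in equalsIgnoreCase rejects the second comparison before
any characters are examined, which is why comparing folds rather than
chars or code points is the only way to match such pairs.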

I don't know how locales enter into that, either.  The casefolding
rules leave room for locale-specific behavior such as Turkic, since
text gets a different (full) casefold in that locale.
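
Java has no casefold method to demonstrate this with, but its existing
locale-sensitive case mapping shows the same Turkic effect.  The sketch
below uses a hypothetical helper name and compares lowercased strings
under a given locale, which is only a stand-in for a locale-aware
casefold.

```java
import java.util.Locale;

final class TurkicCaseDemo {
    // Hypothetical helper: compares two strings after lowercasing both
    // under the given locale.  A stand-in for a locale-aware casefold.
    static boolean equalsLowercased(String a, String b, Locale locale) {
        return a.toLowerCase(locale).equals(b.toLowerCase(locale));
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Under Turkish rules, capital I lowercases to dotless i (U+0131).
        System.out.println("I".toLowerCase(turkish).equals("\u0131"));
        // The same pair of strings matches in one locale and not the other.
        System.out.println(equalsLowercased("TITLE", "title", Locale.ROOT));
        System.out.println(equalsLowercased("TITLE", "title", turkish));
    }
}
```

Any casefold method added to String would presumably need either a
Locale parameter or a documented locale-independent default to settle
this.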

--tom

