<i18n dev> Errors in Java casing

Wed Aug 10 11:39:38 PDT 2011

I've discovered some errors in Java's case insensitive methods
for its String class.  Its equalsIgnoreCase() is the most
obvious one that gets things wrong, but there are several
others as well.

There is inarguably at least one significant bug, and quite plausibly
several others as well.  I've looked at the JDK7 source, and these remain
buggy.  I enclose a testing program to illustrate these bugs.  It 
runs the tests against both JDK and the equivalent ICU function.
The JDK gets many of them wrong, while ICU gets them all correct.

--tom

==TECHNICAL DETAILS FOLLOW==

The source of these bugs is the many, many ASCII assumptions regarding
casing, assumptions that do not hold for Unicode.  There are also holdover
bugs due to ignorance about Unicode outside the BMP that come from Unicode
1's 16-bitness, which no longer applies.

The easiest bug to illustrate is that

    "𐐔𐐇𐐝𐐀𐐡𐐇𐐓".equalsIgnoreCase("𐐼𐐯𐑅𐐨𐑉𐐯𐐻")

erroneously returns false when Unicode demands that it return true.
The bug here is that regionMatches() thinks that a String comprises
a sequence of 16-bit Unicode characters.  It does not.  Those are char
units, not Unicode characters, which are 21-bit quantities normally rounded
up to 32 bits, not down to 16.  The example strings I just used are
in Deseret, which is a case-changing script in the SMP, not in the BMP.
Therefore, the char-based monkey work is broken.  You can see how broken
this is right here, from regionMatches:

        while (len-- > 0) {
            char c1 = ta[to++];
            char c2 = pa[po++];
            if (c1 == c2) {
                continue;
            }
            if (ignoreCase) {
                // If characters don't match but case may be ignored,
                // try converting both characters to uppercase.
                // If the results match, then the comparison scan should
                // continue.
                char u1 = Character.toUpperCase(c1);
                char u2 = Character.toUpperCase(c2);
                if (u1 == u2) {
                    continue;
                }
                // Unfortunately, conversion to uppercase does not work properly
                // for the Georgian alphabet, which has strange rules about case
                // conversion.  So we need to make one last check before
                // exiting.
                if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                    continue;
                }
            }

That is the source of this bug, and several others I shall describe.  It
needs to be reworked so that it actually works with all Unicode strings.
You cannot store arbitrary Unicode characters in Java char variables, and
you must not call the char-based Character casing methods, since a char
covers only the BMP, which is a small fraction of Unicode's range.
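
A rough sketch of what a code-point-aware version of that loop might look
like, assuming nothing beyond the standard int overloads of codePointAt,
Character.toUpperCase and Character.toLowerCase.  The class and method
names are made up for illustration.  Note that it still uses simple case
mappings only, so it addresses the surrogate problem and nothing else.

```java
public class CodePointCompare {

    // Hypothetical helper: same logic as the regionMatches loop quoted above,
    // but it walks code points instead of chars, so a surrogate pair is
    // treated as one character.
    static boolean equalsIgnoreCaseByCodePoint(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int c1 = a.codePointAt(i);
            int c2 = b.codePointAt(j);
            i += Character.charCount(c1);
            j += Character.charCount(c2);
            if (c1 == c2) {
                continue;
            }
            int u1 = Character.toUpperCase(c1);
            int u2 = Character.toUpperCase(c2);
            if (u1 == u2) {
                continue;
            }
            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                continue;
            }
            return false;
        }
        // Equal only if both strings were consumed completely.
        return i == a.length() && j == b.length();
    }

    public static void main(String[] args) {
        // DESERET CAPITAL LETTER DEE (U+10414), DESERET SMALL LETTER DEE (U+1043C)
        String capitalDee = new String(Character.toChars(0x10414));
        String smallDee = new String(Character.toChars(0x1043C));
        System.out.println(equalsIgnoreCaseByCodePoint(capitalDee, smallDee));
    }
}
```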

Next, if you are comparing strings, you should not be using simple case
maps, and you must not assume that strings don't change length when
casemapped or folded, because they quite obviously do.  You can't let
equalsIgnoreCase()  short-circuit to failure because the strings have 
different lengths. That is an ASCII mindset for characters. It
is contrary to a Unicode mindset for strings.
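
To make the length point concrete, here is a small sketch that uses only
String.toUpperCase(Locale), which already applies the full (string-level)
case mappings in the JDK.  The class and method names are hypothetical.

```java
import java.util.Locale;

public class LengthChange {

    // Full, locale-neutral uppercase mapping of the whole string.
    static String fullUpper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "\uFB02our" starts with LATIN SMALL LIGATURE FL,
        // "e\uFB03cient" contains LATIN SMALL LIGATURE FFI.
        String[] samples = { "\uFB02our", "e\uFB03cient" };
        for (String s : samples) {
            String u = fullUpper(s);
            // Print lengths only, to stay independent of console encoding.
            System.out.printf("%d chars -> %d chars%n", s.length(), u.length());
        }
    }
}
```

A length pre-check would reject both pairs before ever looking at the
characters, even though each ligature string uppercases to plain ASCII.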

Plus you are supposed to be using (full) case*folds*, not casemaps; the
two are quite different from one another.  Here is an example:

    Original:               weiß            WEIẞ

    Simple casemaps for chars:
        lowercase           weiß            weiß
        uppercase           WEIß            WEIẞ

    Full casemaps for strings:
        lowercase           weiß            weiß
        uppercase           WEISS           WEIẞ

    Casefolds:
        fold simple         weiß            weiß
        fold full           weiss           weiss

The final line is the most important, which shows that they are the
same because they have the same casefold.  Q.E.D.
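
The casemap rows of that table can be reproduced with the JDK alone.  This
is only a sketch: the simple mapping is approximated by calling
Character.toUpperCase(int) on each code point, the full mapping is
String.toUpperCase(Locale.ROOT), and the class and method names are
invented for the example.  The casefold rows cannot be shown this way,
because the JDK has no casefold method.

```java
import java.util.Locale;

public class CasemapRows {

    // Simple uppercase: each code point is mapped on its own, one-to-one.
    static String simpleUpper(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toUpperCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Full uppercase: the string-level mapping, which may change the length.
    static String fullUpper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String small = "wei\u00DF";    // wei + LATIN SMALL LETTER SHARP S
        String capital = "WEI\u1E9E";  // WEI + LATIN CAPITAL LETTER SHARP S

        // Simple mapping leaves the sharp s alone.
        System.out.println(simpleUpper(small).equals("WEI\u00DF"));
        // Full mapping expands it to SS.
        System.out.println(fullUpper(small).equals("WEISS"));
        // The capital sharp s is already uppercase under both mappings.
        System.out.println(fullUpper(capital).equals(capital));
    }
}
```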

To compare whether two strings are the same without respect to case, you
must first calculate their respective Unicode casefolds (not casemaps!) 
and then compare those.   You must not compare either the original strings 
or casemaps generated from those, as both of those can give wrong answers.
You must use casefolds.
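
As an illustration of fold-then-compare, here is a deliberately tiny
sketch.  Everything in it is an assumption made for the example: the class
and method names are hypothetical, the table holds only a handful of full
foldings copied by hand, and the fallback (uppercase, then lowercase, of
each code point) merely approximates the simple fold.  A real
implementation would load the complete Unicode CaseFolding.txt data
instead.

```java
import java.util.HashMap;
import java.util.Map;

public class FoldSketch {

    // Hand-copied excerpt of full case foldings.  NOT the complete table.
    private static final Map<Integer, String> FULL_FOLDS =
        new HashMap<Integer, String>();
    static {
        FULL_FOLDS.put(0x00DF, "ss");   // LATIN SMALL LETTER SHARP S
        FULL_FOLDS.put(0x1E9E, "ss");   // LATIN CAPITAL LETTER SHARP S
        FULL_FOLDS.put(0xFB02, "fl");   // LATIN SMALL LIGATURE FL
        FULL_FOLDS.put(0xFB03, "ffi");  // LATIN SMALL LIGATURE FFI
        FULL_FOLDS.put(0xFB05, "st");   // LATIN SMALL LIGATURE LONG S T
    }

    // Hypothetical fold: table lookup first; otherwise approximate the
    // simple fold by uppercasing and then lowercasing the code point.
    static String fold(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            String full = FULL_FOLDS.get(cp);
            if (full != null) {
                sb.append(full);
            } else {
                sb.appendCodePoint(
                    Character.toLowerCase(Character.toUpperCase(cp)));
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Compare the folds, never the originals and never the casemaps.
    static boolean foldEquals(String a, String b) {
        return fold(a).equals(fold(b));
    }

    public static void main(String[] args) {
        System.out.println(foldEquals("wei\u00DF", "WEI\u1E9E"));
        System.out.println(foldEquals("wei\u00DF", "WEISS"));
        System.out.println(foldEquals("e\uFB03cient", "EFFICIENT"));
        System.out.println(foldEquals("po\u017Ft", "post"));
    }
}
```

No length check appears anywhere before the fold, since the folded strings
are the only things whose lengths are comparable.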

But there is nothing in String or Character that gives you the casefold.
There needs to be.  Character should provide the simple casefold and
String should provide the full casefold.  (I don't know what to do about
locales and Turkish casefolds.)
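
The locale wrinkle is easiest to see with the JDK's existing
locale-sensitive casemaps.  These are casemaps, not casefolds, but the
dotted versus dotless i distinction shown here is the same one a Turkic
casefold has to respect.  A small sketch, with a made-up class name:

```java
import java.util.Locale;

public class TurkishCasing {

    // Lowercase a string under the locale named by a BCP 47 language tag.
    static String lower(String s, String languageTag) {
        return s.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        String word = "TITLE";
        // English rules: I lowercases to i.
        System.out.println(lower(word, "en").equals("title"));
        // Turkish rules: I lowercases to U+0131, LATIN SMALL LETTER DOTLESS I.
        System.out.println(lower(word, "tr").equals("t\u0131tle"));
    }
}
```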

Demo program enclosed.  I compare results from Java's
String.equalsIgnoreCase() with results from ICU's
CaseInsensitiveString.equals().

Make sure your classpath has the (current) ICU library in it,
and make sure to compile with "javac -encoding UTF-8".

Hope this helps.

--tom

-------------- next part --------------
import java.lang.System;
import java.io.*;

import com.ibm.icu.util.CaseInsensitiveString;

public class weiss { 

    private static BufferedReader  stdin;
    private static PrintStream     stdout, stderr;

    public static void eqtest(String s1, String s2) {
        CaseInsensitiveString si1 = new CaseInsensitiveString(s1);
        CaseInsensitiveString si2 = new CaseInsensitiveString(s2);

        stdout.printf("%s: Java %s equals %s\n",
            s1.equalsIgnoreCase(s2) ? "pass" : "FAIL", s1, s2);

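        // Compare the two CaseInsensitiveString wrappers with each other;
        // comparing a wrapper against a plain String is not a reliable test.
        stdout.printf("%s: ICU  %s equals %s\n\n",
            si1.equals(si2) ? "pass" : "FAIL", s1, s2);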

    } 

    public static void main(String argv[]) { 

        try { 
            stdin  = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
            stdout = new PrintStream(System.out, true, "UTF-8");
            stderr = new PrintStream(System.err, true, "UTF-8");
        } catch (IOException hosed) {
            // The format has two %s slots, so supply the program name too.
            System.err.printf("%s: error setting std streams to UTF-8: %s.\n",
                "weiss", hosed.getMessage());
            System.exit(1);
        }

        eqtest("ẛ", "Ṡ"); // "\N{LATIN SMALL LETTER LONG S WITH DOT ABOVE}", "\N{LATIN CAPITAL LETTER S WITH DOT ABOVE}"
        eqtest("µ", "Μ"); // "\N{MICRO SIGN}", "\N{GREEK CAPITAL LETTER MU}"
        eqtest("µ", "μ"); // "\N{MICRO SIGN}", "\N{GREEK SMALL LETTER MU}"

        eqtest("ƦᴀƦᴇ", "ʀᴀʀᴇ"); // "\N{LATIN LETTER YR}\N{LATIN LETTER SMALL CAPITAL A}\N{LATIN LETTER YR}\N{LATIN LETTER SMALL CAPITAL E}", "\N{LATIN LETTER SMALL CAPITAL R}\N{LATIN LETTER SMALL CAPITAL A}\N{LATIN LETTER SMALL CAPITAL R}\N{LATIN LETTER SMALL CAPITAL E}"

        eqtest("eﬃcient", "EFFICIENT"); // "e\N{LATIN SMALL LIGATURE FFI}cient", "EFFICIENT"

        eqtest("ﬂour and water", "FLOUR AND WATER");    // "ﬂour and water", "FLOUR AND WATER"

        eqtest("I WORK AT Ⓚ", "i work at ⓚ");   // "I WORK AT \N{CIRCLED LATIN CAPITAL LETTER K}", "i work at \N{CIRCLED LATIN SMALL LETTER K}"
        eqtest("HENRY Ⅷ", "henry ⅷ");           // "HENRY \N{ROMAN NUMERAL EIGHT}", "henry \N{SMALL ROMAN NUMERAL EIGHT}"

        // Classic German ligature issues
        eqtest("tschüß", "TSCHÜSS");    // "tsch\N{LATIN SMALL LETTER U WITH DIAERESIS}\N{LATIN SMALL LETTER SHARP S}", "TSCH\N{LATIN CAPITAL LETTER U WITH DIAERESIS}SS"
        // the capital version is from Unicode 5.1, which is very old now
        eqtest("weiß", "WEIẞ");         // "wei\N{LATIN SMALL LETTER SHARP S}", "WEI\N{LATIN CAPITAL LETTER SHARP S}"
        eqtest("weiß", "WEISS");         // "wei\N{LATIN SMALL LETTER SHARP S}", "WEISS"
        eqtest("weiss", "WEIẞ");         // "weiss", "WEI\N{LATIN CAPITAL LETTER SHARP S}"

        // English ligature issues
        eqtest("poſt", "post");         // "po\N{LATIN SMALL LETTER LONG S}t", "post"
        eqtest("poﬅ", "post");          // "po\N{LATIN SMALL LIGATURE LONG S T}", "post"

        // Deseret is the only non-BMP case-changing script (as of Unicode 6.0)
        eqtest("𐐔𐐇𐐝𐐀𐐡𐐇𐐓", "𐐼𐐯𐑅𐐨𐑉𐐯𐐻");   // "\N{DESERET CAPITAL LETTER DEE}\N{DESERET CAPITAL LETTER SHORT E}\N{DESERET CAPITAL LETTER ES}\N{DESERET CAPITAL LETTER LONG I}\N{DESERET CAPITAL LETTER ER}\N{DESERET CAPITAL LETTER SHORT E}\N{DESERET CAPITAL LETTER TEE}", "\N{DESERET SMALL LETTER DEE}\N{DESERET SMALL LETTER SHORT E}\N{DESERET SMALL LETTER ES}\N{DESERET SMALL LETTER LONG I}\N{DESERET SMALL LETTER ER}\N{DESERET SMALL LETTER SHORT E}\N{DESERET SMALL LETTER TEE}"

        // Greek simple casefolding tests
        eqtest("στιγμας", "στιγμασ");   // "\N{GREEK SMALL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER MU}\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER FINAL SIGMA}", "\N{GREEK SMALL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER MU}\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER SIGMA}"
        eqtest("στιγμας", "ΣΤΙΓΜΑΣ");   // "\N{GREEK SMALL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER MU}\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER FINAL SIGMA}", "\N{GREEK CAPITAL LETTER SIGMA}\N{GREEK CAPITAL LETTER TAU}\N{GREEK CAPITAL LETTER IOTA}\N{GREEK CAPITAL LETTER GAMMA}\N{GREEK CAPITAL LETTER MU}\N{GREEK CAPITAL LETTER ALPHA}\N{GREEK CAPITAL LETTER SIGMA}"

        // Greek full casefolding tests
        eqtest("ᾲ", "ᾺΙ");              // "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}", "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{GREEK CAPITAL LETTER IOTA}"
        eqtest("ᾲ στο διάολο", "Ὰͅ Στο Διάολο"); // "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI} \N{GREEK SMALL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER OMICRON} \N{GREEK SMALL LETTER DELTA}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER ALPHA WITH TONOS}\N{GREEK SMALL LETTER OMICRON}\N{GREEK SMALL LETTER LAMDA}\N{GREEK SMALL LETTER OMICRON}", "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI} \N{GREEK CAPITAL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER OMICRON} \N{GREEK CAPITAL LETTER DELTA}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER ALPHA WITH TONOS}\N{GREEK SMALL LETTER OMICRON}\N{GREEK SMALL LETTER LAMDA}\N{GREEK SMALL LETTER OMICRON}"
        eqtest("ᾲ στο διάολο", "ᾺΙ ΣΤΟ ΔΙΆΟΛΟ"); // "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI} \N{GREEK SMALL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER OMICRON} \N{GREEK SMALL LETTER DELTA}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER ALPHA WITH TONOS}\N{GREEK SMALL LETTER OMICRON}\N{GREEK SMALL LETTER LAMDA}\N{GREEK SMALL LETTER OMICRON}", "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{GREEK CAPITAL LETTER IOTA} \N{GREEK CAPITAL LETTER SIGMA}\N{GREEK CAPITAL LETTER TAU}\N{GREEK CAPITAL LETTER OMICRON} \N{GREEK CAPITAL LETTER DELTA}\N{GREEK CAPITAL LETTER IOTA}\N{GREEK CAPITAL LETTER ALPHA WITH TONOS}\N{GREEK CAPITAL LETTER OMICRON}\N{GREEK CAPITAL LETTER LAMDA}\N{GREEK CAPITAL LETTER OMICRON}"

        // Unicode 6.0.0 case-changing code point
        eqtest("ԦԦ", "ԧԧ");     // "\N{CYRILLIC CAPITAL LETTER SHHA WITH DESCENDER}\N{CYRILLIC CAPITAL LETTER SHHA WITH DESCENDER}", "\N{CYRILLIC SMALL LETTER SHHA WITH DESCENDER}\N{CYRILLIC SMALL LETTER SHHA WITH DESCENDER}"

    }

}