<i18n dev> Java encoder errors

Mon Sep 19 11:45:18 PDT 2011

Does anybody know anything about the Java UTF-8 encoder?  It seems to be broken
in three ways:

  * First, it allows intermixed CESU-8 and UTF-8 even though you
    specify UTF-8, when it should be throwing an exception on the CESU-8.

  * Second, it allows unpaired surrogates, which the standard also forbids.

  * Third, it lets in the 66 noncharacter code points that the Unicode
    Standard says "shall not" be used.

The charset encoders and decoders tend to be a bit finicky on whether they 
throw proper exceptions or not, so I'll show you exactly what I'm using:

    import java.io.*;
    import java.nio.charset.Charset;
    public class utf8test {
         public static void main(String argv[])
            throws IOException
         {
             BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in, Charset.forName("UTF-8").newDecoder()));
             PrintWriter stdout = new PrintWriter(new OutputStreamWriter(System.out, Charset.forName("UTF-8").newEncoder()), true);
             String line;
             while ((line = stdin.readLine()) != null) {
                stdout.println(line);
                // Step by code point, not by UTF-16 code unit.
                for (int i = 0; i < line.length(); ) {
                    int cp = line.codePointAt(i);
                    if (cp < 32 || cp > 126) {
                        stdout.printf("\\x{%05X}", cp);
                    } else {
                        stdout.printf("%c", cp);
                    }
                    i += Character.charCount(cp); // 2 for supplementary code points, else 1
                }
                stdout.printf("\n");
             }
        }
    }

I can get that code to raise an exception by feeding it purported UTF-8 that is:

   1. Invalid because it has the wrong bit pattern (e.g., 0xE9 by itself).
   2. Invalid because it has a non-shortest-form encoding (e.g., \xC0\x80 instead of \x00).

However, I cannot get it to raise an exception by feeding it purported UTF-8 that is:

   3. Invalid because it has surrogates in it, unpaired or paired.

      3a. unpaired example: \xED\xB0\x80, which would be the UTF-8 encoding of
          surrogate U+DC00.  Surrogates are not allowed.

      3b. paired example: \xED\xA0\x80\xED\xB0\x80, which would be the CESU-8
          encoding of code point U+10000; the correct UTF-8 is \xF0\x90\x80\x80.
          In fact, if you feed Java both \xED\xA0\x80\xED\xB0\x80 and
          \xF0\x90\x80\x80, Java quietly treats them as two U+10000 code points.
          It should not be doing that; it should be raising an exception.

   4. Invalid because it has one of the 66 forbidden noncharacters in it.
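For anyone who wants to try the four cases without piping bytes through stdin,
here is a minimal sketch that feeds each byte sequence to a UTF-8 decoder set
to REPORT errors.  The class name Utf8Probe and the helper decodesCleanly are
made up for illustration, and EF BF BF (U+FFFF) is just one noncharacter picked
as a sample for case 4.  It prints what your JDK does rather than assuming an
answer, since the results may differ between JDK releases.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, not part of the original test program.
class Utf8Probe {
    // Returns true if the bytes decode without an exception when the
    // decoder is told to REPORT malformed and unmappable input.
    static boolean decodesCleanly(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    static void probe(String label, int... octets) {
        byte[] bytes = new byte[octets.length];
        for (int i = 0; i < octets.length; i++) {
            bytes[i] = (byte) octets[i];
        }
        System.out.println(label + ": "
            + (decodesCleanly(bytes) ? "accepted" : "rejected"));
    }

    public static void main(String[] args) {
        probe("1.  lone 0xE9", 0xE9);
        probe("2.  non-shortest C0 80", 0xC0, 0x80);
        probe("3a. unpaired surrogate U+DC00", 0xED, 0xB0, 0x80);
        probe("3b. CESU-8 pair for U+10000",
              0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80);
        probe("4.  noncharacter U+FFFF", 0xEF, 0xBF, 0xBF);
        probe("    proper UTF-8 for U+10000", 0xF0, 0x90, 0x80, 0x80);
    }
}
```

A line that says "accepted" for 3a, 3b, or 4 shows the behavior described
above on the JDK being run.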

The 66 noncharacter code points are the 32 code points from U+FDD0 through
U+FDEF, and the 34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...
U+10FFFE, U+10FFFF (the last two code points of each of the 17 planes).
Here is what p. 24 of the published Unicode Standard says about noncharacter
code points:

    • Noncharacter code points are reserved for internal use, such as for 
      sentinel values. They should never be interchanged. They do, however,
      have well-formed representations in Unicode encoding forms and survive
      conversions between encoding forms. This allows sentinel values to be
      preserved internally across Unicode encoding forms, even though they are
      not designed to be used in open interchange.

And here is more about this matter from the Unicode Standard's chapter on 
Conformance, section 3.2, p. 59: 

    C2 A process shall not interpret a noncharacter code point as an 
       abstract character.

        • The noncharacter code points may be used internally, such as for 
          sentinel values or delimiters, but should not be exchanged publicly.

By that description, it certainly looks to me as though Java is non-conformant
because of what it does for 3a, 3b, and 4.
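In case it helps anyone checking for case 4 by hand, the 66 noncharacter code
points listed earlier can be recognized with a small predicate.  This is only a
sketch: the class name Noncharacters and the method names are invented for
illustration, and nothing like this is claimed to exist in the JDK.

```java
// Hypothetical helper class, for illustration only.
class Noncharacters {
    // True for U+FDD0..U+FDEF and for the last two code points of each
    // plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
    static boolean isNoncharacter(int cp) {
        if (cp < 0 || cp > Character.MAX_CODE_POINT) {
            return false; // not a Unicode code point at all
        }
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Counts the noncharacters across the whole code space as a sanity check.
    static int count() {
        int n = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isNoncharacter(cp)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count()); // prints 66 (32 + 2 * 17)
    }
}
```

A decoded string could then be scanned code point by code point with
isNoncharacter to decide whether to reject it.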

Does anyone know anything about this?

thanks,

--tom