<i18n dev> Java encoder errors

Mon Sep 19 14:35:01 PDT 2011

Tom,

Very good timing :-)  I'm back on my encoding-related bugs, just fixing
some corner cases in the new UTF-8 implementation that we put back for
JDK7.

The surrogates part is a known issue. The Unicode Standard can simply
change its "terms" [1], announce that "the irregular code unit sequence
is no longer needed", and tell you to go use CESU-8 if you have to deal
with it. It's not that easy for a platform/implementation that takes
compatibility very seriously, such as Java, to simply break
compatibility in order to follow. The current implementation still
accepts the surrogates but never generates them. I'm not that firm on
this: if everybody agrees that, after so many years, compatibility for
irregular UTF-8 byte sequences is no longer a concern, we definitely can
follow the "conformance request", especially since, from 4.0 on, the
Unicode Standard clearly declares sequences mapped to surrogates to be
"ill-formed". We just need more voices on this issue.
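
To illustrate the "never generates them" half of that statement, here is
a minimal sketch using only the standard java.nio.charset API. The class
and helper names (SurrogateEncodeSketch, strictUtf8Hex) are made up for
this example and are not part of the JDK. It asks a strict UTF-8 encoder
to encode a proper surrogate pair and then a lone surrogate.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class SurrogateEncodeSketch {
    // Encodes s as UTF-8 and returns the bytes as hex, or null if the
    // encoder reports an error instead of producing output.
    static String strictUtf8Hex(String s) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer bb = enc.encode(CharBuffer.wrap(s));
            StringBuilder sb = new StringBuilder();
            while (bb.hasRemaining()) {
                sb.append(String.format("%02X", bb.get() & 0xFF));
            }
            return sb.toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // U+10000 as a well-formed surrogate pair in UTF-16.
        String paired = new String(Character.toChars(0x10000));
        // A lone low surrogate, U+DC00.
        String unpaired = String.valueOf((char) 0xDC00);

        System.out.println(strictUtf8Hex(paired));   // F0908080
        System.out.println(strictUtf8Hex(unpaired)); // null
    }
}
```

The pair comes out as the four-byte form F0 90 80 80, not as two
three-byte sequences, and the lone surrogate is reported as an error
rather than written out as ED B0 80.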

I'm not sure, however, about the "forbidden noncharacters". ch03/D92
appears to be fine with (it does not explicitly forbid) conversion
between different Unicode encoding forms for these noncharacters.
Personally I don't see any benefit in not allowing it. I think we have
lots of Unicode experts :-) on this mailing list; what is the "official
word" on this issue? But again, I doubt that the Java UTF-8 can then
simply drop these code points from the UTF16<->UTF8 conversion.
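
For anyone who wants to poke at this, here is a small sketch of what
"these code points" are and how they pass through the String
UTF16<->UTF8 conversion. The class and method names (NoncharacterSketch,
isNoncharacter, roundTrip) are invented for the example; only
String.getBytes and the String(byte[], Charset) constructor are real JDK
API. It says nothing about what the behavior ought to be.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class NoncharacterSketch {
    // True for U+FDD0..U+FDEF and for the last two code points of each
    // of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
    static boolean isNoncharacter(int cp) {
        if (cp < 0 || cp > Character.MAX_CODE_POINT) {
            return false;
        }
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Counts the noncharacters over the whole code point range.
    static int countNoncharacters() {
        int n = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isNoncharacter(cp)) {
                n++;
            }
        }
        return n;
    }

    // Converts one code point UTF-16 -> UTF-8 -> UTF-16 and returns
    // the first code point of the result.
    static int roundTrip(int cp) {
        String s = new String(Character.toChars(cp));
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        System.out.println(countNoncharacters());         // 66
        System.out.printf("%X%n", roundTrip(0xFFFE));     // FFFE
        System.out.printf("%X%n", roundTrip(0x10FFFF));   // 10FFFF
    }
}
```

The noncharacters survive the round trip unchanged, which is the
"survive conversions between encoding forms" behavior rather than the
"reject on input" behavior.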

-Sherman

[1] http://www.unicode.org/reports/tr28/tr28-3.html#3_1_conformance

On 09/19/2011 11:45 AM, Tom Christiansen wrote:
> Does anybody know anything about the Java UTF-8 encoder?  It seems to be broken
> in a couple (actually, three) of ways.
>
>    * First, it allows for intermixed CESU-8 and UTF-8 even though you
>      specify UTF-8, when it should be throwing an exception on the CESU-8.
>      It also allows unpaired surrogates, which is also forbidden by the standard.
>
>    * Second, it allows in the 66 noncharacter code points that the Unicode
>      Standard says "shall not" be used.
>
> The charset encoders and decoders tend to be a bit finicky on whether they
> throw proper exceptions or not, so I'll show you exactly what I'm using:
>
>      import java.io.*;
>      import java.nio.charset.Charset;
>      public class utf8test {
>          public static void main(String argv[])
>              throws IOException
>          {
>              BufferedReader stdin = new BufferedReader(new InputStreamReader(System.in, Charset.forName("UTF-8").newDecoder()));
>              PrintWriter stdout = new PrintWriter(new OutputStreamWriter(System.out, Charset.forName("UTF-8").newEncoder()), true);
>              String line;
>              while ((line = stdin.readLine()) != null) {
>                  stdout.println(line);
>                  for (int i = 0; i < line.length(); i++) {  // XXX: not the real code point length!
>                      int cp = line.codePointAt(i);
>                      if (cp < 32 || cp > 126) {
>                          stdout.printf("\\x{%05X}", cp);
>                      } else {
>                          stdout.printf("%c", cp);
>                      }
>                      if (cp > Character.MAX_VALUE) {
>                          i++; // correct for code unit != code point
>                      }
>                  }
>                  stdout.printf("\n");
>              }
>          }
>      }
>
> I can get that code to raise an exception by feeding it purported UTF-8 that is:
>
>     1. Invalid because it has the wrong bit pattern (eg, 0xE9 by itself).
>     2. Invalid because it has a non-shortest encoding error (eg, \xC0\x80 instead of \x00).
>
> However, I cannot get it to raise an exception by feeding it purported UTF-8 that has:
>
>     3. Invalid because it has surrogates in it, unpaired or paired.
>
>        3a. unpaired example: \xED\xB0\x80, which would be the UTF-8 encoding of
>            surrogate U+DC00.  Surrogates are not allowed.
>
>        3b. paired example: \xED\xA0\x80\xED\xB0\x80, which would be CESU-8
>            encoding of code point U+10000; the correct UTF-8 is \xF0\x90\x80\x80.
>            In fact, if you feed Java both \xED\xA0\x80\xED\xB0\x80 and
>            \xF0\x90\x80\x80, Java quietly treats that as two U+10000 code points.
>            It should not be doing that; it should be raising an exception.
>
>     4. Invalid because it has one of the 66 forbidden noncharacters in it.
>
> The 66 noncharacter code points are the 32 code points between U+FDD0 and
> U+FDEF, and the 34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...
> U+10FFFE, U+10FFFF.  Here's something from the published Unicode Standard's
> p.24 about noncharacter code points:
>
>      • Noncharacter code points are reserved for internal use, such as for
>        sentinel values. They should never be interchanged. They do, however,
>        have well-formed representations in Unicode encoding forms and survive
>        conversions between encoding forms. This allows sentinel values to be
>        preserved internally across Unicode encoding forms, even though they are
>        not designed to be used in open interchange.
>
> And here is more about this matter from the Unicode Standard's chapter on
> Conformance, section 3.2, p. 59:
>
>      C2 A process shall not interpret a noncharacter code point as an
>         abstract character.
>
>          • The noncharacter code points may be used internally, such as for
>            sentinel values or delimiters, but should not be exchanged publicly.
>
> It certainly looks to me as though, by that description, Java is non-conformant
> because of what it does for 3a, 3b, and 4.
>
> Does anyone know anything about this?
>
> thanks,
>
> --tom
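
The byte sequences from the numbered cases in the quoted message can
also be tried without going through stdin. The following is a minimal
sketch, assuming only the standard CharsetDecoder API; the class and
method names (StrictDecodeSketch, decodeStrict, describe) are invented
here. Cases 1 and 2 are the ones reported above as raising an exception.
No result is claimed for 3a and 3b, since that is exactly the behavior
under discussion and may differ between JDK releases.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class StrictDecodeSketch {
    // Decodes the given byte values as UTF-8 with error reporting on.
    // Returns the decoded text, or null if the decoder rejects the input.
    static String decodeStrict(int... bytes) {
        byte[] raw = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            raw[i] = (byte) bytes[i];
        }
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return dec.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Renders a decode result as "rejected" or as a list of code points.
    static String describe(String decoded) {
        if (decoded == null) {
            return "rejected";
        }
        StringBuilder sb = new StringBuilder("accepted:");
        for (int i = 0; i < decoded.length(); ) {
            int cp = decoded.codePointAt(i);
            sb.append(String.format(" U+%04X", cp));
            i += Character.charCount(cp); // step by code point, not code unit
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("1:  " + describe(decodeStrict(0xE9)));        // rejected
        System.out.println("2:  " + describe(decodeStrict(0xC0, 0x80)));  // rejected
        System.out.println("ok: " + describe(
                decodeStrict(0xF0, 0x90, 0x80, 0x80)));       // accepted: U+10000
        // Outcome depends on the JDK in use, so nothing is asserted here.
        System.out.println("3a: " + describe(decodeStrict(0xED, 0xB0, 0x80)));
        System.out.println("3b: " + describe(
                decodeStrict(0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80)));
    }
}
```

Setting CodingErrorAction.REPORT explicitly is redundant for a freshly
created decoder, but it makes the intent visible; the String and
InputStreamReader(InputStream, String) conversions replace bad input
silently instead.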