<i18n dev> Java encoder errors

Mon Sep 19 14:41:49 PDT 2011

I agree with the first part, disallowing the irregular code sequences.

As to the noncharacters, it would be a horrible mistake to disallow them.

Tom, a Java code converter is far too low a level for C9. If the converter
can't handle noncharacters, it breaks all perfectly legitimate *internal*
interchange. C9 is really meant for a very high level, e.g., don't put them
into interchanged plain text, like a web page. I agree that it needs more
clarification.
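
To make the layering concrete, here is a minimal sketch of a check that a
high-level process (say, something about to publish plain text) could run,
while the converter underneath passes noncharacters through untouched. The
class and method names are hypothetical, chosen only for illustration; they
are not part of the JDK.

```java
public class NoncharacterFilter {
    // True for the 66 noncharacters: U+FDD0..U+FDEF plus the last two
    // code points of each of the 17 planes (U+xxFFFE and U+xxFFFF).
    static boolean isNoncharacter(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            return false; // not a Unicode code point at all
        }
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Returns the char index of the first noncharacter in text, or -1.
    static int indexOfNoncharacter(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isNoncharacter(cp)) {
                return i;
            }
            i += Character.charCount(cp); // step by code point, not code unit
        }
        return -1;
    }

    public static void main(String[] args) {
        // U+FFFF used as an internal sentinel between two fields.
        String internal = "key\uFFFFvalue";
        int at = indexOfNoncharacter(internal);
        if (at >= 0) {
            System.out.println("noncharacter at index " + at
                + "; fine internally, strip before open interchange");
        }
    }
}
```

The point of the design is that the decision to reject or strip is made by
the application that knows the text is leaving the process, not by the
UTF-8 <-> UTF-16 converter.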

Mark
*— Il meglio è l’inimico del bene —*
[https://plus.google.com/114199149796022210033]

On Mon, Sep 19, 2011 at 14:35, Xueming Shen <xueming.shen at oracle.com> wrote:

> Tom,
>
> Very good timing :-) I'm back on my encoding-related bugs, just fixing
> some corner cases in the new UTF-8 implementation we put back for JDK7.
>
> The surrogates part is a known issue. The Unicode Standard can simply
> change its "terms" [1] and announce that "the irregular code unit sequence
> is no longer needed", go use CESU-8 if you have to deal with it. It's not
> that easy for a platform/implementation that takes compatibility very
> seriously, such as Java, to simply break compatibility to follow. The
> current implementation still accepts the surrogates but never generates
> them. I'm not that firm on this: if everybody agrees that, after so many
> years, compatibility for irregular UTF-8 byte sequences is no longer a
> concern, we definitely can follow the "conformance request", especially
> since it appears that, from 4.0 on, the Unicode Standard clearly declares
> sequences mapped to surrogates "ill-formed". We just need more voices on
> this issue.
>
> I'm not sure, however, regarding the "forbidden noncharacters". ch03/D92
> appears to be fine with (it does not explicitly forbid) conversion between
> different Unicode encoding forms for these noncharacters. Personally I
> don't see any benefit in not allowing it. I think we have lots of Unicode
> experts :-) on this mailing list; what's the "official word" on this
> issue? But again, I doubt the Java UTF-8 can then simply drop these code
> points from the UTF16<->UTF8 conversion.
>
> -Sherman
>
> [1] http://www.unicode.org/reports/tr28/tr28-3.html#3_1_conformance
>
>
> On 09/19/2011 11:45 AM, Tom Christiansen wrote:
>
>> Does anybody know anything about the Java UTF-8 encoder?  It seems to be
>> broken in a couple (actually, three) of ways.
>>
>>   * First, it allows for intermixed CESU-8 and UTF-8 even though you
>>     specify UTF-8, when it should be throwing an exception on the CESU-8.
>>     It also allows unpaired surrogates, which is also forbidden by the
>>     standard.
>>
>>   * Second, it allows in the 66 noncharacter code points that the Unicode
>>     Standard says "shall not" be used.
>>
>> The charset encoders and decoders tend to be a bit finicky on whether they
>> throw proper exceptions or not, so I'll show you exactly what I'm using:
>>
>>     import java.io.*;
>>     import java.nio.charset.Charset;
>>     public class utf8test {
>>          public static void main(String argv[])
>>             throws IOException
>>          {
>>              BufferedReader stdin = new BufferedReader(
>>                  new InputStreamReader(System.in,
>>                      Charset.forName("UTF-8").newDecoder()));
>>              PrintWriter stdout = new PrintWriter(
>>                  new OutputStreamWriter(System.out,
>>                      Charset.forName("UTF-8").newEncoder()), true);
>>              String line;
>>              while ((line = stdin.readLine()) != null) {
>>                 stdout.println(line);
>>                 // XXX: line.length() is not the real code point length!
>>                 for (int i = 0; i < line.length(); i++) {
>>                     int cp = line.codePointAt(i);
>>                     if (cp < 32 || cp > 126) {
>>                         stdout.printf("\\x{%05X}", cp);
>>                     } else {
>>                         stdout.printf("%c", cp);
>>                     }
>>                     if (cp > Character.MAX_VALUE) {
>>                         i++; // correct for code unit != code point
>>                     }
>>                 }
>>                 stdout.printf("\n");
>>              }
>>         }
>>     }
>>
>> I can get that code to raise an exception by feeding it purported UTF-8
>> that is:
>>
>>    1. Invalid because it has the wrong bit pattern (eg, 0xE9 by itself).
>>    2. Invalid because it has a non-shortest encoding error (eg, \xC0\x80
>>       instead of \x00).
>>
>> However, I cannot get it to raise an exception by feeding it purported
>> UTF-8 that has:
>>
>>    3. Invalid because it has surrogates in it, unpaired or paired.
>>
>>       3a. unpaired example: \xED\xB0\x80, which would be the UTF-8
>>           encoding of surrogate U+DC00.  Surrogates are not allowed.
>>
>>       3b. paired example: \xED\xA0\x80\xED\xB0\x80, which would be CESU-8
>>           encoding of code point U+10000; the correct UTF-8 is
>>           \xF0\x90\x80\x80.  In fact, if you feed Java both
>>           \xED\xA0\x80\xED\xB0\x80 and \xF0\x90\x80\x80, Java quietly
>>           treats that as two U+10000 code points.  It should not be
>>           doing that; it should be raising an exception.
>>
>>    4. Invalid because it has one of the 66 forbidden noncharacters in it.
>>
>> The 66 noncharacter code points are the 32 code points between U+FDD0 and
>> U+FDEF, and the 34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...
>> U+10FFFE, U+10FFFF.  Here's something from the published Unicode
>> Standard's p.24 about noncharacter code points:
>>
>>     • Noncharacter code points are reserved for internal use, such as for
>>       sentinel values. They should never be interchanged. They do,
>>       however, have well-formed representations in Unicode encoding forms
>>       and survive conversions between encoding forms. This allows
>>       sentinel values to be preserved internally across Unicode encoding
>>       forms, even though they are not designed to be used in open
>>       interchange.
>>
>> And here is more about this matter from the Unicode Standard's chapter on
>> Conformance, section 3.2, p. 59:
>>
>>     C2 A process shall not interpret a noncharacter code point as an
>>        abstract character.
>>
>>         • The noncharacter code points may be used internally, such as for
>>           sentinel values or delimiters, but should not be exchanged
>>           publicly.
>>
>> It certainly looks to me as though, by that description, Java is
>> non-conformant because of what it does for 3a, 3b, and 4.
>>
>> Does anyone know anything about this?
>>
>> thanks,
>>
>> --tom
>>
>
>
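
For anyone who wants to replay cases 1 through 4 from Tom's message without
going through stdin, the following is a rough sketch that feeds each byte
sequence to a UTF-8 CharsetDecoder configured to report errors. The class
name and the helper methods are hypothetical. What the decoder does with
the surrogate sequences (3a, 3b) depends on the JDK in use, so the program
only prints what it observes and makes no claim about those cases.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Probe {
    // Decodes bytes as UTF-8; returns null if the decoder reports an error.
    static String decodeStrict(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(input)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Convenience: build a byte[] from int literals such as 0xED.
    static byte[] bytes(int... values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (byte) values[i];
        }
        return out;
    }

    static void probe(String label, byte[] input) {
        String decoded = decodeStrict(input);
        if (decoded == null) {
            System.out.println(label + ": rejected");
            return;
        }
        StringBuilder sb = new StringBuilder(label + ": accepted as");
        for (int i = 0; i < decoded.length(); ) {
            int cp = decoded.codePointAt(i);
            sb.append(String.format(" U+%04X", cp));
            i += Character.charCount(cp);
        }
        System.out.println(sb);
    }

    public static void main(String[] args) {
        probe("1  lone 0xE9", bytes(0xE9));
        probe("2  non-shortest C0 80", bytes(0xC0, 0x80));
        probe("3a unpaired surrogate", bytes(0xED, 0xB0, 0x80));
        probe("3b CESU-8 pair", bytes(0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80));
        probe("   real UTF-8 U+10000", bytes(0xF0, 0x90, 0x80, 0x80));
        probe("4  noncharacter U+FFFE", bytes(0xEF, 0xBF, 0xBE));
    }
}
```

Decoding a byte array directly keeps InputStreamReader and readLine out of
the picture, which makes it easier to tell whether a given sequence is
accepted or rejected by the decoder itself.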