<i18n dev> Java encoder errors
Mark Davis ☕
mark at macchiato.com
Mon Sep 19 14:41:49 PDT 2011
I agree with the first part, disallowing the irregular code sequences.
As to the noncharacters, it would be a horrible mistake to disallow them.
Tom, a Java code converter is far too low a level for C9; if the converter
can't handle them, it screws up all perfectly legitimate *internal*
interchange. C9 is really for a very high level, e.g. don't put them into
interchanged plain text, like a web page. I agree that it needs more
clarification.
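
To illustrate the internal-interchange point, here is a minimal sketch that
round-trips a few noncharacters between UTF-16 and UTF-8 using only standard
JDK calls. The class and method names are made up for this example, and it
says nothing about what the converter *should* do, only what passing the code
points through looks like.

```java
import java.nio.charset.StandardCharsets;

final class NoncharRoundTrip {
    /** Returns the UTF-8 bytes of one code point as upper-case hex. */
    static String utf8Hex(int codePoint) {
        String utf16 = new String(Character.toChars(codePoint));
        byte[] utf8 = utf16.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes one code point to UTF-8, decodes it again, returns the result. */
    static int roundTrip(int codePoint) {
        String utf16 = new String(Character.toChars(codePoint));
        byte[] utf8 = utf16.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8).codePointAt(0);
    }

    public static void main(String[] args) {
        int[] samples = {0xFDD0, 0xFFFE, 0xFFFF, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %s -> U+%04X%n",
                    cp, utf8Hex(cp), roundTrip(cp));
        }
    }
}
```

If the converter rejected these code points, a sentinel such as U+FFFF stored
in a Java string could not be written out and read back within one system.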
Mark
*— Il meglio è l’inimico del bene —*
[https://plus.google.com/114199149796022210033]
On Mon, Sep 19, 2011 at 14:35, Xueming Shen <xueming.shen at oracle.com> wrote:
> Tom,
>
> Very good timing :-) I'm back on my encoding-related bugs, just fixing some
> corner cases in the new UTF-8 implementation we putback in for JDK7.
>
> The surrogates part is a known issue. The Unicode Standard can simply change
> its "terms" [1], announce that "the irregular code unit sequence is no
> longer needed", and tell you to go use CESU-8 if you have to deal with it.
> It's not that easy for a platform/implementation that takes compatibility
> very seriously, such as Java, to simply break compatibility to follow. The
> current implementation still accepts the surrogates but never generates
> them. I'm not that firm on this: if everybody agrees that, after so many
> years, compatibility for irregular UTF-8 byte sequences is no longer a
> concern, we definitely can follow the "conformance request", especially
> since it appears that, from version 4.0 on, the Unicode Standard clearly
> declares sequences that map to surrogates to be "ill-formed". We just need
> more voices on this issue.
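
For readers unfamiliar with the "irregular" sequences under discussion, the
following sketch derives the CESU-8 style bytes for a supplementary code
point by encoding each UTF-16 surrogate as its own three-byte sequence. It is
a hand-rolled illustration with made-up names, not the JDK's converter code,
and it assumes the argument is a supplementary code point (above U+FFFF).

```java
final class Cesu8Sketch {
    /** Appends the three-byte encoding of one 16-bit unit in 0x0800..0xFFFF. */
    static void appendThreeByte(StringBuilder hex, int unit) {
        int b1 = 0xE0 | (unit >> 12);          // 1110xxxx
        int b2 = 0x80 | ((unit >> 6) & 0x3F);  // 10xxxxxx
        int b3 = 0x80 | (unit & 0x3F);         // 10xxxxxx
        hex.append(String.format("%02X%02X%02X", b1, b2, b3));
    }

    /** Hex of the six-byte CESU-8 form of a supplementary code point. */
    static String cesu8Hex(int supplementary) {
        // Character.toChars yields the high and low surrogate for the code point.
        char[] units = Character.toChars(supplementary);
        StringBuilder hex = new StringBuilder();
        for (char unit : units) {
            appendThreeByte(hex, unit);
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(cesu8Hex(0x10000));
    }
}
```

For U+10000 this yields the six bytes ED A0 80 ED B0 80, whereas well-formed
UTF-8 for the same code point is the four bytes F0 90 80 80.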
>
> I'm not sure, however, regarding the "forbidden noncharacters". ch03/D92
> appears to be fine with (it does not explicitly forbid) conversion between
> different Unicode encoding forms for these noncharacters. Personally I
> don't see any benefit in not allowing it. I think we have lots of Unicode
> experts :-) on this mailing list; what's the "official word" on this issue?
> But again, I doubt the Java UTF-8 can then simply drop these code points
> from the UTF16<->UTF8 conversion.
>
> -Sherman
>
> [1] http://www.unicode.org/reports/tr28/tr28-3.html#3_1_conformance
>
>
> On 09/19/2011 11:45 AM, Tom Christiansen wrote:
>
>> Does anybody know anything about the Java UTF-8 encoder? It seems to be
>> broken in a couple (actually, three) of ways.
>>
>> * First, it allows for intermixed CESU-8 and UTF-8 even though you
>> specify UTF-8, when it should be throwing an exception on the CESU-8.
>> It also allows unpaired surrogates, which is also forbidden by the
>> standard.
>>
>> * Second, it allows in the 66 noncharacter code points that the Unicode
>> Standard says "shall not" be used.
>>
>> The charset encoders and decoders tend to be a bit finicky on whether they
>> throw proper exceptions or not, so I'll show you exactly what I'm using:
>>
>> import java.io.*;
>> import java.nio.charset.Charset;
>>
>> public class utf8test {
>>     public static void main(String[] argv)
>>         throws IOException
>>     {
>>         // Supplying a CharsetDecoder (rather than a charset name) makes
>>         // the reader report malformed input instead of replacing it.
>>         BufferedReader stdin = new BufferedReader(
>>             new InputStreamReader(System.in,
>>                 Charset.forName("UTF-8").newDecoder()));
>>         PrintWriter stdout = new PrintWriter(
>>             new OutputStreamWriter(System.out,
>>                 Charset.forName("UTF-8").newEncoder()), true);
>>         String line;
>>         while ((line = stdin.readLine()) != null) {
>>             stdout.println(line);
>>             // XXX: line.length() is not the real code point length!
>>             for (int i = 0; i < line.length(); i++) {
>>                 int cp = line.codePointAt(i);
>>                 if (cp < 32 || cp > 126) {
>>                     stdout.printf("\\x{%05X}", cp);
>>                 } else {
>>                     stdout.printf("%c", cp);
>>                 }
>>                 if (cp > Character.MAX_VALUE) {
>>                     i++;  // correct for code unit != code point
>>                 }
>>             }
>>             stdout.printf("\n");
>>         }
>>     }
>> }
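
The escaping loop in the program above has to compensate by hand for code
points that occupy two UTF-16 code units. A possible alternative, sketched
here with a made-up class name, advances the index by
`Character.charCount(cp)` and produces the same `\x{...}` output format.

```java
final class CodePointEscaper {
    /** Escapes every code point outside printable ASCII as \x{HHHHH}. */
    static String escape(String line) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < line.length(); ) {
            int cp = line.codePointAt(i);
            if (cp < 32 || cp > 126) {
                out.append(String.format("\\x{%05X}", cp));
            } else {
                out.append((char) cp);
            }
            // Step over one or two code units, depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("A\uD800\uDC00\u00E9"));
    }
}
```

An unpaired surrogate in the string is returned by `codePointAt` as is, so it
shows up in the output as its own escaped value.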
>>
>> I can get that code to raise an exception by feeding it purported UTF-8
>> that is:
>>
>> 1. Invalid because it has the wrong bit pattern (eg, 0xE9 by itself).
>> 2. Invalid because it has a non-shortest encoding error (eg, \xC0\x80
>> instead of \x00).
>>
>> However, I cannot get it to raise an exception by feeding it purported
>> UTF-8 that has:
>>
>> 3. Invalid because it has surrogates in it, unpaired or paired.
>>
>>    3a. unpaired example: \xED\xB0\x80, which would be the UTF-8 encoding
>>        of surrogate U+DC00. Surrogates are not allowed.
>>
>>    3b. paired example: \xED\xA0\x80\xED\xB0\x80, which would be the CESU-8
>>        encoding of code point U+10000; the correct UTF-8 is
>>        \xF0\x90\x80\x80. In fact, if you feed Java both
>>        \xED\xA0\x80\xED\xB0\x80 and \xF0\x90\x80\x80, Java quietly treats
>>        that as two U+10000 code points. It should not be doing that; it
>>        should be raising an exception.
>>
>> 4. Invalid because it has one of the 66 forbidden noncharacters in it.
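
The four cases above can also be probed without stdin. The sketch below feeds
each byte sequence to a `CharsetDecoder` configured to report errors and
prints whether decoding succeeded. The class and helper names are invented
for this example. No expected output is given for cases 3a and 3b, since the
thread itself indicates that the treatment of surrogates is
implementation-dependent and may differ between JDK releases.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Probe {
    /** Returns true if the JDK's UTF-8 decoder accepts the bytes without error. */
    static boolean decodesCleanly(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(input));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Convenience: builds a byte array from int literals such as 0xED. */
    static byte[] bytes(int... values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (byte) values[i];
        }
        return out;
    }

    static void report(String label, byte[] input) {
        System.out.println(label + ": "
                + (decodesCleanly(input) ? "accepted" : "rejected"));
    }

    public static void main(String[] args) {
        report("1  lone 0xE9", bytes(0xE9));
        report("2  non-shortest C0 80", bytes(0xC0, 0x80));
        report("3a unpaired surrogate", bytes(0xED, 0xB0, 0x80));
        report("3b CESU-8 pair", bytes(0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80));
        report("4  noncharacter U+FFFF", bytes(0xEF, 0xBF, 0xBF));
        report("   well-formed U+10000", bytes(0xF0, 0x90, 0x80, 0x80));
    }
}
```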
>>
>> The 66 noncharacter code points are the 32 code points between U+FDD0 and
>> U+FDEF, and the 34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...
>> U+10FFFE, U+10FFFF. Here's something from the published Unicode Standard's
>> p.24 about noncharacter code points:
>>
>>   • Noncharacter code points are reserved for internal use, such as for
>>     sentinel values. They should never be interchanged. They do, however,
>>     have well-formed representations in Unicode encoding forms and survive
>>     conversions between encoding forms. This allows sentinel values to be
>>     preserved internally across Unicode encoding forms, even though they
>>     are not designed to be used in open interchange.
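
The set of 66 noncharacters described above is small enough to test with two
comparisons. The sketch below, with a made-up class name, checks membership
and counts the set by brute force as a sanity check on the 32 + 34 figure.

```java
final class Noncharacters {
    /** True for U+FDD0..U+FDEF and for the last two code points of each plane. */
    static boolean isNoncharacter(int cp) {
        if (cp < 0 || cp > 0x10FFFF) {
            return false;  // not a Unicode code point at all
        }
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Counts noncharacters across the whole code space. */
    static int count() {
        int n = 0;
        for (int cp = 0; cp <= 0x10FFFF; cp++) {
            if (isNoncharacter(cp)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("noncharacters: " + count());
    }
}
```

Whether a converter should act on such a test is exactly what this thread
disputes; the check only identifies the code points.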
>>
>> And here is more about this matter from the Unicode Standard's chapter on
>> Conformance, section 3.2, p. 59:
>>
>> C2 A process shall not interpret a noncharacter code point as an
>> abstract character.
>>
>>     • The noncharacter code points may be used internally, such as for
>>       sentinel values or delimiters, but should not be exchanged publicly.
>>
>> By that description, it certainly looks to me as though Java is
>> non-conformant because of what it does for 3a, 3b, and 4.
>>
>> Does anyone know anything about this?
>>
>> thanks,
>>
>> --tom
>>
>
>