<i18n dev> Codereview request for 7096080: UTF8 update and new CESU-8 charset

Wed Sep 28 20:27:13 PDT 2011

Hi,

On 9/28/2011 3:44 PM, Ulf Zibis wrote:
> Hi Sherman,
>
> 1. bug 7096080 is not visible at 
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7096080

It might take a couple of days for it to show up on bugs.sun.com, but it 
has exactly the same content as
my previous email. In fact, I simply copied and pasted it into the email.

> 3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results 
> <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537>
>

(1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> 
CoderResult.malformedForLength(1)
It appears the Unicode Standard now explicitly recommends returning 
malformed length 2 for this scenario,
which is what UTF-8 is doing now.

(2) new byte[]{(byte)0xE1, (byte)0x40} ---> 
CoderResult.malformedForLength(1)
The proposed change already fixes this one (malformed length 1).

(3) new byte[]{(byte)0xC0} ---> CoderResult.malformedForLength(1)
Technically this is not a bug: the decoder will return malformed length 
1 if you go with
decode(bf, cf, true). But yes, it would be desirable to return malformed 
length 1 without
waiting for a second byte. The code/webrev has been updated to do just 
this, as "expected".
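
For anyone who wants to reproduce the three cases above, here is a minimal sketch that feeds each byte sequence to a fresh UTF-8 decoder with endOfInput set to true and prints the CoderResult. The class name MalformedLengthProbe and the helper probe() are made up for illustration and are not part of the webrev. The reported malformed length depends on the JDK build you run it on, so the sketch only prints it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK or the webrev.
public class MalformedLengthProbe {

    // Decodes the input once with endOfInput = true and returns the raw result.
    // A new decoder reports malformed input by default (CodingErrorAction.REPORT).
    static CoderResult probe(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer bf = ByteBuffer.wrap(input);
        CharBuffer cf = CharBuffer.allocate(16);
        return decoder.decode(bf, cf, true);
    }

    public static void main(String[] args) {
        byte[][] cases = {
            {(byte) 0xE1, (byte) 0x80, (byte) 0x42}, // case (1)
            {(byte) 0xE1, (byte) 0x40},              // case (2)
            {(byte) 0xC0},                           // case (3)
        };
        for (byte[] c : cases) {
            CoderResult cr = probe(c);
            // length() is only valid for error results, so guard the call.
            String len = cr.isError() ? String.valueOf(cr.length()) : "n/a";
            System.out.println(cr + " (length " + len + ")");
        }
    }
}
```

All three inputs should come back as malformed; only the length is in question.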

Now the 2-byte sequence entry check has been updated to
} else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
...
}

and I no longer check the first byte for malformed2().

I think this has the smallest performance impact for 2-byte 
sequences. I ran several
rounds of benchmark tests and did not see a significant difference. I 
will try more later.
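
To spell out what the updated entry check accepts, here is a small standalone sketch. The class TwoByteLead and its methods are hypothetical names for illustration; only the boolean expression is taken from the snippet above. The first half of the test matches lead bytes 0xC0..0xDF, and the (b1 & 0x1e) != 0 half drops 0xC0 and 0xC1, whose payload bits can only produce code points below 0x80 (overlong forms).

```java
// Hypothetical illustration class; only the predicate mirrors the check above.
public class TwoByteLead {

    // Same expression as the updated 2-byte entry check.
    static boolean isTwoByteLead(byte b1) {
        return (b1 >> 5) == -2 && (b1 & 0x1e) != 0;
    }

    // Largest code point a 2-byte sequence starting with b1 could encode,
    // i.e. with all six payload bits of the second byte set.
    static int maxCodePoint(byte b1) {
        return ((b1 & 0x1f) << 6) | 0x3f;
    }

    public static void main(String[] args) {
        for (int i = 0xC0; i <= 0xC3; i++) {
            byte b1 = (byte) i;
            System.out.printf("0x%02X accepted=%b maxCodePoint=0x%X%n",
                    i, isTwoByteLead(b1), maxCodePoint(b1));
        }
    }
}
```

With this, 0xC0 and 0xC1 fall through to the malformed path on the first byte alone, which is also what case (3) relies on.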

I'm not sure I understand the suggested b1 < -0x3e patch. I don't see 
how we can simply replace
((b1 >> 5) == -2) with (b1 < -0x3e).
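
To make the difference concrete, the sketch below walks all 256 byte values and counts which ones satisfy each expression when taken in isolation. The class and method names are hypothetical, and this says nothing about the surrounding control flow the suggested patch may have assumed; it only shows that the two tests are not interchangeable as written.

```java
// Hypothetical comparison harness; both predicates are evaluated on their own.
public class LeadCheckComparison {

    static boolean shiftCheck(byte b1) {
        return (b1 >> 5) == -2;   // true for 0xC0..0xDF
    }

    static boolean rangeCheck(byte b1) {
        return b1 < -0x3e;        // true for 0x80..0xC1 (signed -128..-63)
    }

    // Returns {shift-only matches, range-only matches, matches of both}.
    static int[] counts() {
        int shift = 0, range = 0, both = 0;
        for (int i = 0; i < 256; i++) {
            byte b1 = (byte) i;
            boolean s = shiftCheck(b1);
            boolean r = rangeCheck(b1);
            if (s && r) both++;
            else if (s) shift++;
            else if (r) range++;
        }
        return new int[] {shift, range, both};
    }

    public static void main(String[] args) {
        int[] c = counts();
        System.out.println("shift only: " + c[0]
                + ", range only: " + c[1] + ", both: " + c[2]);
    }
}
```

The only bytes matched by both are 0xC0 and 0xC1, so a direct substitution would change which bytes enter the 2-byte branch.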

Anyway, I hope you are now motivated to take a deep look at the code :-) 
and maybe
run all your tests to confirm the change is fine.

This change does expose an existing bug/issue in StreamDecoder: it 
fails
to replace "malformed" input when a "leading byte" is at the end 
of the stream. This is why
I commented out the line in Errors. I will file a bug for this one later.

> 5. IMHO charset CESU-8 should be hosted in extended-charsets, 
> otherwise it should be added to java.nio.StandardCharsets
>

We have lots of charsets provided via the "standard charset provider" 
(in rt.jar) but not listed in StandardCharsets.

-Sherman
