<i18n dev> Codereview request for 7096080: UTF8 update and new CESU-8 charset
Xueming Shen
xueming.shen at oracle.com
Wed Sep 28 20:27:13 PDT 2011
Hi,
On 9/28/2011 3:44 PM, Ulf Zibis wrote:
> Hi Sherman,
>
> 1. bug 7096080 is not visible at
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7096080
It might take a couple of days for it to show up on bugs.sun.com, but it
has exactly the same content as my previous email. In fact, I simply
copied and pasted it into the email.
> 3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results
> <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537>
>
(1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} --->
CoderResult.malformedForLength(1)
It appears the Unicode Standard now explicitly recommends returning
malformed length 2 for this scenario, which is what the UTF-8 decoder
does now.
(2) new byte[]{(byte)0xE1, (byte)0x40} --->
CoderResult.malformedForLength(1)
The proposed change already fixes this one (malformed length 1).
(3) new byte[]{(byte)0xC0} ---> CoderResult.malformedForLength(1)
Technically this is not a bug: the decoder will return malformed length
1 if you call decode(bf, cf, true). But yes, it would be desirable to
return malformed length 1 without waiting for a second byte. The
code/webrev has been updated to do just this, as "expected".
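To try the three inputs above against whatever JDK is at hand, a minimal
sketch along these lines may help. The class and method names are
hypothetical and not part of the webrev; it only uses the public
CharsetDecoder API with endOfInput set to true, and the lengths it
prints depend on the JDK version in use.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the webrev under review.
public class MalformedLengthDemo {

    // Decodes the whole input in a single call with endOfInput = true.
    // A fresh decoder reports malformed input by default (REPORT action),
    // so the CoderResult carries the malformed length.
    static CoderResult decodeOnce(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        CharBuffer out = CharBuffer.allocate(16);
        return decoder.decode(ByteBuffer.wrap(input), out, true);
    }

    public static void main(String[] args) {
        byte[][] cases = {
            {(byte) 0xE1, (byte) 0x80, (byte) 0x42},  // case (1)
            {(byte) 0xE1, (byte) 0x40},               // case (2)
            {(byte) 0xC0},                            // case (3)
        };
        for (byte[] c : cases) {
            CoderResult r = decodeOnce(c);
            System.out.println(r.isMalformed()
                    ? "malformed, length " + r.length()
                    : r.toString());
        }
    }
}
```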
Now the 2-byte sequence entry check has been updated to
} else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
...
}
and I no longer check the first byte in malformed2(), which I think has
the smallest performance impact on 2-byte sequences. I ran several
rounds of benchmark tests and did not see a significant difference. I
will try more later.
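As an illustration of what the updated entry check accepts, the sketch
below evaluates the same expression for every possible byte value. The
class and method names are hypothetical. Because b1 is a signed byte,
(b1 >> 5) == -2 selects 0xC0..0xDF, and (b1 & 0x1e) != 0 then excludes
the overlong lead bytes 0xC0 and 0xC1, leaving 0xC2..0xDF.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that enumerates the lead bytes accepted by the check.
public class TwoByteLeadCheck {

    // Same expression as the updated 2-byte sequence entry check.
    static boolean entersTwoByteBranch(byte b1) {
        return (b1 >> 5) == -2 && (b1 & 0x1e) != 0;
    }

    // Returns the unsigned values (0..255) of all bytes that pass the check.
    static List<Integer> acceptedLeadBytes() {
        List<Integer> accepted = new ArrayList<>();
        for (int v = 0; v <= 0xFF; v++) {
            if (entersTwoByteBranch((byte) v)) {
                accepted.add(v);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<Integer> a = acceptedLeadBytes();
        System.out.printf("%d lead bytes: 0x%02X..0x%02X%n",
                a.size(), a.get(0), a.get(a.size() - 1));
    }
}
```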
I'm not sure I understand the suggested b1 < -0x3e patch; I don't see
how we can simply replace
((b1 >> 5) == -2) with (b1 < -0x3e).
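Taken literally as a drop-in replacement, the two expressions select
different byte ranges, which the following sketch counts. The names are
hypothetical, and the context of the suggested patch is not shown in
this thread, so it may have been intended for a branch where some
ranges are already excluded. As a signed byte, -0x3e is 0xC2, so
b1 < -0x3e holds for 0x80..0xC1, while (b1 >> 5) == -2 holds for
0xC0..0xDF.

```java
// Hypothetical comparison of the two predicates over all byte values.
public class LeadByteCheckComparison {

    // The existing test: true for 0xC0..0xDF.
    static boolean shiftCheck(byte b1) {
        return (b1 >> 5) == -2;
    }

    // The suggested test, read literally: true for 0x80..0xC1.
    static boolean compareCheck(byte b1) {
        return b1 < -0x3e;
    }

    // Counts the byte values on which the two predicates disagree.
    static int countDisagreements() {
        int n = 0;
        for (int v = 0; v <= 0xFF; v++) {
            byte b = (byte) v;
            if (shiftCheck(b) != compareCheck(b)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("disagreements: " + countDisagreements());
    }
}
```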
Anyway, I hope you are now motivated to take a deep look at the code :-)
and maybe want to run all your tests to confirm the change is fine.
This change does expose an existing bug in StreamDecoder: it fails to
replace "malformed" input when a "leading byte" is at the end of the
stream. This is why I commented out the line in Errors. I will file a
bug for this one later.
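The scenario can be reproduced roughly as follows, assuming the stream
is read through InputStreamReader, which delegates to StreamDecoder
internally and replaces malformed input. The class name and the sample
bytes are hypothetical. Whether the trailing 0xE1 shows up as U+FFFD in
the output depends on the JDK version in use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical reproduction: a UTF-8 lead byte as the last byte of a stream.
public class TrailingLeadByteDemo {

    // Reads every char the reader produces for the given bytes.
    static String readAll(byte[] input) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(input), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'a' followed by 0xE1, the lead byte of a 3-byte sequence.
        String s = readAll(new byte[] {'a', (byte) 0xE1});
        for (char ch : s.toCharArray()) {
            System.out.printf("U+%04X%n", (int) ch);
        }
    }
}
```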
> 5. IMHO charset CESU-8 should be hosted in extended-charsets,
> otherwise it should be added to java.nio.StandardCharsets
>
We have lots of charsets provided via the "standard charset provider"
(in rt.jar) but not listed in StandardCharsets.
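For a rough view of how many charsets fall outside StandardCharsets, a
sketch like the one below lists every charset the running JDK reports
minus the six StandardCharsets constants. The class name is
hypothetical. Note that Charset.availableCharsets() covers all
installed providers, so it does not by itself distinguish the standard
provider from extended charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical listing of charsets that have no StandardCharsets constant.
public class UnlistedCharsets {

    // Canonical names of all available charsets except the six constants.
    static Set<String> notInStandardCharsets() {
        Set<String> names = new TreeSet<>(Charset.availableCharsets().keySet());
        for (Charset cs : Arrays.asList(
                StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1,
                StandardCharsets.UTF_8, StandardCharsets.UTF_16BE,
                StandardCharsets.UTF_16LE, StandardCharsets.UTF_16)) {
            names.remove(cs.name());
        }
        return names;
    }

    public static void main(String[] args) {
        Set<String> names = notInStandardCharsets();
        System.out.println(names.size() + " charsets without a constant");
        System.out.println("CESU-8 supported: " + Charset.isSupported("CESU-8"));
    }
}
```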
-Sherman