<i18n dev> Codereview request for 7096080: UTF8 update and new CESU-8 charset

Fri Sep 30 07:09:33 PDT 2011

Hi,

On 29.09.2011 05:27, Xueming Shen wrote:
> Hi,
>
> On 9/28/2011 3:44 PM, Ulf Zibis wrote
>> 3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results 
>> <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537>
>>
>
> (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> CoderResult.malformedForLength(1)
> It appears the Unicode Standard now explicitly recommends returning malformed length 2
> for this scenario, which is what UTF-8 is doing now.
My idea was that with malformed length 1, a subsequent call to the decode loop would again 
return malformed length 1, so that 2 replacement chars end up in the output string. (I'm not 
sure whether that is expected in this corner case.)
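
To make the corner case concrete, here is a minimal sketch (not the decoder's own code) that feeds the three bytes from (1) to the JDK's UTF-8 decoder and looks at the reported malformed length. The class and method names are made up for illustration. With malformed length 1 the replacing decoder would emit two replacement chars before the 'B'; with length 2 it would emit one. Which of the two you see depends on the JDK in use.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK or of the patch under review.
class MalformedLengthDemo {

    // Returns the malformed length reported for the first error, or 0 if none.
    static int malformedLength(byte[] input) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CoderResult r = dec.decode(ByteBuffer.wrap(input), CharBuffer.allocate(8), true);
        return r.isMalformed() ? r.length() : 0;
    }

    // Decodes with the default REPLACE behaviour of the String constructor.
    static String withReplacement(byte[] input) {
        return new String(input, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] input = {(byte) 0xE1, (byte) 0x80, (byte) 0x42};
        System.out.println("malformed length: " + malformedLength(input));
        // One U+FFFD per reported malformed sequence, followed by 'B'.
        System.out.println("decoded chars:    " + withReplacement(input).length());
    }
}
```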

> I'm not sure I understand the suggested b1 < -0x3e patch; I don't see how we can simply replace
> ((b1 >> 5) == -2) with (b1 < -0x3e).
You have to read the b1 < -0x3e together with the b1 < -0x20 that follows it. ;-)

But now I have a better "if...else if" switch. :-)
- saves the shift operations
- only 1 comparison per case
- only 1 constant to load per case
- helps the compiler benefit from 1-byte constants and opcodes
- much more readable

                byte b1 = sa[sp]; // helps the compiler use 1-byte opcodes and constants
//              byte b1 = src.get(); // helps the compiler use 1-byte opcodes and constants
Byte1Switch:    if (b1 >= 0) {
                    // 1 byte, 7 bits: 0xxxxxxx
                    ...return x;
                } else if (b1 < (byte)0xe0) {
                    // 2 bytes, 11 bits: 110xxxxx 10xxxxxx
                    if (b1 < (byte)0xc2) // b1 < C2 is not legal
                        break Byte1Switch;
                    ...return x;
                } else if (b1 < (byte)0xf0) {
                    // 3 bytes, 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
                    ...return x;
                } else if (b1 < (byte)0xf8) {
                    // 4 bytes, 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                    ...return x;
                }
                return malformed(src, sp, dst, dp, 1);
//              return malformed(src, mark, 1);
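
To check that the signed comparisons really select the same lead bytes as the shift tests, the two styles can be compared over all 256 byte values. The sketch below is only an illustration: the class and method names are hypothetical, and the shift-based variant is my assumption of how the ((b1 >> 5) == -2) style extends to the 3- and 4-byte cases. It classifies by lead byte only and returns 0 where the snippet above would fall through to malformed(...).

```java
// Hypothetical helper class for comparing the two classification styles.
class LeadByteCheck {

    // Shift-based style (assumed form): sequence length for lead byte b1, 0 if illegal.
    static int lengthByShift(byte b1) {
        if (b1 >= 0) return 1;                                  // 0xxxxxxx
        if ((b1 >> 5) == -2) return (b1 & 0x1e) != 0 ? 2 : 0;   // 110xxxxx, C0/C1 rejected
        if ((b1 >> 4) == -2) return 3;                          // 1110xxxx
        if ((b1 >> 3) == -2) return 4;                          // 11110xxx
        return 0;
    }

    // Comparison-based style: one signed comparison against a byte constant per case.
    static int lengthByCompare(byte b1) {
        if (b1 >= 0) return 1;                                  // 0x00..0x7F
        if (b1 < (byte) 0xe0) return b1 < (byte) 0xc2 ? 0 : 2;  // 0x80..0xC1 illegal, 0xC2..0xDF
        if (b1 < (byte) 0xf0) return 3;                         // 0xE0..0xEF
        if (b1 < (byte) 0xf8) return 4;                         // 0xF0..0xF7
        return 0;                                               // 0xF8..0xFF
    }

    // Counts the byte values on which the two variants disagree.
    static int countMismatches() {
        int mismatches = 0;
        for (int i = 0; i < 256; i++) {
            byte b = (byte) i;
            if (lengthByShift(b) != lengthByCompare(b)) mismatches++;
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches());
    }
}
```

The comparisons work because a Java byte is signed: every non-ASCII lead byte is negative, so the ranges 0x80..0xDF, 0xE0..0xEF and 0xF0..0xF7 are contiguous and ordered when read as signed values.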

>
> Anyway, I hope now you are motivated to take a deep look at the code:-) and maybe want to
> run all your tests to confirm the change is fine.

At the moment I don't have a properly working system, so my contribution has to remain limited. :-(

About motivation:
For me it's rather frustrating to see a bug from an external volunteer contributor closed as 
"Will Not Fix", and then some time later an Oracle employee creates an "equivalent" new bug from 
scratch without even referring to the existing ones, e.g. 6798514, 6795537.

Additionally, I think now is the right time to re-evaluate bug 4508058 - UTF-8 encoding does not 
recognize initial BOM.

Cheers,

-Ulf
