<i18n dev> Codereview request for 7096080: UTF8 update and new CESU-8 charset

Xueming Shen xueming.shen at oracle.com
Fri Sep 30 13:46:42 PDT 2011


On 09/30/2011 07:09 AM, Ulf Zibis wrote:
>>>
>>
>> (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> 
>> CoderResult.malformedForLength(1)
>> It appears the Unicode Standard now explicitly recommends to return 
>> the malformed length 2,
>> what UTF-8 is doing now, for this scenario
> My idea behind is, that in case of malformed length 1 a consecutive 
> call to the decode loop would again return another malformed length 1, 
> to ensure 2 replacement chars in the output string. (Not sure, if that 
> is expected in this corner case.)

The Unicode Standard's "best practices" (D93a/b) recommend returning 2 in 
this case.
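
For anyone who wants to check what a given JDK reports for this input, here 
is a minimal sketch. The class and method names (MalformedLengthDemo, 
decodeOnce) are made up for illustration and are not part of the patch; it 
only uses the public java.nio.charset API. The reasoning behind "2": 0xE1 
0x80 is a valid prefix of a three-byte sequence, and 0x42 is not a 
continuation byte, so the ill-formed subsequence is two bytes long and 0x42 
is left to be decoded on its own.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class MalformedLengthDemo {

    // Runs a single decode call with error reporting enabled and returns
    // the CoderResult, so the reported malformed length can be inspected.
    static CoderResult decodeOnce(byte[] input) {
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer src = ByteBuffer.wrap(input);
        CharBuffer dst = CharBuffer.allocate(8);
        return dec.decode(src, dst, true);
    }

    public static void main(String[] args) {
        // Lead byte of a 3-byte sequence, one continuation byte, then ASCII 'B'.
        byte[] input = {(byte) 0xE1, (byte) 0x80, (byte) 0x42};
        CoderResult r = decodeOnce(input);
        if (r.isMalformed()) {
            // The length is whatever the running JDK's decoder reports.
            System.out.println("malformed, length = " + r.length());
        } else {
            System.out.println("result = " + r);
        }
    }
}
```

With CodingErrorAction.REPLACE instead of REPORT, a reported length of 2 
yields one replacement char for the pair, while a length of 1 would yield 
two, which is the difference Ulf describes above.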


> 3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results 
> <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537>
>
>
>> I'm not sure I  understand the suggested  b1 < -0x3e patch, I don't 
>> see we can simply replace
>> ((b1 >> 5) == -2) with (b1 < -0x3e).
> You must see the b1 < -0x3e in combination with the following b1 < 
> -0x20. ;-)
>
> But now I have a better "if...else if" switch. :-)
> - saves the shift operations
> - only 1 comparison per case
> - only 1 constant to load per case
> - helps compiler to benefit from 1 byte constants and op-codes
> - much better readable

I believe we changed from (b1 < xyz) to ((b1 >> x) == -2) back in 2009(?) 
because the benchmark showed the "shift" version to be slightly faster. Do 
you have any numbers that show a difference now? My non-scientific benchmark 
still suggests the "shift" version is faster on the -server VM, with no 
significant difference on the -client VM.

   ------------------  your new switch---------------
(1) -server
Method                      Millis  Ratio
Decoding 1b UTF-8 :            125  1.000
Decoding 2b UTF-8 :           2558 20.443
Decoding 3b UTF-8 :           3439 27.481
Decoding 4b UTF-8 :           2030 16.221
(2) -client
Decoding 1b UTF-8 :            335  1.000
Decoding 2b UTF-8 :           1041  3.105
Decoding 3b UTF-8 :           2245  6.694
Decoding 4b UTF-8 :           1254  3.741

   ------------------ existing "shift"---------------
(1) -server
Decoding 1b UTF-8 :            134  1.000
Decoding 2b UTF-8 :           1891 14.106
Decoding 3b UTF-8 :           2934 21.886
Decoding 4b UTF-8 :           2133 15.913
(2) -client
Decoding 1b UTF-8 :            341  1.000
Decoding 2b UTF-8 :            949  2.560
Decoding 3b UTF-8 :           2321  6.255
Decoding 4b UTF-8 :           1278  3.446
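
For reference, the two lead-byte dispatch styles being compared can be 
sketched as below. This is an illustrative reconstruction, not the actual 
patch or the actual UTF_8.java code: the class LeadByteDispatch, its method 
names, and the exact chain of comparisons are assumptions based only on the 
constants quoted in this thread (-0x3e, -0x20) and the shift test 
((b1 >> 5) == -2). Each method returns the expected sequence length for a 
sign-extended lead byte, or 0 if the byte cannot start a sequence, and 
agreeOnAllBytes() checks that the two styles classify all 256 byte values 
identically.

```java
// Hypothetical sketch of the two dispatch styles; not the JDK source.
class LeadByteDispatch {

    // "shift" style: test the high bits of the sign-extended byte.
    static int lengthByShift(int b1) {
        if (b1 >= 0) return 1;                        // 0xxxxxxx
        if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0)      // 110xxxxx, excluding 0xC0/0xC1
            return 2;
        if ((b1 >> 4) == -2) return 3;                // 1110xxxx
        if ((b1 >> 3) == -2) return 4;                // 11110xxx
        return 0;                                     // continuation or invalid lead byte
    }

    // "if...else if" style: one signed comparison against one constant per case.
    static int lengthByCompare(int b1) {
        if (b1 >= 0) return 1;                        // 0x00..0x7F
        if (b1 < -0x3e) return 0;                     // 0x80..0xC1: not a valid lead byte
        if (b1 < -0x20) return 2;                     // 0xC2..0xDF
        if (b1 < -0x10) return 3;                     // 0xE0..0xEF
        if (b1 < -0x08) return 4;                     // 0xF0..0xF7
        return 0;                                     // 0xF8..0xFF
    }

    // Returns true if both styles agree for every possible byte value.
    static boolean agreeOnAllBytes() {
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            if (lengthByShift(i) != lengthByCompare(i)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("styles agree: " + agreeOnAllBytes());
    }
}
```

The sketch only shows that the two forms are functionally interchangeable 
for lead-byte classification; which one the JIT compiles to faster code is 
what the numbers above are trying to measure.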



-sherman
