<i18n dev> Codereview request for 7096080: UTF8 update and new CESU-8 charset

Xueming Shen xueming.shen at oracle.com
Sat Oct 1 23:29:11 PDT 2011


http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

Go to section 3.9, Unicode Encoding Forms, or simply search for D93.
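
For reference, the scenario quoted below (E1 80 42) can be checked with a
small sketch like the following. It is only an illustration, not code from
the webrev: the class name MalformedLengthDemo and the helper
firstMalformedLength are made up for this example. It uses the public
CharsetDecoder API and reports the length carried by the first malformed
CoderResult. On a JDK whose UTF-8 decoder follows the D93a/b best practice
I would expect it to report 2 for this input.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class MalformedLengthDemo {

    // Returns the length of the first malformed-input error, or 0 if the
    // decoder reports no malformed input.
    static int firstMalformedLength(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer out = CharBuffer.allocate(input.length);
        // endOfInput = true: the whole input is available in one buffer.
        CoderResult result = decoder.decode(ByteBuffer.wrap(input), out, true);
        return result.isMalformed() ? result.length() : 0;
    }

    public static void main(String[] args) {
        // 0xE1 starts a 3-byte sequence, 0x80 is a valid continuation,
        // 0x42 ('B') is not a continuation byte.
        byte[] input = {(byte) 0xE1, (byte) 0x80, (byte) 0x42};
        System.out.println("malformed length: " + firstMalformedLength(input));
    }
}
```

With malformed length 2, a decoder set to REPLACE emits one replacement
char for E1 80 and then decodes 0x42 normally.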

On 10/1/2011 2:21 PM, Ulf Zibis wrote:
> On 30.09.2011 22:46, Xueming Shen wrote:
>> On 09/30/2011 07:09 AM, Ulf Zibis wrote:
>>>>
>>>> (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> 
>>>> CoderResult.malformedForLength(1)
>>>> It appears the Unicode Standard now explicitly recommends returning 
>>>> malformed length 2 for this scenario, which is what UTF-8 is doing 
>>>> now.
>>> My idea behind this was that, with malformed length 1, a subsequent 
>>> call to the decode loop would return another malformed length 1, 
>>> ensuring 2 replacement chars in the output string. (Not sure if 
>>> that is expected in this corner case.)
>>
>> The Unicode Standard's "best practices" D93a/b recommend returning 2 
>> in this case.
> Can you please give me a link for D93a/b? I don't know where to find it.
>
>
>>
>>
>>> 3. Consider additionally 6795537 - UTF_8$Decoder returns wrong 
>>> results <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537>
>>>
>>>
>>>> I'm not sure I understand the suggested b1 < -0x3e patch; I don't 
>>>> see how we can simply replace
>>>> ((b1 >> 5) == -2) with (b1 < -0x3e).
>>> You have to read the b1 < -0x3e in combination with the following 
>>> b1 < -0x20. ;-)
>>>
>>> But now I have a better "if...else if" switch. :-)
>>> - saves the shift operations
>>> - only 1 comparison per case
>>> - only 1 constant to load per case
>>> - helps the compiler benefit from 1-byte constants and opcodes
>>> - much more readable
>>
>> I believe we changed from (b1 < xyz) to (b1 >> x) == -2 back in 
>> 2009(?) because
>> the benchmark showed the "shift" version to be slightly faster.
> IIRC this was only about a shift by multiples of 8 to ensure a 1-byte 
> comparison of 16/32-bit values in the double/quad-byte charsets.
>
>
>> Do you have any numbers
>> that show a difference now? My non-scientific benchmark still suggests 
>> the "shift"
>> version is faster on the -server VM, with no significant difference on 
>> the -client VM.
>>
>>   ------------------  your new switch---------------
>> (1) -server
>> Method                      Millis  Ratio
>> Decoding 1b UTF-8 :            125  1.000
>> Decoding 2b UTF-8 :           2558 20.443
>> Decoding 3b UTF-8 :           3439 27.481
>> Decoding 4b UTF-8 :           2030 16.221
>> (2) -client
>> Decoding 1b UTF-8 :            335  1.000
>> Decoding 2b UTF-8 :           1041  3.105
>> Decoding 3b UTF-8 :           2245  6.694
>> Decoding 4b UTF-8 :           1254  3.741
>>
>>   ------------------ existing "shift"---------------
>> (1) -server
>> Decoding 1b UTF-8 :            134  1.000
>> Decoding 2b UTF-8 :           1891 14.106
>> Decoding 3b UTF-8 :           2934 21.886
>> Decoding 4b UTF-8 :           2133 15.913
>> (2) -client
>> Decoding 1b UTF-8 :            341  1.000
>> Decoding 2b UTF-8 :            949  2.560
>> Decoding 3b UTF-8 :           2321  6.255
>> Decoding 4b UTF-8 :           1278  3.446
>>
> Very interesting and surprising numbers!
> The most surprising thing is that the client compiler generates faster 
> code for 2..4-byte codes. I think we should ask the HotSpot team for 
> help. As UTF-8 de/encoding is a very frequent task, HotSpot should 
> produce the best-optimized compiled code possible for it.
> Another surprise is that the 1b UTF-8 result is not the same for the 
> "new switch" and "shift" versions, as the ASCII-only loop is the same 
> in both.
> To find out why the "shift" version is a little faster than the "new 
> switch" version, it would help to see the disassembly of the 
> HotSpot-compiled code.
> A third version, using the "(b1 & 0xe0) == 0xc0"/"(b1 & 0xf0) == 
> 0xe0"/"(b1 & 0xf8) == 0xf0" pattern, would also be interesting for the 
> benchmark comparison.
>
> In my opinion it would be more meaningful to compare equal numbers of 
> 1..4-byte codes rather than equal numbers of bytes, i.e. 1000 bytes of 
> 1-byte codes against 2000 bytes of 2-byte codes against 3000 bytes of 
> 3-byte codes against 4000 bytes of 4-byte codes.
>
> We should document somewhere that the CESU-8 decoder is faster than 
> the strict UTF-8 decoder, for developers who can ensure that there 
> are no invalid surrogates in their source bytes.
>
> -Ulf
>
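
To make the lead-byte discussion above concrete, here is a minimal sketch
of the three classification styles mentioned in this thread: the existing
"shift" test, an "if...else if" comparison chain, and the mask pattern.
Several things here are assumptions and not taken from the webrev: the
class and method names are hypothetical, the comparison chain is only a
reconstruction from the two constants quoted above (b1 < -0x3e, b1 < -0x20)
extended with -0x10 and -0x08 by analogy, and the (b1 & 0x1e) != 0 check
is added so that all three variants reject 0xC0 and 0xC1. Each method
returns the sequence length implied by the lead byte, or 0 if the byte
cannot start a sequence.

```java
// Hypothetical helper class, for illustration only.
final class LeadByteDemo {

    // "shift" style: arithmetic shift of the sign-extended byte.
    static int byShift(byte b1) {
        if (b1 >= 0) return 1;                                  // 0x00..0x7F
        if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) return 2;      // 0xC2..0xDF
        if ((b1 >> 4) == -2) return 3;                          // 0xE0..0xEF
        if ((b1 >> 3) == -2) return 4;                          // 0xF0..0xF7
        return 0;
    }

    // "if...else if" style: one signed comparison per case.
    static int byCompare(byte b1) {
        if (b1 >= 0) return 1;        // 0x00..0x7F
        if (b1 < -0x3e) return 0;     // 0x80..0xC1: continuation or overlong lead
        if (b1 < -0x20) return 2;     // 0xC2..0xDF
        if (b1 < -0x10) return 3;     // 0xE0..0xEF
        if (b1 < -0x08) return 4;     // 0xF0..0xF7
        return 0;                     // 0xF8..0xFF
    }

    // Mask style: the mask discards the sign-extension bits.
    static int byMask(byte b1) {
        if ((b1 & 0x80) == 0) return 1;
        if ((b1 & 0xe0) == 0xc0 && (b1 & 0x1e) != 0) return 2;
        if ((b1 & 0xf0) == 0xe0) return 3;
        if ((b1 & 0xf8) == 0xf0) return 4;
        return 0;
    }

    public static void main(String[] args) {
        // Check that the three variants agree for every possible byte value.
        for (int i = 0; i < 256; i++) {
            byte b = (byte) i;
            int expected = byShift(b);
            if (expected != byCompare(b) || expected != byMask(b)) {
                throw new AssertionError("mismatch at 0x" + Integer.toHexString(i));
            }
        }
        System.out.println("all three variants agree");
    }
}
```

This only shows that the variants select the same byte ranges; it says
nothing about which one HotSpot compiles to faster code. Lead bytes
0xF5..0xF7 match the 4-byte bit pattern in all three variants and would
still have to be rejected by a later range check.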