Codereview request for 7096080: UTF8 update and new CESU-8 charset
Ulf Zibis
Ulf.Zibis at gmx.de
Sat Oct 1 21:21:33 UTC 2011
Am 30.09.2011 22:46, schrieb Xueming Shen:
> On 09/30/2011 07:09 AM, Ulf Zibis wrote:
>>>
>>> (1) new byte[]{(byte)0xE1, (byte)0x80, (byte)0x42} ---> CoderResult.malformedForLength(1)
>>> It appears the Unicode Standard now explicitly recommends returning malformed length 2,
>>> which is what UTF-8 does now, for this scenario
>> My idea was that, in case of malformed length 1, a subsequent call to the decode loop
>> would again return another malformed length 1, ensuring 2 replacement chars in the output
>> string. (Not sure if that is expected in this corner case.)
>
> Unicode Standard's "best practices" D93a/b recommends to return 2 in this case.
Can you please give me a link for D93a/b? I don't know where to find it.
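For reference, the disputed behavior can be observed directly. A small probe (my own sketch, assuming a JDK whose UTF-8 decoder already follows the D93a/b "maximal subpart" recommendation, as stated above) decodes the byte sequence in question and prints the reported malformed length:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class MalformedProbe {
    public static void main(String[] args) {
        // 0xE1 starts a 3-byte sequence and 0x80 is a valid continuation,
        // but 0x42 ('B') is not -- the maximal valid subpart is E1 80.
        ByteBuffer in = ByteBuffer.wrap(
                new byte[] { (byte) 0xE1, (byte) 0x80, (byte) 0x42 });
        CharBuffer out = CharBuffer.allocate(4);
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder();
        CoderResult cr = dec.decode(in, out, true);
        // Per D93a/b the whole maximal subpart E1 80 is reported as one
        // malformed sequence of length 2 (one replacement char, not two).
        System.out.println(cr.isMalformed() + " " + cr.length());
    }
}
```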
>
>
>> 3. Consider additionally 6795537 - UTF_8$Decoder returns wrong results
>> <http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6795537>
>>
>>
>>> I'm not sure I understand the suggested b1 < -0x3e patch, I don't see we can simply replace
>>> ((b1 >> 5) == -2) with (b1 < -0x3e).
>> You must see the b1 < -0x3e in combination with the following b1 < -0x20. ;-)
>>
>> But now I have a better "if...else if" switch. :-)
>> - saves the shift operations
>> - only 1 comparison per case
>> - only 1 constant to load per case
>> - helps compiler to benefit from 1 byte constants and op-codes
>> - much better readable
>
> I believe we changed from (b1 < xyz) to (b1 >> x) == -2 back in 2009(?) because
> the benchmark showed the "shift" version is slightly faster.
IIRC this was only about a shift by multiples of 8, to ensure a 1-byte comparison of 16/32-bit
values in the double/quad-byte charsets.
> Do you have any numbers
> showing a difference now? My non-scientific benchmark still suggests the "shift"
> type is faster on the -server vm, with no significant difference on the -client vm.
>
> ------------------ your new switch---------------
> (1) -server
> Method Millis Ratio
> Decoding 1b UTF-8 : 125 1.000
> Decoding 2b UTF-8 : 2558 20.443
> Decoding 3b UTF-8 : 3439 27.481
> Decoding 4b UTF-8 : 2030 16.221
> (2) -client
> Decoding 1b UTF-8 : 335 1.000
> Decoding 2b UTF-8 : 1041 3.105
> Decoding 3b UTF-8 : 2245 6.694
> Decoding 4b UTF-8 : 1254 3.741
>
> ------------------ existing "shift"---------------
> (1) -server
> Decoding 1b UTF-8 : 134 1.000
> Decoding 2b UTF-8 : 1891 14.106
> Decoding 3b UTF-8 : 2934 21.886
> Decoding 4b UTF-8 : 2133 15.913
> (2) -client
> Decoding 1b UTF-8 : 341 1.000
> Decoding 2b UTF-8 : 949 2.560
> Decoding 3b UTF-8 : 2321 6.255
> Decoding 4b UTF-8 : 1278 3.446
>
Very interesting and surprising numbers!
The most surprising is that the client compiler generates faster code for the 2..4-byte cases. I
think we should ask the HotSpot team for help. As UTF-8 de/encoding is a very frequent task, HotSpot
should generate code that is optimized as well as possible for UTF-8 de/encoding.
Another surprise is that the 1b UTF-8 benchmark differs between the "new switch" and "shift"
versions, as the ASCII-only loop is identical in both.
To solve the mystery of why the "shift" version is a little faster than the "new switch" version, it
would be helpful to see a disassembly of the HotSpot-compiled code.
A third version, using the "(b1 & 0xe0) == 0xc0" / "(b1 & 0xf0) == 0xe0" / "(b1 & 0xf8) == 0xf0"
pattern, would also be interesting for the benchmark comparison.
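Before benchmarking, it is worth checking that the three candidate predicates really classify lead bytes identically. A small sketch (class and variable names are mine) runs all 256 byte values through the "shift" pattern, the "mask" pattern, and signed-byte range comparisons with 1-byte constants:

```java
public class LeadByteCheck {
    public static void main(String[] args) {
        for (int i = 0; i < 256; i++) {
            byte b1 = (byte) i;
            // "shift" pattern, as in the existing decoder
            boolean shift2 = (b1 >> 5) == -2;   // 110xxxxx: 2-byte lead
            boolean shift3 = (b1 >> 4) == -2;   // 1110xxxx: 3-byte lead
            boolean shift4 = (b1 >> 3) == -2;   // 11110xxx: 4-byte lead
            // "mask" pattern
            boolean mask2 = (b1 & 0xe0) == 0xc0;
            boolean mask3 = (b1 & 0xf0) == 0xe0;
            boolean mask4 = (b1 & 0xf8) == 0xf0;
            // signed-byte range comparisons (1-byte constants only)
            boolean range2 = b1 >= -0x40 && b1 < -0x20;  // 0xC0..0xDF
            boolean range3 = b1 >= -0x20 && b1 < -0x10;  // 0xE0..0xEF
            boolean range4 = b1 >= -0x10 && b1 < -0x08;  // 0xF0..0xF7
            if (shift2 != mask2 || mask2 != range2
                    || shift3 != mask3 || mask3 != range3
                    || shift4 != mask4 || mask4 != range4)
                throw new AssertionError("mismatch at 0x" + Integer.toHexString(i));
        }
        System.out.println("all 256 byte values classified identically");
    }
}
```

All three rely on the fact that Java bytes are signed, so e.g. 0xC0..0xDF appear as -0x40..-0x21; only the compiled code differs, which is exactly what the benchmark should isolate.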
In my opinion it would be more significant to compare equal counts of 1..4-byte codes than equal
byte counts, i.e. 1000 bytes of 1-byte codes against 2000 bytes of 2-byte codes against 3000 bytes
of 3-byte codes against 4000 bytes of 4-byte codes (1000 code points each).
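Such input sets could be built as follows (a sketch; the class name and the sample code points are my own choices, picked so each encodes to exactly 1, 2, 3 and 4 UTF-8 bytes):

```java
import java.nio.charset.StandardCharsets;

public class EqualCodePointData {
    // Sample code points whose UTF-8 encodings are 1, 2, 3 and 4 bytes long:
    // 'A' (U+0041), 'ä' (U+00E4), '€' (U+20AC), U+10400 (Deseret capital long I).
    static final int[] SAMPLES = { 'A', 0x00E4, 0x20AC, 0x10400 };

    // Build an input of 'count' code points, each nBytes long in UTF-8,
    // so every benchmark run decodes the same number of code points.
    static byte[] input(int count, int nBytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++)
            sb.appendCodePoint(SAMPLES[nBytes - 1]);
        byte[] b = sb.toString().getBytes(StandardCharsets.UTF_8);
        if (b.length != count * nBytes)
            throw new AssertionError("unexpected encoded length");
        return b;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 4; n++)
            System.out.println(n + "-byte codes: " + input(1000, n).length + " bytes");
    }
}
```

With count = 1000 this yields exactly the 1000/2000/3000/4000-byte inputs mentioned above, each containing 1000 code points.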
We should document somewhere that the CESU-8 decoder is faster than the strict UTF-8 decoder, for
developers who can ensure that there are no invalid surrogates in their source bytes.
-Ulf