<i18n dev> Codereview request for 7096080: UTF8 update and new CESU-8 charset

Fri Oct 14 01:30:46 PDT 2011

On 13.10.2011 21:13, Xueming Shen wrote:
> On 10/13/2011 09:55 AM, Ulf Zibis wrote:
>> On 11.10.2011 19:49, Xueming Shen wrote:
>>>
>>> I don't know which one is better, I did a run on
>>>
>>>     private static boolean op1(int b) {
>>>         return (b >> 6) != -2;
>>>     }
>>>     private static boolean op2(int b) {
>>>         return (b & 0xc0) != 0x80;
>>>     }
>>>     private static boolean op3(byte b) {
>>>          return b >= (byte)0xc0;
>>>     }
>>>
>>> with 1000000 iterations on my Linux machine, and got the scores
>>>
>>> op1=1149
>>> op2=1147
>>> op3=1146
>>>
>>> I would interpret it as they are identical.
>> Me too. thanks for your effort.
>> Maybe the comparison would differ on different architectures.
>>
>> So I would prefer op3, because the others ...
>> 1. might need one more CPU register to keep the original value of b for later use
>> 2. need one more constant to be loaded into the CPU
>> and op3 ...
>> 3. is the most readable source code
>> 4. seems most likely to help HotSpot find the best optimization on arbitrary architectures.
5. has the smallest bytecode footprint
6. so the interpreter would be faster too.

> I doubt it's more "readable":-), given it's the "byte operation" means
> "<0x80 && >= 0xc0" in int.
If b were an unsigned int in the range [0..0xFF], half yes (it would be: b < 0x80 || b >= 0xc0).
But b is in the range [-0x80..0x7F] because it originates from a byte array, so the operation translated to 
int would be: "b < -0x80 || b >= -0x40", where the first clause can never be true.
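
To illustrate the range argument, here is a minimal sketch of how bytes taken from a byte array look once they are widened to int. The class name SignedBytes and its two helper methods are hypothetical and not part of the UTF-8 implementation under review; they only show that the continuation bytes 0x80..0xBF arrive as -0x80..-0x41 unless they are masked.

```java
// Hypothetical demo class; not taken from the JDK sources.
final class SignedBytes {
    // Widen a byte the way "int b = src[i];" does: by sign extension.
    static int widen(byte b) {
        return b;
    }

    // Mask the byte to its unsigned value in [0..0xFF].
    static int unsigned(byte b) {
        return b & 0xff;
    }

    public static void main(String[] args) {
        // First and last continuation byte, first lead byte, last ASCII byte.
        byte[] src = { (byte) 0x80, (byte) 0xbf, (byte) 0xc0, 0x7f };
        for (byte b : src) {
            System.out.printf("unsigned=0x%02x signed=%d%n", unsigned(b), widen(b));
        }
    }
}
```

With sign extension, every continuation byte is below -0x40 and every other byte is at or above it, which is why a single signed comparison is enough.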

> You need "b" to be byte for b >= (byte)0xc0
No, it works the same for int, because b never falls below the lower limit -0x80 and (byte)0xc0 is 
-0x40.
So the notation "b >= (byte)0xc0" looks closest to its real unsigned counterpart.
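
The equivalence can be checked exhaustively. The following is a sketch under the assumption that b always holds a sign-extended byte value; the class name ContinuationCheck and the method allAgree are hypothetical, and op3 is declared with an int parameter here (unlike the byte parameter in the benchmark quoted above) to show that the comparison also works for int.

```java
// Hypothetical check class; op1..op3 mirror the three variants discussed in this thread.
final class ContinuationCheck {
    static boolean op1(int b) {
        return (b >> 6) != -2;
    }

    static boolean op2(int b) {
        return (b & 0xc0) != 0x80;
    }

    // Same expression as the byte version, but taking an int.
    static boolean op3(int b) {
        return b >= (byte) 0xc0;
    }

    // Compare all three variants against the signed range of continuation bytes.
    static boolean allAgree() {
        for (int b = Byte.MIN_VALUE; b <= Byte.MAX_VALUE; b++) {
            boolean notContinuation = !(b >= -0x80 && b < -0x40);
            if (op1(b) != notContinuation
                    || op2(b) != notContinuation
                    || op3(b) != notContinuation) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("all variants agree: " + allAgree());
    }
}
```

Note the limit of this equivalence: if b had been masked to [0..0xFF] first, only op2 would still reject 0x80, because both the shift and the signed comparison rely on the sign extension.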

> to be the equivalent of "<0x80 && >= 0xc0" and all use cases in UTF-8
> existing implementation the "b" has been stored in "int" already.  Arguably
> you can update the whole implementation to achieve this,
yes, that's exactly what I wanted to say.

> but personally
> I would like to just stick to the problem this proposal is trying to solve.
I agree, but it's not much more than declaring the bx as byte.

>
> And, no, for the same reason I don't want to replace all "(b & 0xc0) != 0x80"
> by "isNotContinuation(b)", they just look fine to me, together with their
> neighbors, such as "<0x80 && >= 0xc0".
Yes, they look fine, but the reader must always keep in mind that "(b & 0xc0) != 0x80" is 
semantically the same as "isNotContinuation(b)".
Why do you introduce isNotContinuation(b) at all? The check could always be written inline, as I don't think 
the tiny operation has any effect on HotSpot's optimization strategy, and as a side effect, I guess C1 
code would be faster.

-Ulf


>
> -Sherman
>
>>
>> Additionally, I guess always using byte variables might help HotSpot to optimize with 
>> 1-byte-operand CPU instructions.
>>
>> Wouldn't you like to replace all "(bx & 0xc0) != 0x80" by "isNotContinuation(bx)"?
>>
>> -Ulf
>>
>