review request for 6798511/6860431: Include functionality of Surrogate in Character

Ulf Zibis Ulf.Zibis at gmx.de
Wed Mar 3 10:44:51 UTC 2010


Am 03.03.2010 09:00, schrieb Martin Buchholz:
> On Tue, Mar 2, 2010 at 15:34, Ulf Zibis<Ulf.Zibis at gmx.de>  wrote:
>    
>> Am 26.08.2009 20:02, schrieb Xueming Shen:
>>      
>>> For example, the isBMP(int), it might be convenient, but it can be easily
>>> archived by the one line code
>>>
>>> (int)(char)codePoint == codePoint;
>>>
>>> or more readable form
>>>
>>>    codePoint<  Character.MIN_SUPPLEMENTARY_COE_POINT;
>>>
>>>        
>> In class sun.nio.cs.Surrogate we have:
>>     public static boolean isBMP(int uc) {
>>         return (int) (char) uc == uc;
>>     }
>>
>> 1.) It's enough to have:
>>         return (char)uc == uc;
>>     better:
>>         assert MIN_VALUE == 0&&  MAX_VALUE == 0xFFFF;
>>         return (char)uc == uc;
>>         // Optimized form of: uc>= MIN_VALUE&&  uc<= MAX_VALUE
>>
>> 2.) Above code is compiled to (needs 16 bytes of machine code):
>>   0x00b87ad8: mov    %ebx,%ebp
>>   0x00b87ada: and    $0xffff,%ebp
>>   0x00b87ae0: cmp    %ebx,%ebp
>>   0x00b87ae2: jne    0x00b87c52
>>   0x00b87ae8:
>>
>>     We could code:
>>         assert MIN_VALUE == 0&&  (MAX_VALUE + 1) == (1<<  16);
>>         return (uc>>  16) == 0;
>>         // Optimized form of: uc>= MIN_VALUE&&  uc<= MAX_VALUE
>>      
> I agree that
> return (uc>>  16) == 0;
> is marginally better than my
> return (int) (char) uc == uc;
> (although I think the redundant cast to int
> makes the code more readable).
>    

Seems to be individual. I always stumble over superfluous casts by 
thinking about, what they have to do.

> I approve such a change to isBMPCodePoint()
> and inclusion of such a method in Character.
>    

Pleased! Who could file the bug? I would provide the patch.

>
>    
>>     is compiled to (needs only 9 bytes of machine code):
>>   0x00b87aac: mov    %ebx,%ecx
>>   0x00b87aae: sar    $0x10,%ecx
>>   0x00b87ab1: test   %ecx,%ecx
>>   0x00b87ab3: je     0x00b87acb
>>   0x00b87ab5:
>>
>> 1.) If we have:
>>     public static boolean isSupplementaryCodePoint(int codePoint) {
>>         assert MIN_SUPPLEMENTARY_CODE_POINT == (1<<  16)&&
>>                 (MAX_SUPPLEMENTARY_CODE_POINT + 1) % (1<<  16) == 0;
>>         return (codePoint>>  16) != 0
>> &&  (codePoint>>  16)<  (MAX_SUPPLEMENTARY_CODE_POINT + 1>>  16);
>>         // Optimized form of: codePoint>= MIN_SUPPLEMENTARY_CODE_POINT
>>         //&&  codePoint<= MAX_SUPPLEMENTARY_CODE_POINT;
>>     }
>>      
> Keep in mind that supplementary characters are extremely rare.
>    

Yes, but many API's in the JDK are used rarely.
Why should they waste memory footprint / perform bad, particularly if it 
doesn't cost anything.

> Therefore the existing implementation
>
>   return codePoint>= MIN_SUPPLEMENTARY_CODE_POINT
>              &&  codePoint<= MAX_CODE_POINT;
>
> will almost always perform just one comparison against a constant,
> which is hard to beat.
>    

1. Wondering: I think there are TWO comparisons.
2. Those comparisons need to load 32 bit values from machine code, 
against only 8 bit values in my case.
3. The first of the 2 comparisons becomes outlined if compiled in 
combination with isBMPCodePoint(). (see below)


> I'm not sure whether your code above gets the right answer for negative input.
> Perhaps you need to do
> (codePoint>>>  16)<  ...
>    

Oops, I'm afraid you are right, but fortunately it doesn't cost 
anything: sar becomes replaced by shr.

> Martin
>
>    
>> and:
>>         if (Surrogate.isBMP(uc))
>>             ...;
>>         else if (Character.isSupplementaryCodePoint(uc))
>>             ...;
>>         else
>>             ...;
>>
>>     we get (needs only 18 bytes of machine code):
>>   0x00b87aac: mov    %ebx,%ecx
>>   0x00b87aae: sar    $0x10,%ecx
>>   0x00b87ab1: test   %ecx,%ecx
>>   0x00b87ab3: je     0x00b87acb
>>   0x00b87ab5: cmp    $0x11,%ecx
>>   0x00b87ab8: jge    0x00b87ce6
>>   0x00b87abe:
>>
>>      

BTW the compiled code of the existing code (needs *36* bytes of machine 
code):
   0x00b87a5c: test   %ebp,%ebp
   0x00b87a5e: jl     0x00b87a68
   0x00b87a60: cmp    $0x10000,%ebp
   0x00b87a66: jl     0x00b87a8d
   0x00b87a68: cmp    $0x10000,%ebp
   0x00b87a6e: jl     0x00b87c63
   0x00b87a74: cmp    $0x10ffff,%ebp
   0x00b87a7a: jg     0x00b87c63
   0x00b87a80:

BTW 2: The code example is seen in one of the String constructors, where 
there the 2-comparison code is manually inlined instead of using 
Surrogate.isBMP().

-Ulf






More information about the core-libs-dev mailing list