review request for 6798511/6860431: Include functionality of Surrogate in Character

Martin Buchholz martinrb at google.com
Wed Mar 3 08:00:05 UTC 2010


On Tue, Mar 2, 2010 at 15:34, Ulf Zibis <Ulf.Zibis at gmx.de> wrote:
> Am 26.08.2009 20:02, schrieb Xueming Shen:
>>
>> For example, the isBMP(int), it might be convenient, but it can be easily
>> archived by the one line code
>>
>> (int)(char)codePoint == codePoint;
>>
>> or more readable form
>>
>>   codePoint < Character.MIN_SUPPLEMENTARY_COE_POINT;
>>
>
> In class sun.nio.cs.Surrogate we have:
>    public static boolean isBMP(int uc) {
>        return (int) (char) uc == uc;
>    }
>
> 1.) It's enough to have:
>        return (char)uc == uc;
>    better:
>        assert MIN_VALUE == 0 && MAX_VALUE == 0xFFFF;
>        return (char)uc == uc;
>        // Optimized form of: uc >= MIN_VALUE && uc <= MAX_VALUE
>
> 2.) Above code is compiled to (needs 16 bytes of machine code):
>  0x00b87ad8: mov    %ebx,%ebp
>  0x00b87ada: and    $0xffff,%ebp
>  0x00b87ae0: cmp    %ebx,%ebp
>  0x00b87ae2: jne    0x00b87c52
>  0x00b87ae8:
>
>    We could code:
>        assert MIN_VALUE == 0 && (MAX_VALUE + 1) == (1 << 16);
>        return (uc >> 16) == 0;
>        // Optimized form of: uc >= MIN_VALUE && uc <= MAX_VALUE

I agree that
return (uc >> 16) == 0;
is marginally better than my
return (int) (char) uc == uc;
(although I think the redundant cast to int
makes the code more readable).
I approve such a change to isBMPCodePoint()
and inclusion of such a method in Character.


>    is compiled to (needs only 9 bytes of machine code):
>  0x00b87aac: mov    %ebx,%ecx
>  0x00b87aae: sar    $0x10,%ecx
>  0x00b87ab1: test   %ecx,%ecx
>  0x00b87ab3: je     0x00b87acb
>  0x00b87ab5:
>
> 1.) If we have:
>    public static boolean isSupplementaryCodePoint(int codePoint) {
>        assert MIN_SUPPLEMENTARY_CODE_POINT == (1 << 16) &&
>                (MAX_SUPPLEMENTARY_CODE_POINT + 1) % (1 << 16) == 0;
>        return (codePoint >> 16) != 0
> && (codePoint >> 16) < (MAX_SUPPLEMENTARY_CODE_POINT + 1 >> 16);
>        // Optimized form of: codePoint >= MIN_SUPPLEMENTARY_CODE_POINT
>        // && codePoint <= MAX_SUPPLEMENTARY_CODE_POINT;
>    }

Keep in mind that supplementary characters are extremely rare.
Therefore the existing implementation

 return codePoint >= MIN_SUPPLEMENTARY_CODE_POINT
            && codePoint <= MAX_CODE_POINT;

will almost always perform just one comparison against a constant,
which is hard to beat.

I'm not sure whether your code above gets the right answer for negative input.
Perhaps you need to do
(codePoint >>> 16) < ...

Martin

> and:
>        if (Surrogate.isBMP(uc))
>            ...;
>        else if (Character.isSupplementaryCodePoint(uc))
>            ...;
>        else
>            ...;
>
>    we get (needs only 18 bytes of machine code):
>  0x00b87aac: mov    %ebx,%ecx
>  0x00b87aae: sar    $0x10,%ecx
>  0x00b87ab1: test   %ecx,%ecx
>  0x00b87ab3: je     0x00b87acb
>  0x00b87ab5: cmp    $0x11,%ecx
>  0x00b87ab8: jge    0x00b87ce6
>  0x00b87abe:
>
>
> -Ulf
>
>
>
>
>



More information about the core-libs-dev mailing list