review request for 6798511/6860431: Include functionality of Surrogate in Character
Ulf Zibis
Ulf.Zibis at gmx.de
Wed Mar 3 10:44:51 UTC 2010
Am 03.03.2010 09:00, schrieb Martin Buchholz:
> On Tue, Mar 2, 2010 at 15:34, Ulf Zibis<Ulf.Zibis at gmx.de> wrote:
>
>> Am 26.08.2009 20:02, schrieb Xueming Shen:
>>
>>> For example, the isBMP(int), it might be convenient, but it can be easily
>>> archived by the one line code
>>>
>>> (int)(char)codePoint == codePoint;
>>>
>>> or more readable form
>>>
>>> codePoint< Character.MIN_SUPPLEMENTARY_COE_POINT;
>>>
>>>
>> In class sun.nio.cs.Surrogate we have:
>> public static boolean isBMP(int uc) {
>> return (int) (char) uc == uc;
>> }
>>
>> 1.) It's enough to have:
>> return (char)uc == uc;
>> better:
>> assert MIN_VALUE == 0&& MAX_VALUE == 0xFFFF;
>> return (char)uc == uc;
>> // Optimized form of: uc>= MIN_VALUE&& uc<= MAX_VALUE
>>
>> 2.) Above code is compiled to (needs 16 bytes of machine code):
>> 0x00b87ad8: mov %ebx,%ebp
>> 0x00b87ada: and $0xffff,%ebp
>> 0x00b87ae0: cmp %ebx,%ebp
>> 0x00b87ae2: jne 0x00b87c52
>> 0x00b87ae8:
>>
>> We could code:
>> assert MIN_VALUE == 0&& (MAX_VALUE + 1) == (1<< 16);
>> return (uc>> 16) == 0;
>> // Optimized form of: uc>= MIN_VALUE&& uc<= MAX_VALUE
>>
> I agree that
> return (uc>> 16) == 0;
> is marginally better than my
> return (int) (char) uc == uc;
> (although I think the redundant cast to int
> makes the code more readable).
>
Seems to be individual. I always stumble over superfluous casts by
thinking about, what they have to do.
> I approve such a change to isBMPCodePoint()
> and inclusion of such a method in Character.
>
Pleased! Who could file the bug? I would provide the patch.
>
>
>> is compiled to (needs only 9 bytes of machine code):
>> 0x00b87aac: mov %ebx,%ecx
>> 0x00b87aae: sar $0x10,%ecx
>> 0x00b87ab1: test %ecx,%ecx
>> 0x00b87ab3: je 0x00b87acb
>> 0x00b87ab5:
>>
>> 1.) If we have:
>> public static boolean isSupplementaryCodePoint(int codePoint) {
>> assert MIN_SUPPLEMENTARY_CODE_POINT == (1<< 16)&&
>> (MAX_SUPPLEMENTARY_CODE_POINT + 1) % (1<< 16) == 0;
>> return (codePoint>> 16) != 0
>> && (codePoint>> 16)< (MAX_SUPPLEMENTARY_CODE_POINT + 1>> 16);
>> // Optimized form of: codePoint>= MIN_SUPPLEMENTARY_CODE_POINT
>> //&& codePoint<= MAX_SUPPLEMENTARY_CODE_POINT;
>> }
>>
> Keep in mind that supplementary characters are extremely rare.
>
Yes, but many API's in the JDK are used rarely.
Why should they waste memory footprint / perform bad, particularly if it
doesn't cost anything.
> Therefore the existing implementation
>
> return codePoint>= MIN_SUPPLEMENTARY_CODE_POINT
> && codePoint<= MAX_CODE_POINT;
>
> will almost always perform just one comparison against a constant,
> which is hard to beat.
>
1. Wondering: I think there are TWO comparisons.
2. Those comparisons need to load 32 bit values from machine code,
against only 8 bit values in my case.
3. The first of the 2 comparisons becomes outlined if compiled in
combination with isBMPCodePoint(). (see below)
> I'm not sure whether your code above gets the right answer for negative input.
> Perhaps you need to do
> (codePoint>>> 16)< ...
>
Oops, I'm afraid you are right, but fortunately it doesn't cost
anything: sar becomes replaced by shr.
> Martin
>
>
>> and:
>> if (Surrogate.isBMP(uc))
>> ...;
>> else if (Character.isSupplementaryCodePoint(uc))
>> ...;
>> else
>> ...;
>>
>> we get (needs only 18 bytes of machine code):
>> 0x00b87aac: mov %ebx,%ecx
>> 0x00b87aae: sar $0x10,%ecx
>> 0x00b87ab1: test %ecx,%ecx
>> 0x00b87ab3: je 0x00b87acb
>> 0x00b87ab5: cmp $0x11,%ecx
>> 0x00b87ab8: jge 0x00b87ce6
>> 0x00b87abe:
>>
>>
BTW the compiled code of the existing code (needs *36* bytes of machine
code):
0x00b87a5c: test %ebp,%ebp
0x00b87a5e: jl 0x00b87a68
0x00b87a60: cmp $0x10000,%ebp
0x00b87a66: jl 0x00b87a8d
0x00b87a68: cmp $0x10000,%ebp
0x00b87a6e: jl 0x00b87c63
0x00b87a74: cmp $0x10ffff,%ebp
0x00b87a7a: jg 0x00b87c63
0x00b87a80:
BTW 2: The code example is seen in one of the String constructors, where
there the 2-comparison code is manually inlined instead of using
Surrogate.isBMP().
-Ulf
More information about the core-libs-dev
mailing list