String.lastIndexOf confused by unpaired trailing surrogate

Martin Buchholz martinrb at google.com
Fri Mar 26 00:06:27 UTC 2010


On Thu, Mar 25, 2010 at 10:19, Ulf Zibis <Ulf.Zibis at gmx.de> wrote:
> On 24.03.2010 09:24, Martin Buchholz wrote:

>> Addition of highSurrogate and lowSurrogate
>> imported patch highSurrogate
>> http://cr.openjdk.java.net/~martin/webrevs/openjdk7/highSurrogate
>>
>
> Looks good. Interesting workaround for my "Note:".
> I had expected my highSurrogate(char highCPWord, char lowCPWord)
> to be dropped.

Yeah, it's not the kind of method that tends to become a public API.

If you can demonstrate a real performance advantage for highSurrogate(char,char)
beyond just EUC_TW, especially in UTF_8, then we can put it into Surrogate.java.

Martin

> Anyway, I'd like to note that I use that shortcut in my EUC_TW$Decoder
> twiddling. The following code:
>
>            da[dp] = Character.highSurrogate(0x20000 + c);
> results in (19 bytes):
>  0x00b8ae27: add    $0x20000,%ecx      ;*iadd
>                                        ; - sun.nio.cs.ext.D_21_d_narrow::decode@98 (line 196)
>  0x00b8ae2d: mov    %ecx,%ebp
>  0x00b8ae2f: shr    $0xa,%ebp
>  0x00b8ae32: add    $0xd7c0,%ebp       ;*isub
>                                        ; - java.lang.Character::highSurrogate@9 (line 3343)
>                                        ; - sun.nio.cs.ext.D_21_d_narrow::decode@99 (line 196)
>
>            da[dp] = Character.highSurrogate((char)0x2, c);
> results in (9 bytes):
>  0x00b899e7: shr    $0xa,%ebp
>  0x00b899ea: add    $0xd840,%ebp       ;*isub
>                                        ; - java.lang.Character::highSurrogate@14 (line 3365)
>                                        ; - sun.nio.cs.ext.D_22_d_n_fastSurrogate::decode@97 (line 196)
>
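
The two disassemblies above differ only in where the 0x20000 offset gets
applied. A minimal sketch of why the shorter form gives the same result,
assuming the two-argument method folds the high code point word into the
added constant. The body of highSurrogate(char, char) below is a hypothetical
reconstruction for illustration, not the code from the webrev; the class name
is also made up. Only Character.highSurrogate(int) is the real API.

```java
final class SurrogateSketch {

    // Hypothetical reconstruction of the two-argument shortcut.
    // The code point is (highCPWord << 16) | lowCPWord, so shifting it right
    // by 10 contributes (highCPWord << 6) + (lowCPWord >>> 10) with no carry.
    // 0xD7C0 is MIN_HIGH_SURROGATE - (MIN_SUPPLEMENTARY_CODE_POINT >>> 10).
    static char highSurrogate(char highCPWord, char lowCPWord) {
        return (char) ((lowCPWord >>> 10) + 0xD7C0 + (highCPWord << 6));
    }

    public static void main(String[] args) {
        // With highCPWord == 0x2 the constant part is 0xD7C0 + 0x80 == 0xD840,
        // which is the immediate seen in the 9-byte disassembly.
        for (int c = 0; c <= 0xFFFF; c++) {
            char expected = Character.highSurrogate(0x20000 + c);
            char actual = highSurrogate((char) 0x2, (char) c);
            if (expected != actual) {
                throw new AssertionError("mismatch at c=0x" + Integer.toHexString(c));
            }
        }
        System.out.println("ok");
    }
}
```

When highCPWord is a compile-time constant, the JIT can fold
0xD7C0 + (highCPWord << 6) into a single immediate, which would account for
the missing add and mov instructions in the second listing.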
>
>            dst.putInt(Character.highSurrogate((char)0x2, c) << 16 |
> Character.lowSurrogate(c));
> would additionally increase performance. I'm still preparing the benchmark +
> disassembly.
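
A minimal sketch of the packed write, assuming the destination is a
big-endian java.nio.ByteBuffer (the type of dst is not stated above, so that
is an assumption, as are the class and method names). It only checks that one
putInt of the packed pair produces the same bytes as two putChar calls; it
makes no claim about speed.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class PackedSurrogates {

    // Pack the high surrogate into the upper 16 bits and the low surrogate
    // into the lower 16 bits of one int.
    static int pack(int codePoint) {
        return Character.highSurrogate(codePoint) << 16
                | Character.lowSurrogate(codePoint);
    }

    public static void main(String[] args) {
        int cp = 0x20000 + 0x1234; // a supplementary code point in plane 2

        // One 32-bit write of the packed pair (ByteBuffer is big-endian by default).
        ByteBuffer packed = ByteBuffer.allocate(4);
        packed.putInt(pack(cp));

        // Two 16-bit writes, high surrogate first.
        ByteBuffer separate = ByteBuffer.allocate(4);
        separate.putChar(Character.highSurrogate(cp));
        separate.putChar(Character.lowSurrogate(cp));

        if (!Arrays.equals(packed.array(), separate.array())) {
            throw new AssertionError("byte layouts differ");
        }
        System.out.println("ok");
    }
}
```

With a little-endian buffer the two halves would land in swapped order, so
the packed form is only a drop-in replacement when the byte order is
big-endian.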
>
> This twiddling could be used in all surrogate-processing charset coders,
> possibly including UTF_x.
> If public, it would also be useful for developers writing charset coders for
> exotic charsets via java.nio.charset.spi.CharsetProvider.
>
> -Ulf
>
>
>


