String.lastIndexOf confused by unpaired trailing surrogate
Martin Buchholz
martinrb at google.com
Fri Mar 26 00:06:27 UTC 2010
On Thu, Mar 25, 2010 at 10:19, Ulf Zibis <Ulf.Zibis at gmx.de> wrote:
> On 24.03.2010 09:24, Martin Buchholz wrote:
>> Addition of highSurrogate and lowSurrogate
>> imported patch highSurrogate
>> http://cr.openjdk.java.net/~martin/webrevs/openjdk7/highSurrogate
>>
>
> Looks good. Interesting workaround for my "Note:".
> I had expected my highSurrogate(char highCPWord, char lowCPWord)
> to be dropped.
Yeah, it's not the kind of method that tends to become a public API.
If you can demonstrate a real performance advantage for highSurrogate(char,char)
beyond just EUC_TW, esp in UTF_8, then we can put it into Surrogate.java.
Martin
> Anyway, I'd like to note that I use that shortcut in my EUC_TW$Decoder
> bit twiddling. The following code:
>
> da[dp] = Character.highSurrogate(0x20000 + c);
> results in (19 bytes):
> 0x00b8ae27: add $0x20000,%ecx ;*iadd
> ; - sun.nio.cs.ext.D_21_d_narrow::decode@98 (line 196)
> 0x00b8ae2d: mov %ecx,%ebp
> 0x00b8ae2f: shr $0xa,%ebp
> 0x00b8ae32: add $0xd7c0,%ebp ;*isub
> ; - java.lang.Character::highSurrogate@9 (line 3343)
> ; - sun.nio.cs.ext.D_21_d_narrow::decode@99 (line 196)
>
> da[dp] = Character.highSurrogate((char)0x2, c);
> results in (9 bytes):
> 0x00b899e7: shr $0xa,%ebp
> 0x00b899ea: add $0xd840,%ebp ;*isub
> ; - java.lang.Character::highSurrogate@14 (line 3365)
> ; - sun.nio.cs.ext.D_22_d_n_fastSurrogate::decode@97 (line 196)
>
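
To make the arithmetic behind the two listings above concrete, here is a minimal sketch. The two-argument highSurrogate below is a hypothetical reconstruction of the proposed shortcut, not the code from the webrev; the class and helper names are invented for illustration. It relies only on the fact that Character.highSurrogate(int) computes (cp >>> 10) + 0xD7C0, so a constant high word of 0x2 folds into the single constant 0xD840 seen in the shorter listing.

```java
public class SurrogateShortcut {
    // Hypothetical two-argument variant (an assumption for illustration):
    // highCPWord holds the upper 16 bits of the code point, lowCPWord the
    // lower 16 bits. (highCPWord << 6) equals (highCPWord << 16) >>> 10.
    public static char highSurrogate(char highCPWord, char lowCPWord) {
        return (char) (0xD7C0 + (highCPWord << 6) + (lowCPWord >>> 10));
    }

    // Counts the low words for which the shortcut disagrees with the
    // standard one-argument method on plane 2 (0x20000..0x2FFFF).
    public static int countMismatches() {
        int mismatches = 0;
        for (int c = 0; c <= 0xFFFF; c++) {
            char expected = Character.highSurrogate(0x20000 + c);
            char actual = highSurrogate((char) 0x2, (char) c);
            if (expected != actual) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches());
    }
}
```

With a constant first argument the JIT can fold 0xD7C0 + (0x2 << 6) into 0xD840, which would account for the missing add and mov in the second listing.
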
>
> dst.putInt(Character.highSurrogate((char)0x2, c) << 16 |
> Character.lowSurrogate(c));
> would additionally increase performance. I'm still preparing the benchmark +
> disassembly.
>
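
A rough sketch of the packed write described above, under the assumption that dst is a big-endian java.nio.ByteBuffer (the quoted mail does not say what dst is) and that c is the low 16 bits of a plane-2 code point. The class and method names are invented for illustration, and the constant 0xD840 stands in for the proposed highSurrogate((char)0x2, c).

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class PackedSurrogates {
    // Packs both surrogates for code point 0x20000 + c into one int:
    // high surrogate in the upper 16 bits, low surrogate in the lower 16.
    public static int pack(int c) {
        char high = (char) (0xD840 + (c >>> 10)); // folded plane-2 constant
        char low = Character.lowSurrogate(c);     // uses only the low 10 bits
        return high << 16 | low;
    }

    // Writes the pair once as a single int and once as two chars,
    // then reports whether the resulting bytes are identical.
    public static boolean sameBytes(int c) {
        ByteBuffer packed = ByteBuffer.allocate(4);   // big-endian by default
        packed.putInt(pack(c));

        ByteBuffer separate = ByteBuffer.allocate(4);
        separate.putChar(Character.highSurrogate(0x20000 + c));
        separate.putChar(Character.lowSurrogate(0x20000 + c));

        return Arrays.equals(packed.array(), separate.array());
    }

    public static void main(String[] args) {
        System.out.println(sameBytes(0x0A3F));
    }
}
```

Whether one 4-byte store actually beats two 2-byte stores would have to be shown by the benchmark; the sketch only checks that the bytes come out the same.
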
> This twiddling could be used in all charset coders that process
> surrogates, possibly including UTF_x.
> If public, it would also be useful for developers writing charset coders
> for exotic charsets via java.nio.charset.spi.CharsetProvider.
>
> -Ulf
>
>
>