String.lastIndexOf confused by unpaired trailing surrogate

Thu Mar 25 17:19:06 UTC 2010

On 24.03.2010 09:24, Martin Buchholz wrote:
> Ulf, Sherman, Masayoshi,
> here are changes for you to review.
> Only the patch highSurrogate needs a separate bug filed
> (and CCC, please)
>
> Ulf, I've made some progress on integrating your changes,
> although almost all of them have been somewhat martinized:
>
> Ulf-style tidying, mostly whitespace.
> [mq]: Character-warnings2
> http://cr.openjdk.java.net/~martin/webrevs/openjdk7/Character-warnings2
>    

I would prefer (better visibility of continued line):

public final  class Character
         implements java.io.Serializable, Comparable<Character>  {

I would prefer (indicates that we are in the current class):

     #isDigit(char)
instead of
     Character#isDigit(char)
although that is indeed better than
     java.lang.Character#isDigit(char)

> Very minor optimizations.  Barely worth doing.
> Note my removal of the need to have n++ inside the loop.
>    

Overlooked. Shame on me, as that's true Ulf-style. Yes, it reduces the
increments/decrements to the rare supplementary cases.
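
To illustrate the idea: a minimal sketch of a counting loop without n++ in
the loop body. The webrev is not reproduced in this mail, so the class and
method names here are hypothetical and this is not the patched JDK source.
The count starts at the number of chars and is decremented only when a
valid surrogate pair is found.

```java
final class CodePointCountSketch {
    // Hypothetical helper, not the JDK source:
    // counts the code points in a[offset .. offset + count).
    static int countCodePoints(char[] a, int offset, int count) {
        int endIndex = offset + count;
        int n = count;                  // assume one code point per char ...
        for (int i = offset; i < endIndex; ) {
            if (Character.isHighSurrogate(a[i++])
                    && i < endIndex
                    && Character.isLowSurrogate(a[i])) {
                n--;                    // ... and correct only for a valid pair
                i++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // 'a', U+20000 as a surrogate pair, 'b', then an unpaired high surrogate
        char[] text = "a\uD840\uDC00b\uD840".toCharArray();
        System.out.println(countCodePoints(text, 0, text.length));
        System.out.println(Character.codePointCount(text, 0, text.length));
    }
}
```

For BMP-only text the loop body never touches n, which is the point of the
remark above.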

> imported patch ulf-opto
> http://cr.openjdk.java.net/~martin/webrevs/openjdk7/ulf-opto
>
> Addition of highSurrogate and lowSurrogate
> imported patch highSurrogate
> http://cr.openjdk.java.net/~martin/webrevs/openjdk7/highSurrogate
>    

Looks good. Interesting workaround for my "Note:".
I had expected my highSurrogate(char highCPWord, char lowCPWord) to be
dropped.
Anyway, I'd like to note that I use that shortcut in my EUC_TW$Decoder
twiddling. The following code:

             da[dp] = Character.highSurrogate(0x20000 + c);
results in (19 bytes):
   0x00b8ae27: add    $0x20000,%ecx      ;*iadd
                                         ; - sun.nio.cs.ext.D_21_d_narrow::decode@98 (line 196)
   0x00b8ae2d: mov    %ecx,%ebp
   0x00b8ae2f: shr    $0xa,%ebp
   0x00b8ae32: add    $0xd7c0,%ebp       ;*isub
                                         ; - java.lang.Character::highSurrogate@9 (line 3343)
                                         ; - sun.nio.cs.ext.D_21_d_narrow::decode@99 (line 196)

             da[dp] = Character.highSurrogate((char)0x2, c);
results in (9 bytes):
   0x00b899e7: shr    $0xa,%ebp
   0x00b899ea: add    $0xd840,%ebp       ;*isub
                                         ; - java.lang.Character::highSurrogate@14 (line 3365)
                                         ; - sun.nio.cs.ext.D_22_d_n_fastSurrogate::decode@97 (line 196)
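
For readers following along, here is a minimal sketch of the arithmetic
behind the two disassemblies. The class name and the method bodies are my
assumptions, not the code from the webrev; in particular the two-argument
highSurrogate is only a guess at how such a shortcut could be written.
It shows why a constant highCPWord of 0x2 folds into the single constant
0xd840 (0xd7c0 + (0x2 << 6)), so the separate add of 0x20000 disappears.

```java
final class SurrogateSketch {
    // 0xD800 - (0x10000 >>> 10) == 0xD7C0, the constant in the first listing.
    static final int HIGH_BASE =
            Character.MIN_HIGH_SURROGATE - (Character.MIN_SUPPLEMENTARY_CODE_POINT >>> 10);

    // One-argument form: takes the full supplementary code point.
    static char highSurrogate(int codePoint) {
        return (char) ((codePoint >>> 10) + HIGH_BASE);
    }

    // Hypothetical two-argument form: highCPWord carries bits 16..20 of the
    // code point, lowCPWord carries bits 0..15. If highCPWord is a constant,
    // (highCPWord << 6) + HIGH_BASE can be folded at compile time.
    static char highSurrogate(char highCPWord, char lowCPWord) {
        return (char) ((lowCPWord >>> 10) + (highCPWord << 6) + HIGH_BASE);
    }

    // The low surrogate depends only on the low 10 bits.
    static char lowSurrogate(char lowCPWord) {
        return (char) ((lowCPWord & 0x3FF) + Character.MIN_LOW_SURROGATE);
    }

    public static void main(String[] args) {
        char c = 0x1234;                       // arbitrary offset within plane 2
        int codePoint = 0x20000 + c;
        System.out.println(highSurrogate(codePoint) == highSurrogate((char) 0x2, c));
        System.out.println(Integer.toHexString((0x2 << 6) + HIGH_BASE));
    }
}
```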

             dst.putInt(Character.highSurrogate((char)0x2, c) << 16 |
Character.lowSurrogate(c));
would additionally increase performance. I'm still preparing the 
benchmark + disassembly.
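
Until the benchmark is ready, here is a small sketch of what the packed
write amounts to. It assumes dst is a big-endian java.nio.ByteBuffer (a
CharBuffer has no putInt), and it uses the one-argument
Character.highSurrogate(int) and Character.lowSurrogate(int) so that it is
self-contained; the class and method names are hypothetical. It only shows
that one putInt stores the same bytes as two putChar calls and makes no
claim about speed.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

final class PackedSurrogateSketch {
    // Two 16-bit writes, one per surrogate.
    static byte[] writeSeparately(int codePoint) {
        ByteBuffer dst = ByteBuffer.allocate(4);   // big-endian by default
        dst.putChar(Character.highSurrogate(codePoint));
        dst.putChar(Character.lowSurrogate(codePoint));
        return dst.array();
    }

    // One 32-bit write: high surrogate in the upper half, low in the lower.
    static byte[] writePacked(int codePoint) {
        ByteBuffer dst = ByteBuffer.allocate(4);   // big-endian by default
        dst.putInt(Character.highSurrogate(codePoint) << 16
                | Character.lowSurrogate(codePoint));
        return dst.array();
    }

    public static void main(String[] args) {
        int codePoint = 0x20000 + 0x1234;
        System.out.println(Arrays.equals(writeSeparately(codePoint), writePacked(codePoint)));
    }
}
```

With a little-endian buffer the two halves would land in swapped order, so
the packing would have to be reversed there.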

This twiddling could be used in all surrogate-processing charset
coders, e.g. possibly the UTF_x ones.
If public, it would also be useful for developers writing charset coders
for exotic charsets via java.nio.charset.spi.CharsetProvider.

-Ulf