Sponsor for 6666666: A better implementation of Character.isSupplementaryCodePoint

Thu Mar 11 21:14:10 UTC 2010

On 11.03.2010 20:38, Martin Buchholz wrote:
> Ulf, your changes would be easier to get in
> if they were organized as mq patch files that
> could be qimported into an existing mq repo.
>

To be honest, I have never heard of mq. Can you point me to some docs, please?

> I've done that below, which includes a subset of
> your own proposed changes:
>
> http://cr.openjdk.java.net/~martin/webrevs/openjdk7/isSupplementaryCodePoint/
> http://cr.openjdk.java.net/~martin/webrevs/openjdk7/public-isBMPCodePoint/
>

- Maybe better: "... using a single {@code char}".
- Why don't you like using the new isBMPCodePoint() in isSupplementaryCodePoint() and
toUpperCaseCharArray()?
- The same shift trick would improve isISOControl(), isHighSurrogate() and isLowSurrogate(), in
particular if the latter two are called consecutively.
   An 8-bit shift + compare would allow HotSpot to compile to compact op-codes with 1-byte immediates.
- Don't you think my notes on validity are worth adding (or filing as a separate bug)?
- Changing ch <= MAX_SURROGATE to ch < MAX_SURROGATE + 1 would allow the HotSpot compiler to optimize
away 1 branch if those methods are used consecutively.
- And finally, I would like to make the constants complete (= adding MAX_SUPPLEMENTARY_CODE_POINT).
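
To make the shift idea concrete, here is a rough sketch of how I picture it. The class name
ShiftChecksSketch is made up for illustration, and this is not the code from either webrev. It
assumes that comparing the upper bits after a shift is what lets the JIT use small immediates, and
that giving isHighSurrogate() and isLowSurrogate() a shared bound (0xDC) is what would let one
branch be dropped when both are called in a row; I have not measured either.

```java
final class ShiftChecksSketch {

    // BMP code points fit in 16 bits, so the upper 16 bits must be zero.
    // Negative values fail too, because >>> shifts in zeros from the left.
    static boolean isBMPCodePoint(int codePoint) {
        return codePoint >>> 16 == 0;
    }

    // Supplementary code points live in planes 1..16 (U+10000..U+10FFFF).
    static boolean isSupplementaryCodePoint(int codePoint) {
        int plane = codePoint >>> 16;
        return plane != 0 && plane < ((Character.MAX_CODE_POINT + 1) >>> 16);
    }

    // High surrogates are U+D800..U+DBFF: upper byte in [0xD8, 0xDC).
    static boolean isHighSurrogate(char ch) {
        int hi = ch >>> 8;
        return hi >= 0xD8 && hi < 0xDC;
    }

    // Low surrogates are U+DC00..U+DFFF: upper byte in [0xDC, 0xE0).
    // The lower bound here equals the exclusive upper bound above.
    static boolean isLowSurrogate(char ch) {
        int hi = ch >>> 8;
        return hi >= 0xDC && hi < 0xE0;
    }

    public static void main(String[] args) {
        System.out.println(isBMPCodePoint(0xFFFF));
        System.out.println(isSupplementaryCodePoint(0x10000));
        System.out.println(isHighSurrogate('\uD800'));
        System.out.println(isLowSurrogate('\uDFFF'));
    }
}
```

The results agree with the existing Character methods for every char and for the boundary code
points I tried; whether HotSpot really emits shorter code for this form would need to be checked.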

> http://cr.openjdk.java.net/~martin/webrevs/openjdk7/Character-warnings/
>

That reminds me that some months ago I prepared a cleaned-up version of Character's source (things
like the above, replacing <code> with {@code}, fixing indentation inconsistencies, etc.). Would
there be interest in such a patch?

> http://cr.openjdk.java.net/~martin/webrevs/openjdk7/malformed-utf8/
>

In encodeBufferLoop() you could use putChar() and putInt() instead of put(). That should perform better.
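
What I have in mind is roughly the following. This is only a sketch, not the code from the
webrev: the class name Utf8PutSketch and its method names are invented, and it relies on the
destination ByteBuffer using its default big-endian byte order. Whether the combined writes are
actually faster is my expectation, not something I have benchmarked.

```java
import java.nio.ByteBuffer;

final class Utf8PutSketch {

    // Two-byte UTF-8 form (U+0080..U+07FF), written with two put() calls.
    static void twoBytesWithPut(ByteBuffer dst, char c) {
        dst.put((byte) (0xC0 | (c >> 6)));
        dst.put((byte) (0x80 | (c & 0x3F)));
    }

    // Same two bytes, combined into one putChar() (big-endian order assumed).
    static void twoBytesWithPutChar(ByteBuffer dst, char c) {
        dst.putChar((char) (0xC080 | ((c >> 6) << 8) | (c & 0x3F)));
    }

    // Four-byte UTF-8 form (U+10000..U+10FFFF), written with four put() calls.
    static void fourBytesWithPut(ByteBuffer dst, int cp) {
        dst.put((byte) (0xF0 | (cp >> 18)));
        dst.put((byte) (0x80 | ((cp >> 12) & 0x3F)));
        dst.put((byte) (0x80 | ((cp >> 6) & 0x3F)));
        dst.put((byte) (0x80 | (cp & 0x3F)));
    }

    // Same four bytes, combined into one putInt() (big-endian order assumed).
    static void fourBytesWithPutInt(ByteBuffer dst, int cp) {
        dst.putInt(0xF0808080
                | ((cp >> 18) << 24)
                | (((cp >> 12) & 0x3F) << 16)
                | (((cp >> 6) & 0x3F) << 8)
                | (cp & 0x3F));
    }

    public static void main(String[] args) {
        ByteBuffer a = ByteBuffer.allocate(6);
        ByteBuffer b = ByteBuffer.allocate(6);
        twoBytesWithPut(a, '\u00E9');
        fourBytesWithPut(a, 0x1F600);
        twoBytesWithPutChar(b, '\u00E9');
        fourBytesWithPutInt(b, 0x1F600);
        System.out.println(java.util.Arrays.equals(a.array(), b.array()));
    }
}
```

One thing to watch: putChar() and putInt() check the remaining space once for the whole group, so
the overflow handling in the loop would have to test for 2 or 4 free bytes before writing.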

> Sherman (or Alan),
>
> please review and/or file bugs for the above changes.
>
> isBMPCodePoint is a spec addition, requiring additional paperwork.
>
> Sherman, you owe me a response to my now-moldy proposed changes to
> the UTF-8 charset.
>
> The only controversial change would be the change in behavior in
> malformed-utf8, which I can take out.
>

This reminds me of some thoughts I had. To be *exact*, I think malformed should be returned for all
codes that are invalid in the character set concerned. So validate first for unmappable and second
for invalid (= malformed). This costs no performance while looping over mappable and valid
characters, only a little more effort after the loop is interrupted to form the right CoderResult.
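
As a rough illustration of that ordering, here is a sketch for an encoder to ISO-8859-1. It is
just one way to read my proposal, not code from the JDK or from the webrev; the class
CoderResultSketch and its methods are invented for the example. The hot loop tests only
mappability, and the decision between malformed and unmappable is made once, after the loop has
stopped.

```java
import java.nio.charset.CoderResult;

final class CoderResultSketch {

    // Hot loop: only asks "is this char mappable to ISO-8859-1?".
    static CoderResult firstProblem(char[] src) {
        int i = 0;
        while (i < src.length && src[i] <= 0xFF) {
            i++;
        }
        if (i == src.length) {
            return CoderResult.UNDERFLOW;   // everything was mappable
        }
        return classify(src, i);            // slow path, outside the loop
    }

    // Slow path: decide which CoderResult describes the char that stopped the loop.
    static CoderResult classify(char[] src, int i) {
        char c = src[i];
        if (Character.isHighSurrogate(c)) {
            if (i + 1 == src.length) {
                return CoderResult.UNDERFLOW;               // pair may continue in later input
            }
            if (Character.isLowSurrogate(src[i + 1])) {
                return CoderResult.unmappableForLength(2);  // valid pair, not in the charset
            }
            return CoderResult.malformedForLength(1);       // high surrogate without a partner
        }
        if (Character.isLowSurrogate(c)) {
            return CoderResult.malformedForLength(1);       // stray low surrogate
        }
        return CoderResult.unmappableForLength(1);          // valid char, not in the charset
    }

    public static void main(String[] args) {
        System.out.println(firstProblem("abc".toCharArray()));
        System.out.println(firstProblem("a\u20ACb".toCharArray()));
        System.out.println(firstProblem("a\uD800b".toCharArray()));
    }
}
```

Mappable input never reaches classify(), so the extra checks cost nothing in the common case.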

-Ulf