Sponsor for 6666666: A better implementation of Character.isSupplementaryCodePoint
Ulf Zibis
Ulf.Zibis at gmx.de
Thu Mar 11 21:14:10 UTC 2010
On 11.03.2010 20:38, Martin Buchholz wrote:
> Ulf, your changes would be easier to get in
> if they were organized as mq patch files that
> could be qimported into an existing mq repo.
>
To be honest, I have never heard of mq. Can you point me to some docs, please?
> I've done that below, which includes a subset of
> your own proposed changes:
>
> http://cr.openjdk.java.net/~martin/webrevs/openjdk7/isSupplementaryCodePoint/
> http://cr.openjdk.java.net/~martin/webrevs/openjdk7/public-isBMPCodePoint/
>
- Maybe better: "... using a single {@code char}".
- Why don't you like using the new isBMPCodePoint() in isSupplementaryCodePoint() and
toUpperCaseCharArray()?
- The same shift magic would improve isISOControl(), isHighSurrogate() and isLowSurrogate(), in
particular if the latter are called consecutively.
An 8-bit shift + compare would allow HotSpot to compile to compact op-codes with 1-byte immediates.
- Don't you think my notes on validity are worth adding? (Or as a separate bug?)
- Changing ch <= MAX_SURROGATE to ch < MAX_SURROGATE + 1 would allow the HotSpot compiler to
optimize away 1 branch if those methods are used consecutively.
- And lastly, I would like to make the constants complete (= adding MAX_SUPPLEMENTARY_CODE_POINT).
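
To illustrate what "shift magic" means in the points above, here is a minimal sketch. The class and method names are hypothetical and this is not the JDK source or the code in the webrevs; it only assumes the standard Unicode ranges (BMP = plane 0, supplementary = planes 1 to 16, high surrogates U+D800..U+DBFF, low surrogates U+DC00..U+DFFF). Whether HotSpot actually emits shorter op-codes for these forms is not verified here.

```java
// Hypothetical sketch, not the JDK implementation.
final class ShiftChecks {
    // Plane 0: all bits above the low 16 are zero (also rejects negative values).
    static boolean isBmpCodePoint(int cp) {
        return cp >>> 16 == 0;
    }

    // Planes 1..16, i.e. U+10000..U+10FFFF.
    static boolean isSupplementaryCodePoint(int cp) {
        int plane = cp >>> 16;
        return plane != 0 && plane < 17;
    }

    // U+D800..U+DBFF share the same top 6 bits, so one shift and one compare suffice.
    static boolean isHighSurrogate(char ch) {
        return ch >>> 10 == 0xD800 >>> 10;
    }

    // U+DC00..U+DFFF likewise share their top 6 bits.
    static boolean isLowSurrogate(char ch) {
        return ch >>> 10 == 0xDC00 >>> 10;
    }
}
```

Each check replaces a two-sided range comparison with a single shift and compare against a small constant.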
> http://cr.openjdk.java.net/~martin/webrevs/openjdk7/Character-warnings/
>
That reminds me that some months ago I prepared a beautified version of Character's source (things
like the above, replacing <code> with {@code}, fixing indentation inconsistencies, etc.). Would
there be interest in such a patch?
> http://cr.openjdk.java.net/~martin/webrevs/openjdk7/malformed-utf8/
>
In encodeBufferLoop() you could use putChar() and putInt() instead of put(). That should perform better.
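
A minimal sketch of the idea, using a 2-byte UTF-8 sequence purely as an illustration (the class and method names are hypothetical and this is not the code from the webrev). It assumes the destination buffer uses the default big-endian byte order, so putChar() writes the high byte first. Whether it is actually faster would have to be measured.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: two single-byte put() calls vs. one putChar().
final class BulkPutSketch {
    // Encodes a char in U+0080..U+07FF as two UTF-8 bytes, one put() per byte.
    static void encode2WithPut(ByteBuffer dst, char c) {
        dst.put((byte) (0xC0 | (c >> 6)));
        dst.put((byte) (0x80 | (c & 0x3F)));
    }

    // Same bytes, combined into one 16-bit value and written with a single putChar().
    // Requires dst to be in big-endian order (the default for a new ByteBuffer).
    static void encode2WithPutChar(ByteBuffer dst, char c) {
        int hi = 0xC0 | (c >> 6);
        int lo = 0x80 | (c & 0x3F);
        dst.putChar((char) ((hi << 8) | lo));
    }
}
```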
> Sherman (or Alan),
>
> please review and/or file bugs for the above changes.
>
> isBMPCodePoint is a spec addition, requiring additional paperwork.
>
> Sherman, you owe me a response to my now-moldy proposed changes to
> the UTF-8 charset.
>
> The only controversial change would be the change in behavior in
> malformed-utf8, which I can take out.
>
This reminds me of some thoughts. To be *exact*, I think malformed should be returned for all
codes that are invalid in the respective character set. So validate first for unmappable and second
for invalid (= malformed). This doesn't cost any performance while looping over mappable and valid
characters, but takes a little more effort after the loop is interrupted to form the right CoderResult.
-Ulf