Codereview request: CR 7040220 java/char_encodin Optimize UTF-8 charset for String.getBytes()/toCharArray()

Thu Apr 28 22:46:41 UTC 2011

On 28.04.2011 23:28, Xueming Shen wrote:
> On 04/28/2011 01:55 PM, Ulf Zibis wrote:
>> On 28.04.2011 21:56, Xueming Shen wrote:
>>> That said, you do have the point, we should do better even in
>>> malformed case, ...
>> Yes, that's what I wanted to point out.
>> But I thought you could go one step further and declare bb as a member of UTF_8.Decoder. Then it 
>> would have to be guaranteed that a decoder is used by only one thread at a time. I don't know 
>> whether that is the case for the typical use cases?
>
> Why do you want to "re-use" a ByteBuffer object cross decode(byte[]...) invocations?
> I don't see any benefit of doing that.
Thinking again, I see my error. It's not reusable, because its size is always different, so the 
question about the benefit seems obsolete. The benefit could have been: if the strings are fairly 
short AND the malformed case is fairly frequent, instantiating a new ByteBuffer on every call could 
decrease the overall performance by some percentage.
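
To illustrate why a cached buffer would not help, here is a minimal sketch. It assumes bb is 
created by wrapping the input range with ByteBuffer.wrap(), which this thread does not spell out. 
WrapPerCall and wrapRange are names made up for the example.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not code from UTF_8.Decoder.
final class WrapPerCall {
    // Wraps exactly the requested range: position = off, limit = off + len.
    static ByteBuffer wrapRange(byte[] ba, int off, int len) {
        return ByteBuffer.wrap(ba, off, len);
    }

    public static void main(String[] args) {
        byte[] a = new byte[16];
        byte[] b = new byte[64];
        ByteBuffer first = wrapRange(a, 0, 16);
        ByteBuffer second = wrapRange(b, 8, 32);
        // A wrapped buffer stays bound to its backing array, so "first"
        // cannot be re-pointed at b for the next call.
        System.out.println(first.array() == a);
        System.out.println(second.array() == b);
        System.out.println(second.position() + ".." + second.limit()); // prints 8..40
    }
}
```

Since every call sees a different array and range, a buffer held as a decoder field would have to 
be replaced on each call anyway.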

>
>> In http://cr.openjdk.java.net/~mduigou/4884238/2/webrev/ I've seen the change to use a constant 
>> Charset object instead of a constant charset name in some method calls. From your benchmark it 
>> seems that using constant charset names gives a small performance gain (0..25 %), so I don't see 
>> the benefit of the changes from 4884238, which go in the opposite direction.
>>
>
> That is a totally different topic:-)
>
> Yes, you don't benefit from using a "Charset object" when doing String.getBytes()/toCharArray()
> because of our caching optimization in StringCoding class. But that is a pure implementation
> detail.
I think this fact should be mentioned in the javadoc of String.getBytes() etc. I guess the average 
programmer would expect the StandardCharset.UTF_8 version to be faster than the csn version.
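
For reference, the two variants under discussion look like this from the caller's side. This is a 
small sketch that uses java.nio.charset.StandardCharsets, the name the class has in released JDKs 
(the thread calls it StandardCharset). GetBytesVariants, byName and byCharset are made-up names. 
The sketch shows only that both calls yield the same bytes; it says nothing about relative speed.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of the "csn" and "cs" variants of String.getBytes().
final class GetBytesVariants {
    // "csn" variant: charset given by name, declares a checked exception.
    static byte[] byName(String s) throws UnsupportedEncodingException {
        return s.getBytes("UTF-8");
    }

    // "cs" variant: charset given as a Charset object, no checked exception.
    static byte[] byCharset(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "Gr\u00fc\u00dfe";
        // Both variants produce identical UTF-8 bytes.
        System.out.println(Arrays.equals(byName(s), byCharset(s)));
    }
}
```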

> It's safe to say that java.nio.cs.StandardCharset is not for String.getBytes()/toCharArray()
> only, so the fact that "cs" variant of String.getBytes()/toCharArray() is "slower" than its "csn"
> variant arguably might not be a very strong/supportive material for that discussion:-)
So what prevents us from applying the same caching optimization in ZipCoder and similar classes?
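
As a rough idea of what such caching could look like outside StringCoding, here is a sketch that 
keeps one decoder per thread. It is an assumption for illustration, not the StringCoding 
implementation and not a proposed ZipCoder patch. CachedDecoder and its methods are made-up names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not JDK code.
final class CachedDecoder {
    // One decoder per thread, because CharsetDecoder is not safe for concurrent use.
    private final ThreadLocal<CharsetDecoder> cache;

    CachedDecoder(Charset cs) {
        this.cache = ThreadLocal.withInitial(cs::newDecoder);
    }

    // Returns the calling thread's cached decoder.
    CharsetDecoder current() {
        return cache.get();
    }

    String decode(byte[] ba, int off, int len) throws CharacterCodingException {
        // decode(ByteBuffer) resets the decoder itself before decoding.
        return current().decode(ByteBuffer.wrap(ba, off, len)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        CachedDecoder d = new CachedDecoder(StandardCharsets.UTF_8);
        byte[] ba = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(d.decode(ba, 0, ba.length));
        // The same decoder instance is reused on this thread.
        System.out.println(d.current() == d.current());
    }
}
```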


- ZipCoder.isutf8 is unreadable. Better: isUTF8

- ArrayDecoder.decode(ba, 0, length, ca) could throw MalformedInput/UnmappableCharacterException 
instead of returning -1. Benefits:
-- avoids translating -1 to IllegalArgumentException("MALFORMED") in ZipCoder etc.
-- more precise exception
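
The difference between the two styles can be sketched with the public CharsetDecoder API. 
ArrayDecoder is JDK-internal, so the code below is only an analogy under that assumption. 
DecodeErrorStyles, decodeOrMinusOne and decodeOrThrow are made-up names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; does not use the internal ArrayDecoder interface.
final class DecodeErrorStyles {
    private static CharsetDecoder strictUtf8() {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    // Exception style: returns the number of chars written to ca (which must
    // be large enough), or throws a specific CharacterCodingException.
    static int decodeOrThrow(byte[] ba, int off, int len, char[] ca)
            throws CharacterCodingException {
        CharBuffer out = strictUtf8().decode(ByteBuffer.wrap(ba, off, len));
        int n = out.remaining();
        out.get(ca, 0, n);
        return n;
    }

    // Sentinel style: every failure collapses into -1.
    static int decodeOrMinusOne(byte[] ba, int off, int len, char[] ca) {
        try {
            return decodeOrThrow(ba, off, len, ca);
        } catch (CharacterCodingException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        byte[] bad = { (byte) 'a', (byte) 0xFF }; // 0xFF never occurs in valid UTF-8
        char[] ca = new char[bad.length];
        if (decodeOrMinusOne(bad, 0, bad.length, ca) == -1) {
            System.out.println("sentinel style: caller only knows MALFORMED");
        }
        try {
            decodeOrThrow(bad, 0, bad.length, ca);
        } catch (MalformedInputException e) {
            System.out.println("exception style: malformed input, length " + e.getInputLength());
        } catch (CharacterCodingException e) {
            System.out.println("exception style: " + e);
        }
    }
}
```

With the exception style the caller can tell malformed input apart from unmappable characters and 
does not need to map -1 to a generic exception.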


-Ulf