Codereview request: CR 7040220 java/char_encodin Optimize UTF-8 charset for String.getBytes()/toCharArray()

Thu Apr 28 19:56:34 UTC 2011

On 04/28/2011 05:44 AM, Ulf Zibis wrote:
>
> In malformed(byte[] src, int sp, int nb) I think you could cache the 
> ByteBuffer bb, instead of instantiating a new one every time. For this 
> the method should not be static, to ensure thread-safety.

I was assuming that in a scenario where you have malformed byte(s) in your
input during String.toCharArray()/getBytes() coding, performance is probably
no longer your top priority. That said, you do have a point: we should do
better even in the malformed case, and wrapping the input bytes every time a
malformed byte is found is definitely not preferred. The webrev has been
updated to "cache" a ByteBuffer wrapper object for each round of
decode/encode() operation, created only when necessary (that is, when
malformed input is detected).
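
To illustrate the idea only (this is a rough sketch, not the code in the
webrev): the class, field and method names below are made up, and the toy
loop simply treats 0xFF as malformed, since that byte never appears in valid
UTF-8. The wrapper is created on the first malformed byte of a round and
reused afterwards. Keeping it in an instance field rather than a static one
is what makes per-decoder use safe.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: wrap the source array at most once per decode round.
class LazyWrapSketch {
    private ByteBuffer cached;   // created only if malformed input is seen
    int wrapCount = 0;           // counts allocations, for demonstration only

    // Called at the start of each decode round to drop the old wrapper.
    void beginRound() {
        cached = null;
    }

    // Returns a buffer over src positioned at sp, reusing the cached wrapper.
    ByteBuffer wrapAt(byte[] src, int sp) {
        if (cached == null) {
            cached = ByteBuffer.wrap(src);
            wrapCount++;
        }
        cached.position(sp);
        return cached;
    }

    // Toy decode loop: 0xFF stands in for "malformed" (assumption for the demo).
    int countMalformed(byte[] src) {
        beginRound();
        int malformed = 0;
        for (int sp = 0; sp < src.length; sp++) {
            if ((src[sp] & 0xFF) == 0xFF) {
                // A real decoder would inspect this buffer to size the
                // malformed sequence; here we only fetch it.
                wrapAt(src, sp);
                malformed++;
            }
        }
        return malformed;
    }
}
```

With two malformed bytes in one input, wrapAt() is called twice but
ByteBuffer.wrap() runs only once; clean input never allocates a wrapper.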

http://cr.openjdk.java.net/~sherman/7040220/webrev

(the previous one is at 
http://cr.openjdk.java.net/~sherman/7040220/webrev.00)

Thanks,
-Sherman

>
>
>
> On 28.04.2011 08:34, Xueming Shen wrote:
>> Hi,
>>
>> This is motivated by Neil's request to optimize the common-case UTF8 path
>> for native ZipFile.getEntry calls [1].
>> As I said in my reply [2], I believe a better approach might be to "patch"
>> the UTF8 charset directly to implement the sun.nio.cs.ArrayDecoder/Encoder
>> interfaces, to speed up array-based encoding/decoding under certain
>> circumstances, as we did for all single-byte charsets in #6636323 [3]. I
>> have an old blog entry [4] that has some data for this optimization.
>>
>> The original plan was to do the same thing for our new UTF8 [5] as well
>> in JDK7, but then (excuse, excuse) I was just too busy to come back to
>> this topic until 2 days ago. After two days of small tweaks here and there,
>> and testing the corner cases I can think of, I'm happy with the result and
>> think it might be worth sending out for a codereview for JDK7, knowing we
>> only have a couple of days left.
>>
>> The webrev is at
>>
>> http://cr.openjdk.java.net/~sherman/7040220/webrev
>>
>> Those tests are supposed to make sure the coding results from the new
>> paths for String.getBytes()/toCharArray() match the results from the
>> existing implementation.
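
A check of this kind could look roughly like the sketch below. It is not the
test from the webrev: the class and method names are invented, and it assumes
well-formed input only, because String replaces malformed input while a
default CharsetEncoder/CharsetDecoder reports it as an error. It compares the
String-based paths against the general java.nio.charset coder API.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: compare String coding with the CharsetEncoder/Decoder path.
class CodingConsistencySketch {
    static final Charset UTF8 = StandardCharsets.UTF_8;

    // True if String.getBytes() and a fresh CharsetEncoder agree on s.
    static boolean encodeMatches(String s) {
        try {
            byte[] viaString = s.getBytes(UTF8);
            ByteBuffer bb = UTF8.newEncoder().encode(CharBuffer.wrap(s));
            byte[] viaEncoder = new byte[bb.remaining()];
            bb.get(viaEncoder);
            return Arrays.equals(viaString, viaEncoder);
        } catch (CharacterCodingException e) {
            // Only well-formed input is expected in this sketch.
            throw new IllegalArgumentException(e);
        }
    }

    // True if new String(bytes, UTF8) and a fresh CharsetDecoder agree on bytes.
    static boolean decodeMatches(byte[] bytes) {
        try {
            char[] viaString = new String(bytes, UTF8).toCharArray();
            CharBuffer cb = UTF8.newDecoder().decode(ByteBuffer.wrap(bytes));
            char[] viaDecoder = new char[cb.remaining()];
            cb.get(viaDecoder);
            return Arrays.equals(viaString, viaDecoder);
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Feeding it a string that mixes 1-, 2-, 3- and 4-byte characters, such as
"A\u00E9\u20AC\uD83D\uDE00", exercises every UTF-8 sequence length.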
>>
>> The performance results of running StrCodingBenchmarkUTF8 (included in
>> the webrev) on my Linux box, in -client and -server mode respectively,
>> are at
>>
>> http://cr.openjdk.java.net/~sherman/7040220/client
>> http://cr.openjdk.java.net/~sherman/7040220/server
>>
>> The microbenchmark measures 1-byte, 2-byte, 3-byte and 4-byte UTF-8
>> sequences separately, with different lengths of data (from 12 bytes to
>> thousands).
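
For readers unfamiliar with the four widths, here is a small sketch of how
per-width input could be built. This is not StrCodingBenchmarkUTF8 itself;
the class name, helper names and the sample code points are my own choices.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: build input whose characters all share one UTF-8 width.
class Utf8WidthSamples {
    // Builds a string made of `count` copies of one code point.
    static String repeat(int codePoint, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.appendCodePoint(codePoint);
        }
        return sb.toString();
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

U+0041 encodes to 1 byte, U+00E9 to 2, U+20AC to 3 and U+1F600 to 4, so
repeat(0x41, 12), repeat(0xE9, 6), repeat(0x20AC, 4) and repeat(0x1F600, 3)
each yield 12 bytes of UTF-8.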
>>
>> Thanks!
>> -Sherman
>>
>> [1] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-April/006710.html
>> [2] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-April/006726.html
>> [3] http://cr.openjdk.java.net/~sherman/6636323_6636319/webrev
>> [4] http://blogs.sun.com/xuemingshen/entry/faster_new_string_bytes_cs
>> [5] http://blogs.sun.com/xuemingshen/entry/the_big_overhaul_of_java
>>