Codereview request: CR 7040220 java/char_encodin Optimize UTF-8 charset for String.getBytes()/toCharArray()
Alan Bateman
Alan.Bateman at oracle.com
Mon May 2 14:31:16 UTC 2011
Xueming Shen wrote:
> Hi
>
> This is motivated by Neil's request to optimize common-case UTF8 path
> for native ZipFile.getEntry calls [1].
> As I said in my replying email [2] I believe a better approach might
> be to "patch" UTF8 charset directly to
> implement sun.nio.cs.ArrayDecoder/Encoder interface to speed up the
> coding operation for array based
> encoding/decoding under certain circumstance, as we did for all single
> byte charsets in #6636323 [3]. I
> have a old blog [4] that has some data for this optimization.
>
> The original plan was to do the same thing for our new UTF8 [5] as
> well in JDK7, but then (excuse, excuse)
> I was just too busy to come back to this topic till 2 days ago. After
> two days of small tweaking here and there
> and testing those possible corner cases I can think of, I'm happy with
> the result and think it might be
> worth sending it out for a codereview for JDK7, knowing we only have
> couple days left.
>
> The webrev is at
>
> http://cr.openjdk.java.net/~sherman/7040220/webrev
I went through the changes and the approach looks good to me - thanks
for jumping on this. Thanks Ulf for helping review. Also thanks Neil for
reporting this and testing with Sherman's change to verify that it
addresses the performance regression.
Sherman - just a couple of minor comments:
It would be good to put a blank line after the ASCII-only loops so that
future maintainers can easily distinguish the loops (this would make it
consistent with decodeArrayLoop for example).
In several places the test is "if (CodingErrorAction.REPLACE !=
malformedInputAction())". Personally I would swap this to "if
(malformedInputAction() != CodingErrorAction.REPLACE)". This reminds, if
a couple of these slipped through into the zip code recently, eg: "if
(false == inf.ended())", "if (false == streams.isEmpty())".
The caching of the ByteBuffer and getBBuffer for the malformed case
isn't as nice as the original code. Minimally I would move the
declaration of bb so that it's not in the middle of dl and dp, and
rename getBBuffer. Alternatively I would just get rid of it as it
shouldn't be performance critical.
In TestStringCodeUTF8 it might be cleaner to put the body of main into
its own method and then call it once, set the security manager, and then
call it again.
That's all I have,
-Alan.
More information about the core-libs-dev
mailing list