Request for review: Optimize common-case UTF8 path for native ZipFile.getEntry calls
Xueming Shen
xueming.shen at oracle.com
Tue Apr 26 20:14:16 UTC 2011
Hi Neil,
Thanks for looking into this performance problem.
I had assumed, when I worked on changeset 1aaeb8fbe705 [1], that
(1) code conversion is really only a small portion of the overall Zip
read/write operation, and
(2) I would have time to do the same optimization for the UTF_8 charset
as I did for the other singlebyte charsets [2] within the JDK7 time frame.
It appears I was a little optimistic in my planning :-( I just never had
a chance to go back to my charset work after that.
But obviously, if we look only at the utf8 conversion performance, using
the standard UTF_8 charset is slower than the original embedded utf8
conversion methods.
The patch definitely improves the situation. I did a simple measurement
of ZipFile.getEntry(String), as shown below, on a jar file with a single
entry named "abcdefghijklmnopqrstuvwxyz".
import java.util.zip.ZipFile;

public class GetEntryBench {          // illustrative class name
    public static void main(String[] args) throws Throwable {
        ZipFile zf = new ZipFile(args[0]);
        String ename = args[1];
        if (zf.getEntry(ename) == null) {
            System.out.println("no such entry: " + ename);
            System.exit(1);
        }
        // warmup
        for (int i = 0; i < 10000; i++) {
            zf.getEntry(ename);
        }
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            zf.getEntry(ename);
        }
        long t1 = System.currentTimeMillis();
        System.out.printf("time=%d%n", t1 - t0);
        zf.close();
    }
}
It shows a WithoutPatch/WithPatch ratio of 125/107 (this is not an
accurate measurement of the utf8 conversion itself, but it gives us a
rough idea).
However, regarding the approach:
(1) It only covers zc.getBytesUTF8(); we should do the same thing for
zc.toStringUTF8(). And ZipInputStream/ZipOutputStream also use ZipCoder.
(2) It is somewhat against the direction we are planning for a future
release, in which we might want to bring most of the ZipFile access code
up to the Java level from native C code. We already have a pure Java
ZipFile implementation (no zlib, yet) in the zip filesystem code at
src/share/demo/nio/zipfs; we should probably do the same thing for
ZipFile as well. Of course, we will have to keep the native code anyway
for VM access during boot.
(3) Once we put the UTF-8 optimization work in, we would probably have
to back this change out again. I do have the UTF_8 optimization code
ready (not fully tested though), and when I ran the same test above I
got a number around "85".
The webrev for the utf-8 optimization is at
http://cr.openjdk.java.net/~sherman/utf8Str/webrev
with the "improvement" shown at [3] [4].
Any code review volunteers? :-) Maybe it's just too late for 7.
-Sherman
[1] http://hg.openjdk.java.net/jdk7/jdk7/jdk/rev/1aaeb8fbe705
[2] http://blogs.sun.com/xuemingshen/entry/faster_new_string_bytes_cs
[3] http://cr.openjdk.java.net/~sherman/utf8Str/client
[4] http://cr.openjdk.java.net/~sherman/utf8Str/server
On 04/25/2011 09:45 AM, Neil Richards wrote:
> Hi,
> Changeset 1aaeb8fbe705 [1] implemented a good set of enhancements to
> allow the zip code to properly support zip files containing names and
> comments either encoded in non-UTF-8 codesets or containing
> supplementary characters.
>
> This support is achieved by doing the character conversion in Java code
> (using the code in ZipCoder.getBytes()), then feeding the resulting
> byte[] through to native code.
>
> This approach is more costly, both in terms of memory allocation and
> amount of data copying, than the preceding one, which did the conversion
> directly into a native byte[] using JNI's 'GetStringUTFRegion', which
> assumes a codeset of "modified UTF-8".
>
> The conversion performed for "modified UTF-8" is the same as that for
> standard UTF-8 for all codepoints except those for supplementary
> characters or for the character '\u0000' [2].
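The difference Neil describes here is easy to observe from pure Java, since DataOutputStream.writeUTF emits modified UTF-8 (prefixed by a 2-byte length), while String.getBytes with the UTF_8 charset emits standard UTF-8. A small sketch (the class and helper names are illustrative, not from the patch):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Illustrative demo: compares the byte counts produced by standard UTF-8
// (String.getBytes) and modified UTF-8 (DataOutputStream.writeUTF, which
// prepends a 2-byte length before the modified-UTF-8 payload).
public class ModifiedUtf8Demo {
    static int modifiedUtf8Length(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        return bos.size() - 2;                 // drop the 2-byte length prefix
    }

    public static void main(String[] args) throws IOException {
        // '\u0000': one byte (0x00) in standard UTF-8,
        // two bytes (0xC0 0x80) in modified UTF-8
        System.out.println("\u0000".getBytes(StandardCharsets.UTF_8).length); // 1
        System.out.println(modifiedUtf8Length("\u0000"));                     // 2

        // A supplementary character (U+1F600, a surrogate pair in a String):
        // four bytes in standard UTF-8, six in modified UTF-8
        // (each surrogate is encoded separately as three bytes)
        String smiley = new String(Character.toChars(0x1F600));
        System.out.println(smiley.getBytes(StandardCharsets.UTF_8).length);   // 4
        System.out.println(modifiedUtf8Length(smiley));                       // 6
    }
}
```

For every other codepoint the two encodings produce identical bytes, which is what makes the common-case shortcut below possible.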
>
> As the "common-case" is likely to be dealing with zip files with UTF-8
> encoded names and comments containing neither supplementary characters
> nor '\u0000', the increase in cost of the conversion in this case is
> unfortunate.
>
> By introducing an extra check to determine whether a UTF-8 encoded string
> may be safely converted using "modified UTF-8" (i.e. to check that it does
> not contain any supplementary codepoints or '\u0000'), and using the
> aforementioned JNI routine to do the conversion if the check succeeds,
> I've noticed a decent performance improvement can be achieved by
> catering for this common-case.
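The check described above can be sketched as a single pass over the string's chars: since supplementary characters appear in a Java String as surrogate pairs, it suffices to reject '\u0000' and any char in the surrogate range. This is only an illustration of the idea, not the code from the actual patch:

```java
public class ModifiedUtf8Check {
    // Illustrative sketch of the check described above: a String encodes
    // identically under modified and standard UTF-8 iff it contains neither
    // '\u0000' nor any surrogate char (supplementary characters appear in a
    // String as surrogate pairs, U+D800..U+DFFF).
    static boolean isSafeToUseModifiedUTF8(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\u0000' || (c >= '\uD800' && c <= '\uDFFF')) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSafeToUseModifiedUTF8("abcdefghijklmnopqrstuvwxyz")); // true
        System.out.println(isSafeToUseModifiedUTF8("a\u0000b"));                   // false
        System.out.println(isSafeToUseModifiedUTF8(new String(Character.toChars(0x1F600)))); // false
    }
}
```

When the check passes, the bytes produced by GetStringUTFRegion are byte-for-byte what the standard UTF-8 conversion would have produced, so the fast native path is safe.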
>
> This benefit is particularly significant for applications which do a lot
> of rummaging around in zip files, or those involving lots of jar files
> and/or jar files with lots of entries in them.
>
> Please find below a changeset which implements the check I describe
> above ("isSafeToUseModifiedUTF8"), and uses it to choose whether to call
> a new modified-UTF8-specific variant of the native getEntry() method
> ("getEntryByModifiedUTF8"), which makes use of the JNI
> 'GetStringUTFRegion' method for the conversion.
>
> Please let me know if you have any comments, criticisms or suggestions on
> this,
> Thanks,
> Neil
>
> [1] http://hg.openjdk.java.net/jdk7/jdk7/jdk/rev/1aaeb8fbe705
> [2] http://java.sun.com/developer/technicalArticles/Intl/Supplementary/#Modified_UTF-8
>