Reducing Garbage Generated by URLClassLoader
Xueming Shen
xueming.shen at oracle.com
Mon Dec 5 05:58:07 UTC 2016
On 12/4/16, 1:21 PM, Scott Palmer wrote:
> Excuse me if this is the wrong list for this discussion. Please direct me to the right place if this isn’t it.
>
> When doing an analysis of garbage generation in our application we discovered a significant number of redundant strings generated by the class loader. In my case there are hundreds of jars on the classpath - everything in the application is a plugin. I figured on average 10kB of useless garbage char[]s were generated per findResource call for plugin resources.
>
> This is caused mostly by the ZipFile implementation. What is the purpose of java.util.zip.ZipCoder’s byte[] getBytes(String s) method? It seems to simply be a custom implementation of String.getBytes(Charset cs) and as such needs to first make a copy of the char[] to work on.
The "entry name" stored in a zip/jar file is not encoded as a UTF-16
char sequence but as bytes in some "native" encoding. UTF-8 is one of
the encodings ZipFile supports, and it is the default for a jar file.
So when you look up a resource in a jar file by a name given as a
String, we have to encode that name from String into the corresponding
byte[] in UTF-8 and then do a hash table lookup to find the resource.
Here are some implementation details:
(1) Why do we need a "custom" version in ZipFile? Because
String.getBytes(cs) silently replaces unmappable/malformed chars with
"?", whereas the ZipFile API needs to throw a corresponding exception
in this scenario. So we have to have a "custom" version to do that.
(2) For performance reasons we don't want to convert all the entry
names of all open jar files into either String or char[] in advance.
They are kept as byte[] in their original form, and we don't even have
a separate byte[] copy for each entry name: all names stay in their
original "cen" table form in one byte[], and we only keep an offset to
each entry. We are talking about hundreds of jars, each with hundreds
if not thousands of entries. Arguably we could do it the other way
around, always converting the entry names in each open jar file to
String, and then we wouldn't have to do the String->byte[] conversion
during lookup. It's a design decision. If there is enough evidence to
suggest otherwise, it can be changed, which is doable now that the
whole implementation is at the Java level in JDK 9.
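
To make points (1) and (2) concrete, here is a minimal sketch. It is
not the actual ZipFile/ZipCoder code: the class EntryNameTable, its
fields, and the methods encodeStrict and find are hypothetical names
made up for illustration, and the lookup is a linear scan rather than
the hash table mentioned above. It shows a strict encoder that throws
where String.getBytes(cs) would substitute "?", and a lookup that
compares the encoded name against names kept back to back in a single
byte[] and addressed by offset.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch only; not the JDK's ZipFile or ZipCoder implementation.
class EntryNameTable {
    private final byte[] names;   // all entry names, back to back, as raw bytes
    private final int[] offsets;  // start offset of each name within 'names'
    private final int[] lengths;  // byte length of each name

    EntryNameTable(byte[] names, int[] offsets, int[] lengths) {
        this.names = names;
        this.offsets = offsets;
        this.lengths = lengths;
    }

    // Strict UTF-8 encoding: reports malformed/unmappable input by throwing,
    // instead of silently substituting '?' as String.getBytes(Charset) does.
    static byte[] encodeStrict(String s) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            ByteBuffer bb = enc.encode(CharBuffer.wrap(s));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("cannot encode entry name", e);
        }
    }

    // Returns the index of the matching entry, or -1 if there is none.
    // Linear scan for brevity; a real implementation would hash the bytes.
    int find(String name) {
        byte[] key = encodeStrict(name);
        for (int i = 0; i < offsets.length; i++) {
            if (lengths[i] == key.length && regionEquals(offsets[i], key)) {
                return i;
            }
        }
        return -1;
    }

    // Compares 'key' with the bytes of 'names' starting at 'off'.
    private boolean regionEquals(int off, byte[] key) {
        for (int j = 0; j < key.length; j++) {
            if (names[off + j] != key[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Two names, "a.txt" (5 bytes) and "b/C.class" (9 bytes), in one array.
        byte[] names = "a.txtb/C.class".getBytes(StandardCharsets.UTF_8);
        EntryNameTable t =
                new EntryNameTable(names, new int[] {0, 5}, new int[] {5, 9});
        System.out.println(t.find("b/C.class"));
        System.out.println(t.find("missing"));
    }
}
```

Note that every find() call in this sketch still allocates one byte[]
for the encoded name, which is the per-lookup cost under discussion.
The table itself holds no per-entry String or byte[] objects.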
That said, given the optimizations we have done for String in JDK 9, it
might be worth considering a fast path for ASCII-only entry names (I
would assume 99.9%+ of entry names are ASCII-only in the real world).
Converting/encoding such an entry name from String to byte[] should
then take only a simple byte[] copy.
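
A rough sketch of what such a fast path could look like, written
against public APIs only. The class AsciiFastPath and the method
encodeName are hypothetical names, not existing JDK code. Code inside
the JDK could presumably copy the String's internal byte[] directly,
whereas this version has to go through charAt().

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch only; not the JDK's ZipCoder implementation.
class AsciiFastPath {

    // Returns true when every char of 's' is 7-bit ASCII.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) >= 0x80) {
                return false;
            }
        }
        return true;
    }

    // Encodes an entry name as UTF-8 bytes.
    static byte[] encodeName(String s) {
        if (isAscii(s)) {
            // Fast path: for ASCII, each char maps to exactly one UTF-8 byte
            // with the same value, so no encoder is needed.
            byte[] out = new byte[s.length()];
            for (int i = 0; i < out.length; i++) {
                out[i] = (byte) s.charAt(i);
            }
            return out;
        }
        // Slow path: a freshly created encoder reports malformed or
        // unmappable input by throwing rather than substituting '?'.
        try {
            ByteBuffer bb = StandardCharsets.UTF_8.newEncoder()
                    .encode(CharBuffer.wrap(s));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("cannot encode entry name", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeName("java/lang/Object.class").length);
        System.out.println(encodeName("caf\u00e9.txt").length);
    }
}
```

Whether this is actually faster than the general encoder would need to
be measured. The sketch only shows that the ASCII case needs no
char[] copy and no CharsetEncoder.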
sherman
> This combined with the need to operate on byte[] path names internally in the ZipFile implementation means that URLClassLoader generates a lot of unnecessary garbage in a findResource call - proportional to the number of jars on the classpath.
>
> Since JarFile forces the ZipFile to be open with UTF-8 always, if there was some API exposed that took a byte[] for the resource name, all of that extra string copying and encoding could be hoisted out of the loop in sun.misc.URLClassPath. Would it be worth creating an internal class, something like a ‘ClasspathJarFile’, and tweaking ZipFile so the byte[]-based method is protected instead of private?
>
> I also noticed that sun.net.www.ParseUtil.encodePath(String, boolean) usually had nothing useful to do but still made three copies of the string passed in anyway (two char arrays to work on, and the String returned).
>
>
>
> Cheers,
>
> Scott
>