Cache which java classes are in a jar when opening jar the first time during classloading

Sun Aug 30 22:19:15 UTC 2015

Hello,

I have been looking through the JVM source related to class loading.
URLClassLoader#findClass calls URLClassPath#getResource.
URLClassPath creates a "loader" for every entry on the classpath (e.g.
one JarLoader per jar file).

In getResource, it loops through all its loaders in order,
instantiating them lazily.
For example, it will only create a JarLoader and open a jar file
somewhere "farther along" the classpath if it did not find the
resource in any of the prior jars.

As a result, URLClassLoader#findClass and URLClassPath#getResource do a
linear search over the entries on the classpath every time they need to
load a resource.
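
To make the lookup pattern concrete, here is a minimal sketch of it. It
is not the actual URLClassPath code: the class LinearLookupSketch, the
nested Entry type and its "opened" flag are hypothetical stand-ins I made
up for illustration, with a Set of names standing in for a jar's contents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LinearLookupSketch {
    /** Hypothetical stand-in for one classpath entry (e.g. one jar). */
    static final class Entry {
        final Set<String> names;
        boolean opened = false;   // models lazy creation of the loader

        Entry(String... names) {
            this.names = new HashSet<>(Arrays.asList(names));
        }

        boolean contains(String name) {
            opened = true;        // the first query "opens" the entry
            return names.contains(name);
        }
    }

    /** Returns the index of the first entry holding the resource, or -1. */
    static int find(List<Entry> classpath, String name) {
        for (int i = 0; i < classpath.size(); i++) {
            if (classpath.get(i).contains(name)) {
                return i;         // entries after the hit are not touched
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Entry> cp = new ArrayList<>();
        cp.add(new Entry("a/A.class"));
        cp.add(new Entry("b/B.class"));
        cp.add(new Entry("c/C.class"));
        System.out.println("found at index " + find(cp, "b/B.class"));
        System.out.println("third entry opened: " + cp.get(2).opened);
    }
}
```

Every call to find walks the list from the start, so a resource in the
last entry costs one probe per entry before it.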

For a jar file, if there is an index in META-INF, the corresponding
loader can at least figure out right away whether the jar contains a
class.
If not, it searches an internal array/data structure created from the
zipfile central directory (see
jdk/src/share/native/java/util/zip/zip_util.c ZIP_GetEntry - if you
follow the call hierarchy from URLClassPath$JarLoader#getResource, you
end up at this function).

If the jars on the classpath are ordered optimally (the majority of the
classes are in the first few jars), there is not much overhead.
However, when classes are spread across many jars along the
classpath, the JVM spends nontrivial time searching through all of
them.

One possible "solution" would be to build a map from every resource to
the jar/jar loader it belongs to whenever a jar file is opened.
This can be done by iterating over JarFile#entries(), which just reads
the central directory from the jar/zip file (this is read anyway to
create some additional data structures when opening a jar/zip file).
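
A rough sketch of the idea follows. It is not my actual patch to
URLClassPath; the class name JarIndexSketch and its methods find and
opened are hypothetical names for illustration. It assumes the classpath
consists only of jar files, indexes each jar the first time it is opened,
and still opens jars lazily and in classpath order, so the first jar
containing a name wins as before.

```java
import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

class JarIndexSketch {
    private final List<File> jars;                 // classpath order
    private final Map<String, File> index = new HashMap<>();
    private int nextToOpen = 0;                    // jars before this are indexed

    JarIndexSketch(List<File> jars) {
        this.jars = jars;
    }

    /** Returns the first jar on the classpath containing name, or null. */
    File find(String name) throws IOException {
        File hit = index.get(name);
        // On a miss, open further jars one at a time, indexing each fully.
        while (hit == null && nextToOpen < jars.size()) {
            indexJar(jars.get(nextToOpen++));
            hit = index.get(name);
        }
        return hit;
    }

    /** Number of jars opened so far. */
    int opened() {
        return nextToOpen;
    }

    private void indexJar(File jar) throws IOException {
        try (JarFile jf = new JarFile(jar)) {
            Enumeration<JarEntry> entries = jf.entries();
            while (entries.hasMoreElements()) {
                // putIfAbsent keeps the earlier jar when a name is duplicated.
                index.putIfAbsent(entries.nextElement().getName(), jar);
            }
        }
    }
}
```

Once a jar has been indexed, any resource it contains is found with a
single map lookup; only names not seen yet cause further jars to be
opened.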

I implemented this to try it out, and for a Java program with ~1800
classes it improved the find class time (taken from
sun.misc.PerfCounter.getFindClassTime()) from ~1.4s to ~1s.

I tried to think of reasons why this was not done already; looking
through the code, I believe the semantics of the loaders remain the
same.
There is technically a memory overhead of saving this map of resources
-> jar files/loaders, but it improves the lookup complexity from
O(number of jars on classpath) to O(1) once the jars have been opened.

Would appreciate any feedback/insight as to whether this would be a
good change or why it is the way it currently is.
Thank you!

Best regards,
Adrian