Cache which java classes are in a jar when opening jar the first time during classloading

Mon Aug 31 06:01:26 UTC 2015

Hi Jonathan,

I'm not aware of any specific facility for detecting that a jar file
was modified.
I believe that, as it stands, the JVM doesn't update its internal
structures (the jzfile, jzcells and jzentries) if a jar is modified by
an outside source.

Even if I am wrong about that (I don't see any code in the class
loading or jar file handling that refreshes these structures after an
outside modification), a class -> jar map could still be used to
figure out which jar to start at. Something like:

Loader l = cmap.get(name);
if (l != null) {
    Resource res = l.getResource(name);
    if (res != null) {
        return res;
    } else {
        // invalidate cache
        cmap.remove(name);
    }
}

// existing code
for (int i = 0; (loader = getLoader(i)) != null; i++) {

So it will "lazily" invalidate a cache entry when it can't actually
find the resource in the cached jar, and then fall back to the
existing linear search.
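
A minimal, self-contained sketch of that lazy invalidation, outside the
JDK. Everything here (CachedLookupSketch, Loader, the probes counter) is
a hypothetical stand-in, not URLClassPath's real code, and as a
simplification the map is filled on the first successful lookup rather
than when a jar is opened:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for URLClassPath; none of these names are JDK APIs.
final class CachedLookupSketch {

    // Stand-in for one classpath entry (e.g. one jar).
    static final class Loader {
        final String id;
        final Set<String> names; // kept by reference so callers can simulate a jar changing

        Loader(String id, Set<String> names) {
            this.id = id;
            this.names = names;
        }

        // Returns a fake resource location, or null if this entry lacks the name.
        String getResource(String name) {
            return names.contains(name) ? id + "!/" + name : null;
        }
    }

    private final List<Loader> loaders;
    private final Map<String, Loader> cmap = new HashMap<>();
    int probes = 0; // number of Loader#getResource calls made so far

    CachedLookupSketch(List<Loader> loaders) {
        this.loaders = loaders;
    }

    String getResource(String name) {
        Loader cached = cmap.get(name);
        if (cached != null) {
            probes++;
            String res = cached.getResource(name);
            if (res != null) {
                return res;
            }
            cmap.remove(name); // stale entry: invalidate and fall through
        }
        // Fallback: linear search in classpath order, remembering the hit.
        for (Loader l : loaders) {
            probes++;
            String res = l.getResource(name);
            if (res != null) {
                cmap.put(name, l);
                return res;
            }
        }
        return null;
    }
}
```

A cached hit costs one probe regardless of classpath length; a stale
entry costs one extra probe before the usual linear search. Note that a
cached entry keeps winning as long as the cached jar still contains the
resource, even if an earlier jar gains a copy in the meantime.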

The one case this doesn't handle: if a resource is added to a jar
closer to the beginning of the classpath, you would want to load it
from the earlier jar, but the cache would keep pointing at the later
one.
Of course I don't know for sure, but I can't really imagine use cases
for doing that.
If I understand correctly, classloaders are intended to be extensible
so that people can write custom classloaders for tricky use cases, in
which case they can implement a classloader whose getResource doesn't
use a cache.
However, I think the majority of Java applications with long
classpaths would benefit from this. For example, I believe HDFS has
~100 jars on its classpath.

Thanks for your reply!
Please let me know what you think of this.

Best regards,
Adrian

On Sun, Aug 30, 2015 at 11:02 PM, Jonathan Yu <jawnsy at cpan.org> wrote:
> Hi Adrian,
>
> It's possible for jar files to be modified while the JVM is running - is
> there some facility for detecting that an archive was modified and thus
> invalidating the cache?
>
> Also, I wonder how class data sharing might interact with this, though I'll
> admit that I don't know much about HotSpot (I use the IBM JVM).
>
>
> On Sun, Aug 30, 2015, 18:20 Adrian <withoutpointk at gmail.com> wrote:
>>
>> Hello,
>>
>> I have been looking through the JVM source related to class loading.
>> URLClassLoader#findClass calls URLClassPath#getResource.
>> URLClassPath creates a "loader" for every entry on the classpath (e.g.
>> one JarLoader per jar file).
>>
>> In getResource, it loops through all its loaders in order,
>> instantiating them lazily.
>> For example, it will only create a JarLoader and open a jar file
>> somewhere "farther along" the classpath if it did not find the
>> resource in any of the prior jars.
>>
>> URLClassLoader#findClass and URLClassPath#getResource do a linear
>> search over all the entries on the classpath every time they need to
>> load a resource.
>>
>> For a jar file, if there is an index in META-INF, at least the
>> corresponding loader can figure out if the jar contains a class right
>> away.
>> If not, it searches an internal array/data structure created from the
>> zipfile central directory (see ZIP_GetEntry in
>> jdk/src/share/native/java/util/zip/zip_util.c; if you follow the call
>> hierarchy from URLClassPath$JarLoader#getResource, you end up at this
>> function).
>>
>> If the jars on the classpath are ordered optimally (the majority of the
>> classes are in the first few jars), there is not much overhead.
>> However, when classes are spread across multiple jars along the
>> classpath, the JVM spends nontrivial time searching through all of
>> them.
>>
>> One possible "solution" would be to create a map from each resource to
>> the jar/jar loader it belongs in whenever a jar file is opened.
>> This can be done by iterating over JarFile#entries(), which just reads
>> the central directory from the jar/zip file (which is done anyway to
>> create some additional data structures when opening a jar/zip file).
>>
>> I implemented this to try it out, and for a Java program with ~1800
>> classes it improved the find class time (taken from
>> sun.misc.PerfCounter.getFindClassTime()) from ~1.4s to ~1s.
>>
>> I tried to think of reasons why this was not done already; looking
>> through the code, I believe the semantics of the loaders remain the
>> same.
>> There is technically a memory overhead from saving this map of resources
>> -> jar files/loaders, but it improves the lookup complexity from
>> O(number of jars on classpath) to O(1).
>>
>> Would appreciate any feedback/insight as to whether this would be a
>> good change or why it is the way it currently is.
>> Thank you!
>>
>> Best regards,
>> Adrian
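
For reference, a rough sketch of building the resource -> jar map from
JarFile#entries(), as described in the quoted message above.
JarIndexSketch and index are hypothetical names, not JDK code, and
unlike URLClassPath this opens every jar eagerly instead of lazily; it
is only meant to illustrate the "first jar on the classpath wins"
mapping:

```java
import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper, not part of the JDK.
final class JarIndexSketch {

    // Maps each entry name to the first jar (in classpath order) that contains it.
    static Map<String, File> index(List<File> jars) throws IOException {
        Map<String, File> owner = new HashMap<>();
        for (File jar : jars) {
            try (JarFile jf = new JarFile(jar)) {
                // entries() enumerates the jar's central directory.
                Enumeration<JarEntry> entries = jf.entries();
                while (entries.hasMoreElements()) {
                    JarEntry e = entries.nextElement();
                    if (!e.isDirectory()) {
                        // putIfAbsent keeps the earlier jar, matching classpath order.
                        owner.putIfAbsent(e.getName(), jar);
                    }
                }
            }
        }
        return owner;
    }
}
```

A lookup then becomes a single map get, followed by one getResource call
on the owning jar, instead of a scan over every classpath entry.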