Cache which java classes are in a jar when opening jar the first time during classloading

Mon Aug 31 06:31:44 UTC 2015

On 31/08/2015 4:01 PM, Adrian wrote:
> Hi Jonathan,
>
> I'm not aware of any specific facilities for detecting that a jar file
> was modified.
> I believe that, as it stands, the JVM doesn't update its internal
> structures (the jzfile, jzcells and jzentries) if the jar was modified
> from an outside source.
>
> If I am wrong about that (I don't see any code in class loading/jar
> files related to updating itself when modified from outside, but
> assuming it does do that), a class -> jar map could still be used to
> figure out what jar to start at - something like:
>
> Loader l = cmap.get(name);
> if (l != null) {
>      Resource res = l.getResource(name);
>      if (res != null) {
>          return res;
>      } else {
>          // invalidate cache
>          cmap.remove(name);
>      }
> }
>
> // existing code
> for (int i = 0; (loader = getLoader(i)) != null; i++) {
>
> So it will "lazily" invalidate the cache when it can't actually find
> the resource in a jar
>
> If a resource was added to a jar closer to the beginning of the
> classpath, you would want to load the file from the earlier jar.
> Of course I don't know for sure, but I can't really imagine use cases
> for doing that.

Such behaviour is not prohibited, though, so any caching mechanism would
have to allow for it - and I don't see how it could effectively do so.
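
To make the concern concrete, here is a minimal sketch. It is not the JDK
code: each jar is modelled as an in-memory map, and StaleCacheSketch and all
of its methods are hypothetical names used only for illustration. It
compares the existing linear search with the lazily invalidated cache
proposed above, for the case where a resource is later added to an earlier
jar.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: a "jar" is a map from resource name to contents.
final class StaleCacheSketch {
    private final List<Map<String, String>> classpath = new ArrayList<>();
    // Proposed cache: resource name -> index of the jar it was found in.
    private final Map<String, Integer> cache = new HashMap<>();

    // Appends a jar to the end of the classpath and returns its index.
    int addJar(Map<String, String> entries) {
        classpath.add(new HashMap<>(entries));
        return classpath.size() - 1;
    }

    // Simulates an outside modification of a jar already on the classpath.
    void addEntry(int jarIndex, String name, String contents) {
        classpath.get(jarIndex).put(name, contents);
    }

    // Current behaviour: search every jar in classpath order.
    String linearLookup(String name) {
        for (Map<String, String> jar : classpath) {
            String res = jar.get(name);
            if (res != null) {
                return res;
            }
        }
        return null;
    }

    // Proposed behaviour: try the cached jar first, invalidate on a miss.
    String cachedLookup(String name) {
        Integer idx = cache.get(name);
        if (idx != null) {
            String res = classpath.get(idx).get(name);
            if (res != null) {
                return res;
            }
            cache.remove(name); // lazy invalidation
        }
        for (int i = 0; i < classpath.size(); i++) {
            String res = classpath.get(i).get(name);
            if (res != null) {
                cache.put(name, i);
                return res;
            }
        }
        return null;
    }
}
```

In this model, once a name is cached against a later jar, adding the same
name to an earlier jar changes what linearLookup returns but not what
cachedLookup returns, because the cached jar still contains the resource and
so the lazy invalidation never fires.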

> If I understand correctly, classloaders are intended to be extensible
> so that people can write their custom classloaders for tricky use
> cases - in which case they can implement a classloader whose
> getResource doesn't use a cache

I think this would have to be the other way around. If your classloader 
constrains dynamic changes to how it locates classes then it can 
implement a cache.

I believe the upcoming module system would address this issue to some 
extent.

David
-----

> However, I think the majority of java applications with long
> classpaths would benefit from this - for example, I believe HDFS has
> ~100 jars on its classpath
>
> Thanks for your reply!
> Please let me know what you think of this
>
> Best regards,
> Adrian
>
>
>
> On Sun, Aug 30, 2015 at 11:02 PM, Jonathan Yu <jawnsy at cpan.org> wrote:
>> Hi Adrian,
>>
>> It's possible for jar files to be modified while the JVM is running - is
>> there some facility for detecting that an archive was modified and thus
>> invalidating the cache?
>>
>> Also, I wonder how class data sharing might interact with this, though I'll
>> admit that I don't know much about HotSpot (I use the IBM JVM).
>>
>>
>> On Sun, Aug 30, 2015, 18:20 Adrian <withoutpointk at gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> I have been looking through the JVM source related to class loading.
>>> URLClassLoader#findClass calls URLClassPath#getResource
>>> URLClassPath creates a "loader" for every entry on the classpath (e.g.
>>> one JarLoader per jar file)
>>>
>>> In getResource, it loops through all its loaders in order,
>>> instantiating them lazily.
>>> For example, it will only create a JarLoader and open a jar file
>>> somewhere "farther along" the classpath if it did not find the
>>> resource in all the prior jars
>>>
>>> URLClassLoader#findClass and URLClassPath#getResource are doing linear
>>> searches on all the entries on the classpath every time they need to
>>> load a resource
>>>
>>> For a jar file, if there is an index in META-INF, at least the
>>> corresponding loader can figure out if the jar contains a class right
>>> away.
>>> If not, it searches an internal array/data structure created from the
>>> zipfile central directory (see
>>> jdk/src/share/native/java/util/zip/zip_util.c ZIP_GetEntry - if you
>>> follow the call hierarchy from URLClassPath$JarLoader#getResource, you
>>> end up at this function)
>>>
>>> If the jars on the classpath are optimal (majority of the classes are
>>> in the first few jars), there is not much overhead.
>>> However, when classes are located in multiple jars along the
>>> classpath, the JVM spends nontrivial time searching through all of
>>> them
>>>
>>> One possible "solution" would be to create a map of all resources ->
>>> which jar/jar loader they belong in whenever a jar file is opened.
>>> This can be done by iterating over JarFile#entries(), which just reads
>>> the central directory from the jar/zip file (which is done anyways to
>>> create some additional data structures when opening a jar/zip file)
>>>
>>> I implemented this to try it out and for a java program with ~1800
>>> classes, it improved the find class time (taken from
>>> sun.misc.PerfCounter.getFindClassTime()) from ~1.4s to ~1s
>>>
>>> I tried to think of reasons why this was not done already; looking
>>> through the code, I believe the semantics of the loaders remain the
>>> same.
>>> There is technically a memory overhead of saving this map of resources
>>> -> jar files/loaders, but it improves the complexity of each lookup
>>> from O(number of jars on classpath) to O(1)
>>>
>>> Would appreciate any feedback/insight as to whether this would be a
>>> good change or why it is the way it currently is.
>>> Thank you!
>>>
>>> Best regards,
>>> Adrian
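
For reference, the resource -> jar map described in the original message
could be built roughly as follows. This is only a sketch under my own
assumptions, not the patch Adrian tested: JarIndexSketch and indexJar are
hypothetical names, and it assumes jars are indexed in classpath order so
that the first jar containing a name keeps ownership of it.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper, not part of the JDK.
final class JarIndexSketch {
    // Records every non-directory entry of `jar` in `owner`.
    // putIfAbsent keeps the earliest jar for a given name, matching
    // classpath order as long as jars are indexed in that order.
    static void indexJar(Path jar, Map<String, Path> owner) throws IOException {
        try (JarFile jf = new JarFile(jar.toFile())) {
            // entries() enumerates the zip central directory.
            Enumeration<JarEntry> entries = jf.entries();
            while (entries.hasMoreElements()) {
                JarEntry e = entries.nextElement();
                if (!e.isDirectory()) {
                    owner.putIfAbsent(e.getName(), jar);
                }
            }
        }
    }
}
```

A lookup is then a single owner.get(name) instead of a walk over every
loader, at the cost of one map entry per resource and with the staleness
caveat discussed above.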