URLClassPath does unnecessary linear search through every jar and jar entry to find resource

Tue Sep 15 15:31:13 UTC 2015

Hi Adrian,

I looked briefly at your code and claims and have some comments inline...

On 09/13/2015 11:08 AM, Adrian wrote:
> Hi,
>
> I posted about this earlier (few weeks ago), and got some responses
> about concerns which I addressed in my last email, though I didn't
> hear back about it.
> My apologies if I shouldn't be sending this; I'm not sure what the
> protocol is about this stuff
>
> Classloading on a standard Java application with jars on the classpath
> currently does a linear search through every jar on the classpath, and
> every entry in a jar, for every class loaded.
> As URLClassPath searches for an entry/resource/class, it's possible to
> cache each entry it encounters -> where to find it, so in the future
> if a resource has already been seen we don't need to repeat the ~2d
> search
>
> Original thread (august list):
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-August/035009.html
> Last message (september list):
> http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-September/035016.html
>
> I got 3 responses: 2 concerning changes to jars at runtime
> (invalidating cache), and 1 saying you're not supposed to modify jars
> at runtime (can confirm via source code, and manual testing - it
> crashes the JVM)
>
> In my last message I addressed
> - jars being modified (which you're not supposed to do; the current
> classloader does not handle this) or the classpath changing (only
> possible if you make public fields/methods via reflection, and this
> could easily be handled gracefully)
> - some details of the finding resource process (e.g. the meta index,
> when the cache for jar entries can't be used because of the semantics
> of other loaders/types of entries on the classpath)
> - a reference implementation of caching that I believe is simple and
> compliant with existing functionality
> - some basic numbers on performance
>
> ---
> So in this email I wanted to explain the problem again, hopefully more
> clearly now
>
> URLClassPath is used by URLClassLoader to find classes, though it
> could be used for finding any resource on a classpath.
> URLClassPath keeps an array of URLs, which are typically folders or
> jars on the local filesystem.
> They can be http or ftp or other files, but that's not
> relevant/doesn't affect this problem
>
> To find a resource/class (URLClassPath#getResource), it:
> 1. Loops through the URLs in order
> 2. Creates Loader objects for each URL lazily (URLClassPath$FileLoader
> or URLClassPath$JarLoader). So if the Loader for the first URL finds
> all the resources, Loaders for the remaining entries on the classpath
> are never created/looked at
> 3. Calls Loader#getResource and returns the resource if found
> (otherwise keep searching)
>
> URLClassPath$JarLoader creates its corresponding JarFile either in the
> constructor or in getResource (depending on the meta index - the
> details I explained in my last email I won't repeat)
> When a JarFile is created, it opens the jar file on the file system,
> reads the central directory of the jar/zip file, and creates an
> internal linked list with all its entries

It's not actually a linked list, but a hash table. See 
jdk/src/java.base/share/native/libzip/zip_util.[c,h] ...

the jzfile.table is an array of jint mapping (entry-name-hash-code % 
table.length) -> index into jzfile.entries
the jzfile.entries is an array of jzcell which represent heads of hash 
buckets that are linked with jzcell.next

So look-ups into individual jar/zip files should be O(1). But each 
look-up does cross the Java/JNI boundary, so it has a fixed JNI overhead 
too. If there are lots of jar files in class path, you pay for the 
unsuccessful hash-table look-ups before finding the resource. This 
overhead is not that big, but increases with the number of jar files in 
class path. Each look-up into class path is therefore O(N) where N = # 
of jar files...

>
> JarFile objects are immutable; you can only open them for read/delete
> (see constructor API
> http://docs.oracle.com/javase/7/docs/api/java/util/jar/JarFile.html#JarFile(java.io.File,%20boolean,%20int)
> ), they do not detect if the file has been modified externally, and
> you only "append" or "delete" entries by creating a new jar
> (JarOutputStream)
>
> When URLClassPath$JarLoader#getResource calls JarFile#getEntry; in the
> C code it searches through the linked list
> (jdk/src/share/native/java/util/zip/zip_util.c, ZIP_GetEntry - jar
> files are just zip files, and the java JarFile object just extends
> ZipFile)
>
> Since the order in which jar files and jar entries are searched is
> invariant, we can create a map of resource -> first jar which contains
> it
>
> However, we don't want to introduce additional overhead.
> When a JarFile is created, it already builds the internal linked list
> - it's O(number of entries)
> I propose that after the JarLoader creates the JarFile, it iterates
> through its entries and adds them to the cache (if the map does not
> already contain the resource, i.e. an earlier jar contains the
> resource)
> This adds a small overhead when instantiating loaders - but creating
> the JarLoader/JarFile is still technically O(number of entries), and
> now getResource is constant time, instead of requiring a linear search
> through every jar and the linked list of entries for each jar
> (O(number of jars * entries/jar))

Have you measured your entry iteration and cache initialization 
overhead? When iterating over JarEntries, the code invokes 10 native 
methods for each returned JarEntry:

         long jzentry = getNextEntry(jzfile, i++)
         getEntryFlag(jzentry);
         getEntryBytes(jzentry, JZENTRY_NAME)
         getEntryTime(jzentry)
         getEntryCrc(jzentry)
         getEntrySize(jzentry)
         getEntryCSize(jzentry)
         getEntryMethod(jzentry)
         getEntryBytes(jzentry, JZENTRY_EXTRA)
         getEntryBytes(jzentry, JZENTRY_COMMENT)

...it uses CharsetDecoder 1 or 2 times to decode the name and optional 
comment of each JarEntry, it allocates several java objects...

So I have a feeling that initializing your cache when JarFiles are 
constructed, would increase start-up time and not decrease it. It may 
pay-of on the long run though. For achieving short start-up time, JDK 
tries to be as lazy as possible. Your cache goes against that.

ZipFile native code tries hard to keep the memory usage down for 
maintaining the look-up hash table. It only keeps hash codes in the 
table, with offsets into memory mapped ZIP central directory for each 
entry. Your proposed java-side cache is also a hash table, but it's 
memory footprint is much bigger.

I have made a little experiment to see if JNI overhead for negative 
answers from ZipFile.getEntry(name) is really that big. I modified 
ZipFile.java/ZipFile.c to expose access to a look-up hash table that is 
maintained in native code to the java side via two direct ByteBuffers. I 
screen each ZipFile.getEntry(name) with a probe that gives a negative 
answer for majority of negative lookups without invoking JNI methods. 
Here are the changes I made for this experiment:

http://cr.openjdk.java.net/~plevart/jdk9-dev/ZipFile.getEntry/webrev.01/

I have tested the changed JDK with the following JMH benchmark:

http://cr.openjdk.java.net/~plevart/jdk9-dev/ZipFile.getEntry/ZipFileBench.java

This benchmark measures the ZipFile.getEntry(name) method on 2 jar files 
(rt.jar ~20K entries, idea.jar ~40K entries) looking up entries in 1st 
jar file with names of the entries taken from the 2nd jar file:

Original:

Benchmark                        (_zipTuple)  Mode  Samples Score    
Error  Units
j.t.ZipFileBench.getEntry      rt.jar:rt.jar  avgt        8  721.536 ± 
17.344  ns/op
j.t.ZipFileBench.getEntry    rt.jar:idea.jar  avgt        8  501.451 ± 
10.298  ns/op
j.t.ZipFileBench.getEntry    idea.jar:rt.jar  avgt        8  423.513 ± 
23.268  ns/op

Patched:

Benchmark                        (_zipTuple)  Mode  Samples Score     
Error  Units
j.t.ZipFileBench.getEntry      rt.jar:rt.jar  avgt        8  743.281 ±  
13.467  ns/op
j.t.ZipFileBench.getEntry    rt.jar:idea.jar  avgt        8  392.569 ±   
7.710  ns/op
j.t.ZipFileBench.getEntry    idea.jar:rt.jar  avgt        8  333.249 ±  
14.069  ns/op

The rt.jar:rt.jar therefore gives the score for successful look-ups, 
while rt.jar:idea.jar and idea.jar:rt.jar give the score for 
unsuccessful look-ups.

The screening of JNI call improves unsuccessful lookup by 20-25% while 
not hurting much the successful lookup. The successful look-up could be 
improved so that it wouldn't have any overhead by utilizing the result 
of the screening probe that must now be re-computed in native code once 
more (see TODO).

So is this worth the trouble? I don't know.

Adrian, does this patch improve your situation (don't forget to set the 
system property -Dsun.zip.useNativeTableProbe=true to enable the 
feature)? This patch should not hurt startup performance, as it does not 
maintain or initialize any additional data structures (besides 2 
ByteBuffer objects per ZipFile that only expose native memory to Java code).

Regards, Peter

>
> There are several caveats when the cache cannot be used with non-jar
> URLs on the classpath, and the meta index, but I explain them in my
> last email along with comments in the reference implementation
>
> ---
> Regarding modified jars:
> - moved/renamed: the file handle is still valid and it doesn't affect
> the JVM/classloading
> - deleted: the file handle is still valid and it doesn't affect the
> JVM/classloading
> - modified: the JVM crashes
>
> The first two may not be intuitive, but remember that file handles
> point to files; not paths on the filesystem.
> So even though a jar appears renamed in the shell, the java process
> has opened a file, somewhere in the c implementation of file objects
> it has the file handle, and when the JVM does the system call read on
> the file handle say to read the class from the jar file, it all works
> fine
> For what it's worth, here's a stack overflow answer as "source":
> http://stackoverflow.com/questions/2028874/what-happens-to-an-open-file-handler-on-linux-if-the-pointed-file-gets-moved-de
>
> ---
> There is a protected method URLClassLoader#addURL which appends a URL
> to the classpath.
> People could use reflection to make it public.
> Because jars are opened lazily and the cache is also built lazily
> whenever a jar is opened, it doesn't matter if paths are appended
>
> Regarding people making extensive use of reflection to modify the
> order of entries on the classpath, I believe that's irrelevant as
> that's clearly not the semantic of URLClassLoader/URLClassPath.
> People who need custom classloading rules create custom classloaders;
> that's the purpose of classloaders being extensible
>
> ---
> Anyways, I hope this was discussion worthy.
> I've looked much into this and believe I haven't missed anything, but
> if someone knows why it hasn't/can't be done any insight would be
> appreciated!
> Alan from the last email thread said "There was a lot of experiments
> and prototypes in the past along these lines" - are the results
> public?
> He also mentioned improving classloading in Java's upcoming module
> system (originally planned for Java 7, currently delayed to Java 9),
> but I believe the algorithmic complexity and performance of
> URLClassLoader could be improved without complicated changes
>
> Please let me know what you think, and thanks for your time!
>
> Best regards,
> Adrian