URLClassPath does unnecessary linear search through every jar and jar entry to find resource

Sun Sep 13 09:08:52 UTC 2015

Hi,

I posted about this earlier (a few weeks ago) and got some responses
raising concerns, which I addressed in my last email, though I didn't
hear back about it.
My apologies if I shouldn't be sending this; I'm not sure what the
protocol is for this sort of thing.

Classloading in a standard Java application with jars on the classpath
currently does a linear search through every jar on the classpath, and
every entry in each jar, for every class loaded.
As URLClassPath searches for an entry/resource/class, it could cache
each entry it encounters along with where to find it, so that a
resource that has already been seen does not require repeating the
roughly two-dimensional (jars x entries) search.

Original thread (August list):
http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-August/035009.html
Last message (September list):
http://mail.openjdk.java.net/pipermail/core-libs-dev/2015-September/035016.html

I got 3 responses: 2 concerning changes to jars at runtime (which
would invalidate the cache), and 1 saying you're not supposed to
modify jars at runtime (I can confirm this via the source code and
manual testing: it crashes the JVM).

In my last message I addressed:
- jars being modified (which you're not supposed to do; the current
classloader does not handle this) or the classpath changing (only
possible if you make fields/methods public via reflection, and this
could easily be handled gracefully)
- some details of the resource-finding process (e.g. the meta index,
and when the cache for jar entries can't be used because of the
semantics of other loaders/types of entries on the classpath)
- a reference implementation of caching that I believe is simple and
compliant with existing functionality
- some basic numbers on performance

---
So in this email I want to explain the problem again, hopefully more
clearly this time.

URLClassPath is used by URLClassLoader to find classes, though it
could be used for finding any resource on a classpath.
URLClassPath keeps an array of URLs, which are typically folders or
jars on the local filesystem.
They can also be http, ftp, or other kinds of URLs, but that doesn't
affect this problem.

To find a resource/class (URLClassPath#getResource), it:
1. Loops through the URLs in order
2. Lazily creates a Loader object for each URL
(URLClassPath$FileLoader or URLClassPath$JarLoader). So if the Loader
for the first URL finds all the resources, Loaders for the remaining
entries on the classpath are never created/looked at
3. Calls Loader#getResource and returns the resource if found
(otherwise it keeps searching)
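
The three steps above can be modelled with a small sketch. This is a
simplified illustration, not the actual URLClassPath source: the
class name LinearLookupSketch, the Loader interface and its
hasResource method are hypothetical, and each classpath element is
represented as a plain set of entry names rather than a real URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Simplified model of the lookup loop; not the actual URLClassPath source. */
final class LinearLookupSketch {
    /** Hypothetical stand-in for URLClassPath$Loader. */
    interface Loader {
        boolean hasResource(String name);
    }

    // One set of entry names per classpath URL, in classpath order.
    private final List<Set<String>> classpath;
    // Loaders created so far; loader i corresponds to classpath element i.
    private final List<Loader> loaders = new ArrayList<>();

    LinearLookupSketch(List<Set<String>> classpath) {
        this.classpath = classpath;
    }

    /** Returns the index of the first classpath element holding the resource, or -1. */
    int find(String name) {
        for (int i = 0; i < classpath.size(); i++) {   // step 1: URLs in order
            if (i == loaders.size()) {                  // step 2: create the Loader lazily
                Set<String> entries = classpath.get(i);
                loaders.add(entries::contains);
            }
            if (loaders.get(i).hasResource(name)) {     // step 3: first hit wins
                return i;
            }
        }
        return -1;
    }

    /** Number of Loaders that have been created so far. */
    int loadersCreated() {
        return loaders.size();
    }
}
```

Every call to find starts again from the first element, which is the
repeated search the rest of this email is about.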

URLClassPath$JarLoader creates its corresponding JarFile either in the
constructor or in getResource (depending on the meta index; I
explained the details in my last email and won't repeat them here).
When a JarFile is created, it opens the jar file on the file system,
reads the central directory of the jar/zip file, and creates an
internal linked list with all its entries.

JarFile objects are immutable: you can only open them for read/delete
(see the constructor API
http://docs.oracle.com/javase/7/docs/api/java/util/jar/JarFile.html#JarFile(java.io.File,%20boolean,%20int)
), they do not detect whether the file has been modified externally,
and you can only "append" or "delete" entries by creating a new jar
(JarOutputStream).

When URLClassPath$JarLoader#getResource calls JarFile#getEntry, the C
code searches through the linked list
(jdk/src/share/native/java/util/zip/zip_util.c, ZIP_GetEntry; jar
files are just zip files, and the Java JarFile class just extends
ZipFile).

Since the order in which jar files and jar entries are searched is
invariant, we can create a map of resource -> first jar which contains
it.

However, we don't want to introduce additional overhead.
When a JarFile is created, it already builds the internal linked list,
which is O(number of entries).
I propose that after the JarLoader creates the JarFile, it iterates
through its entries and adds each one to the cache, unless the map
already contains the resource (i.e. an earlier jar contains it).
This adds a small overhead when instantiating loaders, but creating
the JarLoader/JarFile is still O(number of entries), and getResource
becomes constant time instead of requiring a linear search through
every jar and the linked list of entries for each jar
(O(number of jars * entries per jar)).
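
To make the proposal concrete, here is a minimal sketch of the cache
using only public java.util.jar APIs. It is not the reference
implementation from my last email: the class JarEntryCacheSketch and
the helpers writeJar, fileName and demo are hypothetical names used
for illustration, and the sketch ignores the meta index and non-jar
URLs.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

/** Illustrative cache of resource name -> first jar containing it. */
final class JarEntryCacheSketch {
    private final Map<String, JarFile> firstJarByResource = new HashMap<>();

    /** Call once per jar, in classpath order, right after its JarFile is opened. */
    void index(JarFile jar) {
        Enumeration<JarEntry> entries = jar.entries();
        while (entries.hasMoreElements()) {
            // putIfAbsent keeps the earlier jar, preserving classpath precedence.
            firstJarByResource.putIfAbsent(entries.nextElement().getName(), jar);
        }
    }

    /** Hash lookup; null means no indexed jar contains the resource. */
    JarFile lookup(String resourceName) {
        return firstJarByResource.get(resourceName);
    }

    /** Demo helper: writes a jar containing empty entries with the given names. */
    static Path writeJar(Path jar, String... entryNames) throws IOException {
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jos = new JarOutputStream(out)) {
            for (String name : entryNames) {
                jos.putNextEntry(new JarEntry(name));
                jos.closeEntry();
            }
        }
        return jar;
    }

    private static String fileName(JarFile jar) {
        return Paths.get(jar.getName()).getFileName().toString();
    }

    /** Indexes two jars that both contain a/A.class and reports which jar wins. */
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("jar-cache-sketch");
            Path first = writeJar(dir.resolve("first.jar"), "a/A.class");
            Path second = writeJar(dir.resolve("second.jar"), "a/A.class", "b/B.class");
            try (JarFile j1 = new JarFile(first.toFile());
                 JarFile j2 = new JarFile(second.toFile())) {
                JarEntryCacheSketch cache = new JarEntryCacheSketch();
                cache.index(j1); // classpath order: first.jar, then second.jar
                cache.index(j2);
                return fileName(cache.lookup("a/A.class")) + " "
                        + fileName(cache.lookup("b/B.class"));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a/A.class resolves to first.jar even though second.jar
also contains it, which is the same answer the linear search gives.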

There are several caveats where the cache cannot be used, involving
non-jar URLs on the classpath and the meta index; I explain them in my
last email and in comments in the reference implementation.

---
Regarding modified jars:
- moved/renamed: the file handle is still valid and it doesn't affect
the JVM/classloading
- deleted: the file handle is still valid and it doesn't affect the
JVM/classloading
- modified: the JVM crashes

The first two may not be intuitive, but remember that file handles
point to files, not to paths on the filesystem.
So even though a jar appears renamed in the shell, the java process
has already opened the file and holds the file handle somewhere in the
C implementation of file objects. When the JVM later issues a read
system call on that handle, say to read a class from the jar file, it
all works fine.
For what it's worth, here's a Stack Overflow answer as "source":
http://stackoverflow.com/questions/2028874/what-happens-to-an-open-file-handler-on-linux-if-the-pointed-file-gets-moved-de
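
The rename/delete behaviour can be checked with a few lines of Java.
This is a sketch that assumes a POSIX-style filesystem such as Linux
(other platforms may refuse to rename or delete an open file); it
uses a plain text file rather than a jar, and the class name
OpenHandleSketch is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Shows that an already-open stream survives a rename and a delete (POSIX assumption). */
final class OpenHandleSketch {
    static String readAfterRenameAndDelete() {
        try {
            Path original = Files.createTempFile("handle-sketch", ".txt");
            Files.write(original, "still readable".getBytes(StandardCharsets.UTF_8));
            try (InputStream in = Files.newInputStream(original)) {
                // Rename, then delete, while the stream is still open.
                Path renamed = original.resolveSibling(original.getFileName() + ".renamed");
                Files.move(original, renamed);
                Files.delete(renamed);

                // The handle refers to the file itself, so reading still works.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[64];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```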

---
There is a protected method URLClassLoader#addURL which appends a URL
to the classpath.
People could use reflection to make it public.
Because jars are opened lazily, and the cache is also built lazily
whenever a jar is opened, it doesn't matter if paths are appended:
entries already in the cache keep their precedence.
Regarding people making extensive use of reflection to modify the
order of entries on the classpath, I believe that's irrelevant, as
that's clearly not the intended semantics of
URLClassLoader/URLClassPath.
People who need custom classloading rules create custom classloaders;
that's the purpose of classloaders being extensible.

---
Anyway, I hope this is worth discussing.
I've looked into this in some depth and believe I haven't missed
anything, but if someone knows why it hasn't been or can't be done,
any insight would be appreciated!
Alan from the last email thread said "There was a lot of experiments
and prototypes in the past along these lines" - are the results
public?
He also mentioned improving classloading in Java's upcoming module
system (originally planned for Java 7, currently delayed to Java 9),
but I believe the algorithmic complexity and performance of
URLClassLoader could be improved without complicated changes.

Please let me know what you think, and thanks for your time!

Best regards,
Adrian