proposal to optimise the performance of the Jar utility

Wed May 11 23:51:00 UTC 2011

Hi,
I have an update on the optimisations to date.

In summary, jar can be 3 to 4 times faster and becomes CPU bound on all 4 cores 
of my dev system.

The results are given as CSV below.
I have included tests of the 1.6 and 1.7 code and runtime for comparison.

The optimisations that I have completed are:
1. Increase the output buffer size.
2. Add an option (D) to omit file dates. Date loading was a significant overhead 
before (4) and less so afterwards, but omitting it still gives a measurable 
improvement, and the dates may not be useful in some circumstances (e.g. my use 
cases).
3. Pipeline the scanning for the files with the output, queuing file info via a 
BlockingQueue.
4. Rewrite the scanning to use the FileVisitor.
5. Add a temporary option (Zn) to specify the parallel option used:
   Z0 runs the load of the file in a single thread.
   Z1 runs a parallel load of small files into memory caches, and runs the load 
   of the zip file in another thread.
   Z2 uses a (mostly) parallel version of Zip/JarOutputStream called 
   Zip/JarWriter.
6. Decrease the calls to BufferedOutputStream.write(int) (in Z2 mode) to limit 
the overheads of synchronisation.
7. Modify some Jar internal data structures.
8. Eliminate the double reading of a file in STORED mode for 'small' files (it 
was read once for the CRC and again for the data).
9. Probably a few other tweaks that I have forgotten.
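
A minimal sketch of how items 3 and 4 could fit together, assuming a single 
root directory and one consumer. The class name PipelinedScan, the method scan 
and the sentinel END_OF_SCAN are hypothetical and are not taken from the 
actual patch; the queue capacity of 200 only echoes the arbitrary size 
mentioned later in this thread.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: names are illustrative, not from the jar tool sources.
class PipelinedScan {
    // Sentinel object (compared by identity) that marks the end of the scan.
    private static final Path END_OF_SCAN = Paths.get("");

    /** Scans root on a background thread while the caller consumes the queue. */
    static List<Path> scan(Path root, int queueCapacity) throws InterruptedException {
        BlockingQueue<Path> queue = new ArrayBlockingQueue<>(queueCapacity);

        Thread scanner = new Thread(() -> {
            try {
                Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                    @Override
                    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                            throws IOException {
                        if (attrs.isRegularFile()) {
                            try {
                                queue.put(file); // blocks when the consumer falls behind
                            } catch (InterruptedException e) {
                                Thread.currentThread().interrupt();
                                return FileVisitResult.TERMINATE;
                            }
                        }
                        return FileVisitResult.CONTINUE;
                    }
                });
            } catch (IOException e) {
                // A real tool would report the error; the sketch just stops scanning.
            } finally {
                try {
                    queue.put(END_OF_SCAN);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "jar-scanner");
        scanner.start();

        List<Path> received = new ArrayList<>();
        for (Path p = queue.take(); p != END_OF_SCAN; p = queue.take()) {
            received.add(p); // a real writer would add the entry to the jar here
        }
        scanner.join();
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(scan(root, 200).size() + " files found");
    }
}
```

The bounded queue is the design point: the scanner can run ahead of the writer, 
but only by a fixed number of entries, so memory use stays predictable.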

I have only looked at the create path. I have not modified the update or extract 
paths.

---
Results, in CSV format:
Runtime,code,qualifiers,comment,duration
,,,,
jdk1.6.0_24,1.6 tools, -cf0, java 1.6 rt -cf0,9538
jdk1.6.0_24,1.6 tools, -cf, java 1.6 rt -cf,10218
jdk1.7 dev,1.7 tools, -cf0, java 1.7 rt -cf0,9908
jdk1.7 dev,1.7 tools, -cf, java 1.7 rt -cf,10680
jdk1.7 dev,1.7 dev baseline, -cf0, orig 1.7 rt -cf0,10270
jdk1.7 dev,1.7 dev baseline, -cf, orig 1.7 rt -cf,10713
jdk1.7 dev,1.7 dev, -cf0, project 1.7 rt -cf0,3144
jdk1.7 dev,1.7 dev, -cf0Z0, project 1.7 rt -cf0 parallel 0,3031
jdk1.7 dev,1.7 dev, -cf0DZ0, project 1.7 rt -cf0D parallel 0,2837
jdk1.7 dev,1.7 dev, -cf0DZ1, project 1.7 rt -cf0D parallel 1,2142
jdk1.7 dev,1.7 dev, -cf0DZ2, project 1.7 rt -cf0D parallel 2,2458
jdk1.7 dev,1.7 dev, -cf1, project 1.7 rt -cf1,5418
jdk1.7 dev,1.7 dev, -cf1DZ0, project 1.7 rt -cf1D parallel 0,5420
jdk1.7 dev,1.7 dev, -cf1DZ1, project 1.7 rt -cf1D parallel 1,5319
jdk1.7 dev,1.7 dev, -cf1DZ2, project 1.7 rt -cf1D parallel 2,3260
jdk1.7 dev,1.7 dev, -cf2, project 1.7 rt -cf2,5515
jdk1.7 dev,1.7 dev, -cf3, project 1.7 rt -cf3,5792
jdk1.7 dev,1.7 dev, -cf4, project 1.7 rt -cf4,5862
jdk1.7 dev,1.7 dev, -cf5, project 1.7 rt -cf5,6116
jdk1.7 dev,1.7 dev, -cf6, project 1.7 rt -cf6,6328
jdk1.7 dev,1.7 dev, -cf6DZ0, project 1.7 rt -cf6D parallel 0,6048
jdk1.7 dev,1.7 dev, -cf6DZ1, project 1.7 rt -cf6D parallel 1,6048
jdk1.7 dev,1.7 dev, -cf6DZ2, project 1.7 rt -cf6D parallel 2,3405
jdk1.7 dev,1.7 dev, -cf7, project 1.7 rt -cf7,6497
jdk1.7 dev,1.7 dev, -cf8, project 1.7 rt -cf8,6375
jdk1.7 dev,1.7 dev, -cf9, project 1.7 rt -cf9,6388
jdk1.7 dev,1.7 dev, -cf9DZ0, project 1.7 rt -cf9D parallel 0,6422
jdk1.7 dev,1.7 dev, -cf9DZ1, project 1.7 rt -cf9D parallel 1,6224
jdk1.7 dev,1.7 dev, -cf9DZ2, project 1.7 rt -cf9D parallel 2,3482
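
The "3 to 4 times faster" summary can be checked against a few of the rows 
above. This is only a sketch: the class name JarSpeedup is hypothetical, the 
durations are assumed to be milliseconds, and the compressed comparisons are 
not level-for-level (the baseline row uses the default level).

```java
// Hypothetical helper: computes speedup ratios from durations in the CSV above.
class JarSpeedup {
    /** Ratio of baseline time to optimised time; above 1.0 means faster. */
    static double speedup(long baselineMillis, long optimisedMillis) {
        return (double) baselineMillis / optimisedMillis;
    }

    public static void main(String[] args) {
        // Durations copied from the CSV rows (unit assumed to be milliseconds).
        long baselineStored = 10270;   // 1.7 dev baseline, -cf0
        long bestStored = 2142;        // 1.7 dev, -cf0DZ1
        long baselineDeflated = 10713; // 1.7 dev baseline, -cf
        long level1Parallel = 3260;    // 1.7 dev, -cf1DZ2
        long level6Parallel = 3405;    // 1.7 dev, -cf6DZ2

        System.out.printf("-cf0 vs -cf0DZ1: %.2fx%n", speedup(baselineStored, bestStored));
        System.out.printf("-cf  vs -cf1DZ2: %.2fx%n", speedup(baselineDeflated, level1Parallel));
        System.out.printf("-cf  vs -cf6DZ2: %.2fx%n", speedup(baselineDeflated, level6Parallel));
    }
}
```

With these rows the ratios come out at roughly 4.8, 3.3 and 3.1 respectively.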

________________________________
From: "mike.skells at talk21.com" <mike.skells at talk21.com>
To: Mike Duigou <mike.duigou at oracle.com>
Cc: core-libs-dev Libs <core-libs-dev at openjdk.java.net>
Sent: Thursday, 14 April, 2011 0:55:04
Subject: Re: proposal to optimise the performance of the Jar utility

Hi Mike,
Apart from compression 0, it doesn't make a lot of difference as far as I can 
see. 

As I posted earlier, converting the file name to lower case seems (according to 
the NB profiler) to take more time than the compression.

Bearing in mind that I have tweaked my system so that I am accessing from cache 
only, so the elapsed time is pretty much the single core time, here are the 
results.

These are the results that I re-ran today (hence the delay), 
in CSV format:
---
,cf,cf0,cf1,cf2,cf3,cf4,cf5,cf6,cf7,cf8,cf9
1.6.0_24 (1.6 vm),10255,9751,,,,,,,,,
java 1.7 release candidate,10498,9663,,,,,,,,,
orig ,10481,9707,,,,,,,,,
buffer CRC32 data,,7398,9490,9641,9565,10038,10142,10366,10310,10536,10440
---
This run shows a 25% improvement in cf0 times (9703 ms vs 7398 ms when the 
tweaks are applied).
cf1 (compression level 1) takes 9490 ms and cf9 takes 10440 ms, so not a huge 
difference.
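
For reference, the "buffer CRC32 data" tweak can be sketched as follows, 
assuming the file content has already been read into a byte array once. The 
class name StoredEntryWriter and the method putStored are hypothetical; only 
the java.util.zip calls are the real API. A STORED entry needs its size and 
CRC set before putNextEntry, which is why the unbuffered path reads the file 
twice.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: one read of the data serves both the CRC and the write.
class StoredEntryWriter {
    /** Writes data as an uncompressed (STORED) entry from an in-memory buffer. */
    static void putStored(ZipOutputStream out, String name, byte[] data) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length); // checksum computed from the buffer

        ZipEntry entry = new ZipEntry(name);
        entry.setMethod(ZipEntry.STORED);
        entry.setSize(data.length);
        entry.setCompressedSize(data.length);
        entry.setCrc(crc.getValue());

        out.putNextEntry(entry);
        out.write(data, 0, data.length);  // same buffer written, no second read
        out.closeEntry();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello jar".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            putStored(zip, "hello.txt", data);
        }
        System.out.println("archive size: " + bytes.size() + " bytes");
    }
}
```

Because JarOutputStream extends ZipOutputStream, the same method would accept 
either stream.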

Regards

Mike

________________________________
From: Mike Duigou <mike.duigou at oracle.com>
To: mike.skells at talk21.com
Cc: Xueming Shen <xueming.shen at oracle.com>; core-libs-dev Libs 
<core-libs-dev at openjdk.java.net>
Sent: Wednesday, 13 April, 2011 23:00:26
Subject: Re: proposal to optimise the performance of the Jar utility

Mike, can you share the results of performance testing at various compression 
levels? Is there much difference between the levels or an apparent "sweet spot"?

For low hanging fruit for jdk 7 it might be worth considering raising the 
default compression level from 5 to 6 (the zlib default). Raising the level from 
5 to 6 entails (by today's standards) very modest increases in the amount of 
memory and effort used (16Kb additional buffer space for compression). In 
general zlib reflects size choices that are almost 20 years old and it may be of 
no measurable benefit to be using the lower compression levels.
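
One rough way to look for a "sweet spot" is to deflate the same buffer at each 
level and compare size and time. The sketch below assumes a synthetic, highly 
repetitive input, so its numbers say nothing about rt.jar itself; the class 
name LevelProbe and the method deflatedSize are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical probe: compares deflate output size and time across levels 0-9.
class LevelProbe {
    /** Returns the number of bytes produced by deflating input at the given level. */
    static int deflatedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] chunk = new byte[8192];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(chunk); // output is discarded, only counted
            }
            return total;
        } finally {
            deflater.end(); // release the native zlib resources
        }
    }

    public static void main(String[] args) {
        // Synthetic input; real class files will compress differently.
        byte[] pattern = "package java.util;".getBytes(StandardCharsets.US_ASCII);
        byte[] input = new byte[1 << 20];
        for (int i = 0; i < input.length; i++) {
            input[i] = pattern[i % pattern.length];
        }
        for (int level = 0; level <= 9; level++) {
            long start = System.nanoTime();
            int size = deflatedSize(input, level);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.printf("level %d: %d bytes, %d us%n", level, size, micros);
        }
    }
}
```

A single pass like this is noisy; repeating each level several times and 
discarding the first run, as in the jar tests in this thread, would give 
steadier timings.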

Mike (also) 

On Apr 12 2011, at 17:48 , mike.skells at talk21.com wrote:

> Hi Sherman,
> I have had a quick look at the current code to see what 'low hanging fruit' 
> there is. I appreciate that parallelizing the command in its entirety may not 
> be feasible for the first release.
> 
> The tests that I have run are jarring the content of the 1.7 rt.jar with 
> varying compression levels. Each jar is run as a child process 6 times and the 
> average of the last 5 is taken. Tests are pretty much CPU bound on a single 
> core.
> 
> 1. The performance of the cf0 (create file with no compression) path can be 
> improved for the general case if the file is buffered.  
> In my test env (Windows 7 64-bit) this equates to a 17% time performance 
> improvement in my tests. In the existing code the file is read twice, once to 
> calculate the CRC and once to write the file to the Jar file. This change 
> would buffer a single small file at a time (size < 100K).
> 
> 2. It is also a trivial fix to allow different compression levels, rather than 
> just stored and default.
> 
> After that it is harder to gain performance improvements without structural 
> change or more discussion
> 
> 3. The largest saving after that is in the management of the 'entries' Set, as 
> the hashcode of the File is expensive (this may not apply to other 
> filesystems). 
> The management of this map seems to account for more CPU than the Deflater!
> I cannot see the reason for this data being collected. I am probably missing 
> the obvious ...
> 
> 4. After that there is just the parallelisation of the jar utility and the 
> underlying stream
> 
> What is the best way to proceed?
> 
> regards
> 
> Mike
> 
> 
> 
> ________________________________
> From: Xueming Shen <xueming.shen at oracle.com>
> To: core-libs-dev at openjdk.java.net
> Sent: Wednesday, 6 April, 2011 19:04:25
> Subject: Re: proposal to optimise the performance of the Jar utility
> 
> Hi Mike,
> 
> We are in the final stage of the JDK7 development, so work like this is 
> unlikely to get in at the last minute. I have filed a CR/RFE to track this 
> issue; we can use this CR as the starting point for the discussion and target 
> some jdk7 update releases or JDK8.
> 
> 7034403: proposal to optimise the performance of the Jar utility
> 
> Thanks,
> Sherman
> 
> 
> On 04/05/2011 04:42 PM, mike.skells at talk21.com wrote:
>> Hi,
>> Not sure if this is too late for Java 7 but I have made some optimisations for 
>> a client to improve the performance of the jar utility in their environment, 
>> and would like to promote them into the main code base.
>> 
>> The optimisations that I have performed are:
>> 
>> 1. Allowing the Jar utility to have other compression levels (currently it
>> allows default (5) only)
>> 2. Multi-threading, and pipelining the file information and access
>> 3. Multi-threading the compression and file writing
>> 
>> A little background:
>> As part of the development process where I work, they regularly Jar the 
>> content of the working projects as part of the distribution to remote systems. 
>> This is a large and complex code base of 6 million LOC and growing. The Jar 
>> file ends up compressed to approx 100 MB. Uncompressed, the jar size is approx 
>> 245 MB, about 4-5 times the size of rt.jar.
>> 
>> I was looking at ways to improve the performance as this activity occurs 
>> several times a day for dozens of developers.
>> 
>> In essence, when compressing a new jar file the jar utility is single threaded
>> and staged. Forgive me if this is an oversimplification.
>> 
>> First it works out all of the files that are specified, buffering the file
>> names (IO bound).
>> Then it iterates through the files, and for each file it loads the file
>> information and then the file content, sending it to a JarOutputStream (CPU
>> bound or IO bound depending on the IO speed).
>> 
>> The JarOutputStream has a compression of 0 (just store) or 5 (the default), 
>> and the jar writing is single threaded by the design of the JarOutputStream.
>> 
>> The process of creating a Jar took about 20 seconds on Windows with the 
>> help of an SSD, and considerably longer without one, and was CPU bound to one 
>> CPU core.
>> 
>> ----
>> The changes that I made were:
>> 1. Allow different compression levels (for us a compression level of 1 
>> increases the file size of the Jar to 110 MB but reduces the CPU load in 
>> compression to approx 30% of what it was (rough estimate)).
>> 2. Pipelining the file access:
>> 2.1    One thread is started for each file root (-C on the Jar command line), 
>> which scans for files and places the file information into a blocking 
>> queue (Q1), which I set to an arbitrary size of 200 items.
>> 2.2    One thread pool of 10 threads reads the file information from the 
>> queue (Q1) and buffers the file content up to a specified size (again I 
>> specified an arbitrary size limit of 25K for a file), and places the buffered 
>> content into a queue (Q2) (again an arbitrary size of 10 items).
>> 2.3    One thread takes the file content from Q2, compresses it or checksums
>> it, and adds it to the JarOutputStream. This process is single threaded due 
>> to the design of the JarOutputStream.
>> 
>> Some other minor performance gain occurred by increasing the buffer on the
>> output stream to reduce the IO load.
>> 
>> The end result is that the process takes approx 5 seconds in the same
>> configuration.
>> 
>> The above has been in use in a production configuration for a few months now.
>> 
>> As a home project I have completed some enhancements to the JarOutputStream 
>> and produced a JarWriter that allows multiple threads to work concurrently 
>> deflating or calculating checksums, which seems to test OK for the test cases 
>> that I have generated, and successfully loads my quad core home dev machine on 
>> all cores.
>> Each thread allocates a buffer, and the thread compresses a file into the
>> buffer, only blocking other threads when the buffer is written to the output
>> (which is after the compression is complete, unless the file is too large to
>> compress).
>> 
>> This JarWriter is not API compatible with the JarOutputStream; it is not a
>> stream. It allows the programmer to write a record based on the file 
>> information and an input stream, and is thread-safe. It is not a drop-in 
>> replacement for JarOutputStream.
>> I am not an expert in the Zip file format, but much of the code from
>> ZipOutputStream is unchanged, just restructured.
>> ---
>> I did think that there is some scope for improvement that I have not looked 
>> at:
>> a. thresholding of file size for compression (very small files don't compress
>> well)
>> b. some file types don't compress well (e.g. png, jpeg, as they have been
>> compressed already)
>> c. using NIO to parallelise the loading of the file information or content
>> d. some pre-charging of the deflater dictionary (e.g. a class file contains 
>> the strings of the class name and packages), but this would make the format
>> incompatible with zip, and require changes to the JVM to be useful, and is a
>> long way from my comfort zone, or skill set. This would reduce the file size.
>> 
>> --
>> What is the view of the readers? Is this something, or at least some parts of
>> this, that could be incorporated into Java 7, or is this too late in the dev 
>> cycle?
>> 
>> regards
>> 
>> Mike