ZipOutputStream & directories
Lance Andersen
lance.andersen at oracle.com
Mon Mar 6 19:16:30 UTC 2023
Hi Eirik,
Please see below
On Mar 6, 2023, at 1:43 PM, Eirik Bjørsnøs <eirbjo at gmail.com<mailto:eirbjo at gmail.com>> wrote:
Hi,
I noticed that ZipOutputStream violates APPNOTE.txt when writing directory entries:
You are referring to section:
------------
4.3.8 File data
Immediately following the local header for a file
SHOULD be placed the compressed or stored data for the file.
If the file is encrypted, the encryption header for the file
SHOULD be placed after the local header and before the file
data. The series of [local file header][encryption header]
[file data][data descriptor] repeats for each file in the
.ZIP archive.
Zero-byte files, directories, and other file types that
contain no content MUST NOT include file data.
——————
I don’t think changing the default will cause any harm but I have been surprised by even the most trivial change before.
The jar tool already addresses this in Main.java:addFile() on a quick glance.
So I guess I do not see a huge downside, however we might want to look at a property to adjust in the unlikely event the unexpected occurs?
Alan thoughts?
Zero-byte files, directories, and other file types that
contain no content MUST NOT include file data.
Nobody seems to have complained about this violation (I searched JBS), so in itself it might not be worth pursuing a fix. However, in addition to breaking the specification, this also causes suboptimal performance both on the write and read paths.
Before I go ahead and suggest any particular solution, I wanted to share some analysis:
ZipOutputStream writes directories as any other entry, so by default it ends up writing a LOC header with the DEFLATED method and the data descriptor bit set. This is followed by a final, empty DEFLATE block (0x0300) and then a signed data descriptor with the CRC, csize (2) and size (0), each 4 bytes:
------ Local File Header ------
000000 signature 0x04034b50
000004 version 20
000006 flags 0x0808
000008 method 8 Deflated
000010 time 0x9812 19:00:36
000012 date 0x5666 2023-03-06
000014 crc 0x00000000
000018 csize 0
000022 size 0
000026 nlen 9
000028 elen 0
000030 name 9 bytes 'META-INF/'
------ File Data ------
000039 data 2 bytes
------ Data Descriptor ------
000041 signature 0x08074b50
000045 crc 0x00000000
000049 csize 2
000053 size 0
In total, this means that an extra 18 bytes is output, which is a waste of storage space and IO cycles while writing and reading/parsing the unnecessary headers and data. If the directory entry has the STORED method and the sizes and crc is set to 0, then the file data and data descriptors are skipped and we reclaim 18 bytes.
The use of DEFLATE has an adverse negative impact on performance, since native ZLIB code needs to be called to produce and read (in ZipInputStream) the empty DEFLATE block. Worse, Deflater.reset and Inflater.reset are called, which seems to incur a significant overhead.
In fact, when configuring the STORED method and explicitly setting csize, size and crc to 0, we see considerable benchmark improvements writing and reading directory entries:
DEFLATED:
Benchmark (size) Mode Cnt Score Error Units
ZipEmptyStreams.readDirStream 512 avgt 15 0.278 ± 0.005 ms/op
ZipEmptyStreams.readDirStream 1024 avgt 15 0.550 ± 0.006 ms/op
ZipEmptyStreams.writeDirStreams 512 avgt 15 2.379 ± 0.043 ms/op
ZipEmptyStreams.writeDirStreams 1024 avgt 15 4.845 ± 0.087 ms/op
STORED:
Benchmark (size) Mode Cnt Score Error Units
ZipEmptyStreams.readDirStream 512 avgt 15 0.082 ± 0.001 ms/op
ZipEmptyStreams.readDirStream 1024 avgt 15 0.176 ± 0.015 ms/op
ZipEmptyStreams.writeDirStream 512 avgt 15 0.233 ± 0.021 ms/op
ZipEmptyStreams.writeDirStream 1024 avgt 15 0.460 ± 0.028 ms/op
Possible actions:
- Do nothing: Nobody complained so far, so if is ain't broke, don't fix it.
- Update Javadocs of ZipInputStream.putNextEntry to recommend that users should use STORED and configure sizes & crc for directory entries for correctness and performance
- Change the default compression method for directories to STORED and set sizes & crc to 0. This would add the risk of changing behavior of existing code.
For context, 7% of entries in the dependencies of Spring Petclinic are directories.
Any thoughts or input?
Thanks,
Eirik.
[cid:E1C4E2F0-ECD0-4C9D-ADB4-B16CA7BCB7FC at home]
Lance Andersen | Principal Member of Technical Staff | +1.781.442.2037
Oracle Java Engineering
1 Network Drive
Burlington, MA 01803
Lance.Andersen at oracle.com<mailto:Lance.Andersen at oracle.com>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/core-libs-dev/attachments/20230306/c217cb20/attachment-0001.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: oracle_sig_logo.gif
Type: image/gif
Size: 658 bytes
Desc: oracle_sig_logo.gif
URL: <https://mail.openjdk.org/pipermail/core-libs-dev/attachments/20230306/c217cb20/oracle_sig_logo-0001.gif>
More information about the core-libs-dev
mailing list