RFD: Reorganize ZipCoder such that UTF8 is handled by the base class
Eirik Bjørsnøs
eirbjo at gmail.com
Wed Jan 28 09:26:44 UTC 2026
Hi,
Bringing this up on core-libs-dev so that the motivation can be
explained and discussed here, and any future PR can focus on the actual
code changes.
Summary:
Reorganize the ZipCoder class hierarchy to let the base class handle UTF-8
and a subclass handle arbitrary Charsets. This makes the design better
match the ZIP specification and how ZIP files are used in the real world,
and additionally has some benefits in code quality and performance.
Motivation:
The ZipCoder class has been central to many ZipFile performance
improvements in recent years. Many optimizations are encoding-specific and
encapsulating these concerns makes a lot of sense.
Currently, the base ZipCoder class supports any given Charset. A subclass,
UTF8ZipCoder, then provides performance optimizations specific to UTF-8.
However, real-world use of the ZipFile API defaults to UTF-8. The ZIP
specification long ago introduced a flag to explicitly indicate that
entries are encoded using UTF-8. The JAR specification has mandated UTF-8
since the beginning. Any use of non-UTF-8 ZIP files is increasingly niche
and belongs in the legacy zone.
The current UTF8ZipCoder is stateless and documented as thread safe, while
the base class ZipCoder is not. As a subclass of ZipCoder, UTF8ZipCoder
does however inherit the CharsetEncoder and CharsetDecoder state fields
from its superclass, and it needs to pass a UTF-8 Charset to its parent
without really using it. This makes state and thread safety harder to
reason about.
Since UTF8ZipCoder is always needed, the JVM must always load it along with
the base class ZipCoder. Apart from loading an extra class, this prevents
the JVM from seeing calls to ZipCoder methods as monomorphic.
A draft implementation of this change indicates a ~3% performance win on
ZipFile lookups in the ZipFileGetEntry benchmark, probably explained by the
compiler seeing only one ZipCoder class being loaded.
Solution:
Switch the class hierarchy of ZipCoder around such that the base class
handles UTF-8. Introduce a new subclass CharsetZipCoder to handle legacy
non-UTF-8 ZIP files. Move the Charset, CharsetEncoder and CharsetDecoder
fields to this subclass. Update code comments to reflect the changes.
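To make the shape of this concrete, here is a rough sketch of the inverted
hierarchy. This is not the draft implementation: the Sketch* class names,
the get() factory and the simplified error handling are made up for
illustration, and the real ZipCoder has more methods than shown. Only the
idea of a stateless UTF-8 base class and a stateful Charset subclass comes
from the proposal above.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the base class: handles UTF-8 only and has
// no instance fields, so a single instance can be shared across threads.
class SketchZipCoder {
    static final SketchZipCoder UTF8 = new SketchZipCoder();

    // Hypothetical factory: only non-UTF-8 charsets need the subclass.
    static SketchZipCoder get(Charset cs) {
        return cs.equals(StandardCharsets.UTF_8)
                ? UTF8
                : new SketchCharsetZipCoder(cs);
    }

    boolean isUTF8() {
        return true;
    }

    // Simplification: malformed input is replaced here rather than rejected.
    String toString(byte[] ba, int off, int len) {
        return new String(ba, off, len, StandardCharsets.UTF_8);
    }

    byte[] getBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}

// Hypothetical stand-in for CharsetZipCoder: owns the Charset and the
// stateful encoder/decoder, so it is not thread safe.
final class SketchCharsetZipCoder extends SketchZipCoder {
    private final Charset cs;
    private CharsetDecoder dec; // created lazily
    private CharsetEncoder enc; // created lazily

    SketchCharsetZipCoder(Charset cs) {
        this.cs = cs;
    }

    @Override
    boolean isUTF8() {
        return false;
    }

    @Override
    String toString(byte[] ba, int off, int len) {
        try {
            return decoder().decode(ByteBuffer.wrap(ba, off, len)).toString();
        } catch (CharacterCodingException x) {
            throw new IllegalArgumentException(x);
        }
    }

    @Override
    byte[] getBytes(String s) {
        try {
            ByteBuffer bb = encoder().encode(CharBuffer.wrap(s));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException x) {
            throw new IllegalArgumentException(x);
        }
    }

    private CharsetDecoder decoder() {
        if (dec == null) {
            dec = cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
        }
        return dec;
    }

    private CharsetEncoder encoder() {
        if (enc == null) {
            enc = cs.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
        }
        return enc;
    }
}
```

In this arrangement a program that only ever opens UTF-8 ZIP files never
touches the subclass, and the base class carries no unused state.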
Risks:
This should be a pure refactoring, mostly moving code around. Most changes
can be performed in-place, such that a side-by-side review will mostly
reflect indentation changes. We have good test coverage for UTF-8 and
non-UTF-8 ZIP files to help us catch regressions.
If I see support for this proposal, I'll be happy to submit a PR with the
actual changes.
Cheers,
Eirik :-)