RFD: Reorganize ZipCoder such that UTF8 is handled by the base class
Chen Liang
chen.l.liang at oracle.com
Thu Jan 29 18:27:24 UTC 2026
Hello Eirik,
I strongly agree with your proposal. I see such a change as low risk, given that ZipCoder is an internal class.
Regards,
Chen
________________________________
From: core-libs-dev <core-libs-dev-retn at openjdk.org> on behalf of Eirik Bjørsnøs <eirbjo at gmail.com>
Sent: Wednesday, January 28, 2026 3:26 AM
To: core-libs-dev <core-libs-dev at openjdk.org>
Subject: RFD: Reorganize ZipCoder such that UTF8 is handled by the base class
Hi,
Bringing this up on core-libs-dev such that the motivation can be explained/discussed here and any future PR can focus on actual code changes.
Summary:
Reorganize the ZipCoder class hierarchy to let the base class handle UTF-8 and the subclass handle arbitrary Charsets. This makes the design better match the ZIP specification and how ZIP files are used in the real world, and additionally has some benefits in code quality and performance.
Motivation:
The ZipCoder class has been central to many ZipFile performance improvements in recent years. Many optimizations are encoding-specific and encapsulating these concerns makes a lot of sense.
Currently, the base ZipCoder instance supports any given Charset. Then, a subclass UTF8ZipCoder provides higher performance optimizations specific to UTF-8.
However, real-world use of the ZipFile API defaults to UTF-8. The ZIP specification long ago introduced a flag to explicitly indicate that entries are encoded using UTF-8. The JAR specification has mandated UTF-8 since the beginning. Any use of non-UTF-8 ZIP files is increasingly niche and belongs in the legacy zone.
The current UTF8ZipCoder is stateless and documented as thread safe, while the base class ZipCoder is not. As a subclass of ZipCoder, UTF8ZipCoder does however inherit the CharsetEncoder and CharsetDecoder state fields from its superclass, and it needs to pass a UTF-8 Charset to its parent without really using it. This makes state and thread safety harder to reason about.
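To make the state problem concrete, here is a minimal sketch of the current layout. The class names LegacyCoder and LegacyUtf8Coder and the decode method are hypothetical stand-ins, not the JDK's actual code; the sketch only models the shape described above, where a stateless UTF-8 subclass still carries the mutable fields of its parent.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical model of the current layout: the base class owns the charset state.
class LegacyCoder {
    final Charset charset;
    CharsetDecoder decoder; // mutable and lazily created, so not thread safe

    LegacyCoder(Charset charset) {
        this.charset = charset;
    }

    String decode(byte[] bytes) {
        if (decoder == null) {
            decoder = charset.newDecoder();
        }
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Invalid encoding", e);
        }
    }
}

// Hypothetical model of the UTF-8 subclass: stateless in practice,
// yet it inherits the decoder field and must hand a Charset to its parent.
final class LegacyUtf8Coder extends LegacyCoder {
    LegacyUtf8Coder() {
        super(StandardCharsets.UTF_8); // required by the superclass, not used here
    }

    @Override
    String decode(byte[] bytes) {
        // Never touches the inherited decoder field.
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

In this model the inherited decoder field stays null for the UTF-8 coder, but a reader has to inspect every overridden method to be sure of that.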
Since UTF8ZipCoder is always needed, the JVM must always load it along with the base class ZipCoder. Apart from loading an extra class, this prevents the JVM from seeing calls to ZipCoder methods as monomorphic.
A draft implementation of this change indicates a ~3% performance win on ZipFile lookups in the ZipFileGetEntry benchmark, probably explained by the compiler seeing only one ZipCoder class being loaded.
Solution:
Switch the class hierarchy of ZipCoder around such that the base class handles UTF-8. Introduce a new subclass CharsetZipCoder to handle legacy non-UTF-8 ZIP files. Move the Charset, CharsetEncoder, and CharsetDecoder fields to this subclass. Update code comments to reflect the changes.
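The inverted hierarchy could look roughly like the sketch below. This is an illustration under stated assumptions, not the draft implementation: the names SketchZipCoder and SketchCharsetZipCoder, the get factory, and the encode/decode methods are all hypothetical, and the UTF-8 path here simply uses String's built-in conversion (which replaces malformed input rather than rejecting it).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the proposed base class: UTF-8 only, no fields.
class SketchZipCoder {
    static final SketchZipCoder UTF8 = new SketchZipCoder();

    // UTF-8 callers share one stateless instance; other charsets get the subclass.
    static SketchZipCoder get(Charset charset) {
        return StandardCharsets.UTF_8.equals(charset)
                ? UTF8
                : new SketchCharsetZipCoder(charset);
    }

    boolean isUTF8() {
        return true;
    }

    String decode(byte[] bytes, int off, int len) {
        // Assumption: malformed input is replaced here, not reported.
        return new String(bytes, off, len, StandardCharsets.UTF_8);
    }

    byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}

// Hypothetical stand-in for the proposed CharsetZipCoder: owns all charset state.
final class SketchCharsetZipCoder extends SketchZipCoder {
    private final Charset charset;
    private CharsetDecoder decoder; // lazily created, not thread safe
    private CharsetEncoder encoder; // lazily created, not thread safe

    SketchCharsetZipCoder(Charset charset) {
        this.charset = charset;
    }

    private CharsetDecoder decoder() {
        if (decoder == null) {
            decoder = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
        }
        return decoder;
    }

    private CharsetEncoder encoder() {
        if (encoder == null) {
            encoder = charset.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
        }
        return encoder;
    }

    @Override
    boolean isUTF8() {
        return false;
    }

    @Override
    String decode(byte[] bytes, int off, int len) {
        try {
            return decoder().decode(ByteBuffer.wrap(bytes, off, len)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Invalid entry name encoding", e);
        }
    }

    @Override
    byte[] encode(String s) {
        try {
            ByteBuffer bb = encoder().encode(CharBuffer.wrap(s));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Unmappable entry name", e);
        }
    }
}
```

With this shape, a program that only ever opens UTF-8 ZIP files never needs the subclass, and the base class has no fields whose thread safety needs to be reasoned about.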
Risks:
This should be a pure refactoring, mostly moving code around. Most changes can be performed in-place, such that side-by-side review will mostly reflect indentation changes. We have good test coverage for UTF-8 and non-UTF-8 ZIP files to help us catch regressions.
If I see support for this proposal, I'll be happy to submit a PR with the actual changes.
Cheers,
Eirik :-)