RFD: Reorganize ZipCoder such that UTF8 is handled by the base class
Eirik Bjørsnøs
eirbjo at gmail.com
Sat Jan 31 18:40:19 UTC 2026
Thank you Chen!
Filed https://bugs.openjdk.org/browse/JDK-8376842 to track this enhancement.
Eirik.
On Thu, Jan 29, 2026 at 7:27 PM Chen Liang <chen.l.liang at oracle.com> wrote:
> Hello Eirik,
> I strongly agree with your proposal. I see such a change has low risk
> given ZipCoder is an internal class.
>
> Regards,
> Chen
>
> ------------------------------
> *From:* core-libs-dev <core-libs-dev-retn at openjdk.org> on behalf of Eirik
> Bjørsnøs <eirbjo at gmail.com>
> *Sent:* Wednesday, January 28, 2026 3:26 AM
> *To:* core-libs-dev <core-libs-dev at openjdk.org>
> *Subject:* RFD: Reorganize ZipCoder such that UTF8 is handled by the base
> class
>
> Hi,
>
> Bringing this up on core-libs-dev such that the motivation can be
> explained/discussed here and any future PR can focus on actual code changes.
>
> Summary:
>
> Reorganize the ZipCoder class hierarchy to let the base class handle UTF-8
> and the subclass handle arbitrary Charsets. This makes the design better
> match the ZIP specification and how ZIP files are used in the real world,
> and it additionally has some benefits for code quality and performance.
>
> Motivation:
>
> The ZipCoder class has been central to many ZipFile performance
> improvements in recent years. Many optimizations are encoding-specific and
> encapsulating these concerns makes a lot of sense.
>
> Currently, the base ZipCoder instance supports any given Charset. Then, a
> subclass UTF8ZipCoder provides performance optimizations specific to
> UTF-8.
>
> However, real-world use of the ZipFile API defaults to UTF-8. The ZIP
> specification long ago introduced a flag to explicitly indicate that
> entries are encoded using UTF-8. The JAR specification has mandated UTF-8
> since the beginning. Any use of non-UTF-8 ZIP files is increasingly niche
> and belongs in the legacy zone.
>
> The current UTF8ZipCoder is stateless and documented as thread safe, while
> the base class ZipCoder is not. As a subclass of ZipCoder, UTF8ZipCoder does
> however inherit CharsetEncoder and CharsetDecoder state fields from its
> superclass and it needs to pass a UTF-8 Charset to its parent, without
> really using it. This makes state and thread safety harder to reason about.
>
> Since UTF8ZipCoder is always needed, the JVM must always load it along
> with the base class ZipCoder. Apart from loading an extra class, this
> prevents the JVM from seeing calls to ZipCoder methods as monomorphic.
>
> A draft implementation of this change indicates a ~3% performance win on
> ZipFile lookups in ZipFileGetEntry, probably explained by the compiler
> seeing only one ZipCoder class being loaded.
>
> Solution:
>
> Switch the class hierarchy of ZipCoder around such that the base class
> handles UTF-8. Introduce a new subclass CharsetZipCoder to handle legacy
> non-UTF-8 ZIP files. Move the Charset, CharsetEncoder, CharsetDecoder fields
> to this subclass. Update code comments to reflect the changes.
>
> Risks:
>
> This should be a pure refactoring, mostly moving code around. Most changes
> can be performed in-place, such that side-by-side review will mostly
> reflect indentation changes. We have good test coverage for UTF-8 and
> non-UTF-8 ZIP files to help us catch regressions.
>
> If I see support for this proposal, I'll be happy to submit a PR with the
> actual changes.
>
> Cheers,
> Eirik :-)
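
For readers skimming the thread, the hierarchy described under "Solution"
above could look roughly like the sketch below. This is an illustration
only, not the JDK code and not the draft implementation: the outer class
ZipCoderSketch, the method names (get, toString, getBytes, isUTF8) and
the lazy encoder/decoder creation are assumptions made for the example.
Only the names ZipCoder and CharsetZipCoder, and the idea of moving the
Charset, CharsetEncoder and CharsetDecoder fields into the subclass, come
from the proposal.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper class so the sketch compiles on its own.
public class ZipCoderSketch {

    /** Base class: handles UTF-8 only and has no fields, so it is stateless. */
    static class ZipCoder {
        static final ZipCoder UTF8 = new ZipCoder();

        /** Hypothetical factory: shared instance for UTF-8, subclass otherwise. */
        static ZipCoder get(Charset cs) {
            return StandardCharsets.UTF_8.equals(cs) ? UTF8 : new CharsetZipCoder(cs);
        }

        // Simplification: new String(...) replaces malformed input instead of
        // reporting it; a real implementation may want to reject such input.
        String toString(byte[] ba, int off, int len) {
            return new String(ba, off, len, StandardCharsets.UTF_8);
        }

        byte[] getBytes(String s) {
            return s.getBytes(StandardCharsets.UTF_8);
        }

        boolean isUTF8() {
            return true;
        }
    }

    /** Subclass: owns all Charset-specific state; not thread safe. */
    static final class CharsetZipCoder extends ZipCoder {
        private final Charset cs;
        private CharsetDecoder dec; // created lazily on first decode
        private CharsetEncoder enc; // created lazily on first encode

        CharsetZipCoder(Charset cs) {
            this.cs = cs;
        }

        private CharsetDecoder decoder() {
            if (dec == null) {
                dec = cs.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT);
            }
            return dec;
        }

        private CharsetEncoder encoder() {
            if (enc == null) {
                enc = cs.newEncoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT);
            }
            return enc;
        }

        @Override
        String toString(byte[] ba, int off, int len) {
            try {
                return decoder().decode(ByteBuffer.wrap(ba, off, len)).toString();
            } catch (CharacterCodingException e) {
                throw new IllegalArgumentException(e);
            }
        }

        @Override
        byte[] getBytes(String s) {
            try {
                ByteBuffer bb = encoder().encode(CharBuffer.wrap(s));
                byte[] out = new byte[bb.remaining()];
                bb.get(out);
                return out;
            } catch (CharacterCodingException e) {
                throw new IllegalArgumentException(e);
            }
        }

        @Override
        boolean isUTF8() {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "caf\u00e9.txt";
        ZipCoder utf8 = ZipCoder.get(StandardCharsets.UTF_8);
        ZipCoder latin1 = ZipCoder.get(StandardCharsets.ISO_8859_1);
        System.out.println(utf8.getBytes(name).length);
        System.out.println(latin1.getBytes(name).length);
    }
}
```

In this shape, a process that only ever opens UTF-8 ZIP files never needs
the subclass at all, which is the situation the motivation section ties
to the monomorphic-call argument.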