RFR: 8347712: IllegalStateException on multithreaded ZipFile access with non-UTF8 charset [v2]

Eirik Bjørsnøs eirbjo at openjdk.org
Tue Mar 11 16:08:56 UTC 2025


On Tue, 11 Mar 2025 14:46:40 GMT, Jaikiran Pai <jpai at openjdk.org> wrote:

>> Can I please get a review of this change which proposes to fix an issue `java.util.zip.ZipFile` which would cause failures when multiple instances of `ZipFile` using non-UTF8 `Charset` were operating against the same underlying ZIP file? This addresses https://bugs.openjdk.org/browse/JDK-8347712.
>> 
>> ZIP file specification allows for ZIP entries to mark a `UTF-8` flag to indicate that the entry name and comment are encoded using UTF8. A `java.util.zip.ZipFile` can be constructed by passing it a `Charset`. This `Charset` (which defaults to UTF-8) gets used for decoding entry names and comments for non-UTF8 entries.
>> 
>> The internal implementation of `ZipFile` uses a `ZipCoder` (backed by `java.nio.charset.CharsetEncoder/CharsetDecoder` instance) for the given `Charset`. Except for UTF8 `ZipCoder`, other `ZipCoder`s are not thread safe.
>> 
>> The internal implementation of `ZipFile` maintains a cache of `ZipFile$Source`. A `Source` corresponds to the underlying ZIP file and during construction, uses a `ZipCoder` for parsing the ZIP entries and once constructed holds on to the parsed ZIP structure. Multiple instances of a `ZipFile` which all correspond to the same ZIP file on the filesystem, share a single instance of `Source` (after the `Source` has been constructed and cached). Although `ZipFile` instances aren't expected to be thread-safe, the fact that multiple different instances of `ZipFile` could be sharing the same instance of `Source` in concurrent threads, mandates that the `Source` must be thread-safe.
>> 
>> In Java 15, we did a performance optimization through https://bugs.openjdk.org/browse/JDK-8243469. As part of that change, we started holding on to the `ZipCoder` instance (corresponding to the `Charset` provided during `ZipFile` construction) in the `Source`. This stored `ZipCoder` was then used for `ZipFile` operations when working with the ZIP entries. As noted previously, any non-UTF8 `ZipCoder` is not thread-safe and as a result, any usages of `ZipCoder` in the `Source` makes `Source` not thread-safe too. That effectively violates the requirement that `Source` must be thread-safe to allow for its usage in multiple different `ZipFile` instances concurrently. This then causes `ZipFile` usages to fail in unexpected ways like the one shown in the linked https://bugs.openjdk.org/browse/JDK-8347712.
>> 
>> The commit in this PR addresses the issue by not maintaining `ZipCoder` as a instance field of `Source`. Instead the `ZipCoder` is now mainta...
>
> Jaikiran Pai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   tiny typo fix in newly introduced documentation

@jaikiran I looked over this and added some comments. Some or most may be personal preference, take whatever you find useful. (Not a Reviewer)

src/java.base/share/classes/java/util/zip/ZipFile.java line 86:

> 84:     private final String fileName;     // name of the file
> 85:     // the ZipCoder instance is derived from the Charset passed to the ZipFile constructor
> 86:     // and will be used for decoding the non-UTF-8 entry names and the ZIP file comment.

Maybe a personal preference, and perhaps I'm too intimitely familiar with this code, but I feel this field comment is overly verbose. 

Not sure how useful it is to document what the instance is derived from, it's not something we do for other fields and maintainers can always inspect the constructor to find out how it's being assigned.

The "non-UTF-8" part is correct, but I feel this information belongs closer to where that decision is made, not here.

The ZipCoder is used when decoding entry comments as well, so the current comment is not fully correct.

Perhaps something like "Used for decoding binary encoded names and comments into strings" would do?

src/java.base/share/classes/java/util/zip/ZipFile.java line 87:

> 85:     // the ZipCoder instance is derived from the Charset passed to the ZipFile constructor
> 86:     // and will be used for decoding the non-UTF-8 entry names and the ZIP file comment.
> 87:     private final ZipCoder entryNameCommentCoder;

Since we do not have separate ZipCoders for decoding the differnt ZIP values, I'm not sure it's useful to put the field names ("name"/"comment") into the instance field name here. Especially if the comment already has this information.

Could we call this just `zc` or `zipCoder`?

src/java.base/share/classes/java/util/zip/ZipFile.java line 87:

> 85:     // the ZipCoder instance is derived from the Charset passed to the ZipFile constructor
> 86:     // and will be used for decoding the non-UTF-8 entry names and the ZIP file comment.
> 87:     private final ZipCoder entryNameCommentCoder;

The Source `ZipCoder` field had a `@Stable` annotation. Any reason why this did not survive the move?

src/java.base/share/classes/java/util/zip/ZipFile.java line 376:

> 374:      * @param pos the entry position
> 375:      */
> 376:     private static boolean useUTF8Coder(final byte[] cen, final int pos) {

This seems mostly used to determine which `ZipCoder` to pick. Could we fold this into the method, calling it `zipCoderForPos` and make it just return the `ZipCoder`, like we had in `Source`?

If it needs to be static, then the non-UTF8 ZipCoder could be passed as a parameter?

If a method to determine whether a CEN entry uses UTF-8 is still needed, then I think it would be more appropriately named  `isUTF8Entry`.

src/java.base/share/classes/java/util/zip/ZipFile.java line 1222:

> 1220:                 table[hsh] = index;
> 1221:                 // Record the CEN offset and the name hash in our hash cell.
> 1222:                 entries[index] = hash;

Seems unrelated to the issue at hand. Perhaps better left for a separate PR, backed by a benchmark.

src/java.base/share/classes/java/util/zip/ZipFile.java line 1426:

> 1424:             private final Charset entryNameCommentCharset;
> 1425: 
> 1426:             public Key(File file, BasicFileAttributes attrs, Charset entryNameCommentCharset) {

I feel this parameter and the field it is assigned to could also just be called `charset`. The comment has the information about what it's used for.

src/java.base/share/classes/java/util/zip/ZipFile.java line 1434:

> 1432:             @Override
> 1433:             public int hashCode() {
> 1434:                 long t = entryNameCommentCharset.hashCode();

This represents a behavioral change, right? Is a CSR warranted?

src/java.base/share/classes/java/util/zip/ZipFile.java line 1744:

> 1742: 
> 1743:                 int entryPos = pos + CENHDR;
> 1744:                 final ZipCoder entryZipCoder;

If you want to construct this lazily, then I think a comment documenting that purpose would be useful here.

-------------

PR Review: https://git.openjdk.org/jdk/pull/23986#pullrequestreview-2675038945
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989545209
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989552146
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989553775
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989558136
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989562779
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989564232
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989566059
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989622836


More information about the core-libs-dev mailing list