RFR: 8347712: IllegalStateException on multithreaded ZipFile access with non-UTF8 charset [v6]
Eirik Bjørsnøs
eirbjo at openjdk.org
Mon Mar 24 15:18:13 UTC 2025
On Sun, 23 Mar 2025 09:28:38 GMT, Jaikiran Pai <jpai at openjdk.org> wrote:
>> Can I please get a review of this change which proposes to fix an issue in `java.util.zip.ZipFile` which would cause failures when multiple instances of `ZipFile` using a non-UTF8 `Charset` were operating against the same underlying ZIP file? This addresses https://bugs.openjdk.org/browse/JDK-8347712.
>>
>> The ZIP file specification allows ZIP entries to set a `UTF-8` flag to indicate that the entry name and comment are encoded using UTF-8. A `java.util.zip.ZipFile` can be constructed by passing it a `Charset`. This `Charset` (which defaults to UTF-8) gets used for decoding entry names and comments for non-UTF8 entries.
>>
>> The internal implementation of `ZipFile` uses a `ZipCoder` (backed by `java.nio.charset.CharsetEncoder`/`CharsetDecoder` instances) for the given `Charset`. Except for the UTF8 `ZipCoder`, other `ZipCoder`s are not thread safe.
>>
>> The internal implementation of `ZipFile` maintains a cache of `ZipFile$Source`. A `Source` corresponds to the underlying ZIP file and, during construction, uses a `ZipCoder` for parsing the ZIP entries; once constructed, it holds on to the parsed ZIP structure. Multiple instances of `ZipFile` which all correspond to the same ZIP file on the filesystem share a single instance of `Source` (after the `Source` has been constructed and cached). Although `ZipFile` instances aren't expected to be thread-safe, the fact that multiple different instances of `ZipFile` could be sharing the same instance of `Source` in concurrent threads mandates that the `Source` must be thread-safe.
>>
>> In Java 15, we did a performance optimization through https://bugs.openjdk.org/browse/JDK-8243469. As part of that change, we started holding on to the `ZipCoder` instance (corresponding to the `Charset` provided during `ZipFile` construction) in the `Source`. This stored `ZipCoder` was then used for `ZipFile` operations when working with the ZIP entries. As noted previously, any non-UTF8 `ZipCoder` is not thread-safe and as a result, any usage of `ZipCoder` in the `Source` makes `Source` not thread-safe too. That effectively violates the requirement that `Source` must be thread-safe to allow for its usage in multiple different `ZipFile` instances concurrently. This then causes `ZipFile` usages to fail in unexpected ways like the one shown in the linked https://bugs.openjdk.org/browse/JDK-8347712.
>>
>> The commit in this PR addresses the issue by not maintaining `ZipCoder` as an instance field of `Source`. Instead the `ZipCoder` is now mainta...
>
> Jaikiran Pai has updated the pull request incrementally with three additional commits since the last revision:
>
> - improve code comment for ZipFile.zipCoder
> - Alan's suggestion - change code comment about Source class being thread safe
> - Alan's suggestion - trim the javadoc of (internal) ZipCoder class
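For readers skimming the thread, the arrangement described in the quoted summary can be sketched roughly as follows. This is only an illustration, not the JDK code: `SharedSource` and `ZipHandle` are hypothetical stand-ins for `ZipFile$Source` and `ZipFile`, and it assumes the non-thread-safe coder is held per `ZipFile` instance (as the `ZipFile.zipCoder` comment suggests) while the shared structure stays immutable.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SharedSourceSketch {

    // Hypothetical stand-in for ZipFile$Source: immutable after construction,
    // so any number of threads may read it without locking.
    static final class SharedSource {
        private final byte[] rawName;

        SharedSource(byte[] rawName) {
            this.rawName = rawName.clone();
        }

        ByteBuffer rawName() {
            // Each caller gets a fresh read-only buffer with its own position.
            return ByteBuffer.wrap(rawName).asReadOnlyBuffer();
        }
    }

    // Hypothetical stand-in for ZipFile: owns its decoder, never shares it.
    static final class ZipHandle {
        private final SharedSource source;
        private final CharsetDecoder decoder; // CharsetDecoder is not thread-safe

        ZipHandle(SharedSource source, Charset charset) {
            this.source = source;
            this.decoder = charset.newDecoder();
        }

        String entryName() throws CharacterCodingException {
            return decoder.decode(source.rawName()).toString();
        }
    }

    // One handle per task, all reading the same shared source concurrently.
    static List<String> readConcurrently(int handles) {
        byte[] raw = "caf\u00e9.txt".getBytes(StandardCharsets.ISO_8859_1);
        SharedSource shared = new SharedSource(raw);
        ExecutorService pool = Executors.newFixedThreadPool(handles);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (int i = 0; i < handles; i++) {
                ZipHandle handle = new ZipHandle(shared, StandardCharsets.ISO_8859_1);
                Callable<String> task = handle::entryName;
                pending.add(pool.submit(task));
            }
            List<String> names = new ArrayList<>();
            for (Future<String> f : pending) {
                names.add(f.get());
            }
            return names;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(readConcurrently(4));
    }
}
```

The point of the sketch is only that the decoder state lives with each handle, so sharing the source across threads never means sharing a decoder.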
I made another pass over the updated ZipFile comments and left some remarks.
src/java.base/share/classes/java/util/zip/ZipCoder.java line 181:
> 179:
> 180: /**
> 181: * {@return the {@link Charset} used this {@code ZipCoder}}
Suggestion:
* {@return the {@link Charset} used by this {@code ZipCoder}}
src/java.base/share/classes/java/util/zip/ZipFile.java line 85:
> 83: private final String filePath; // ZIP file path
> 84: private final String fileName; // name of the file
> 85: // ZipCoder for entry names and comments when not using UTF-8
"_when not using UTF-8_" could be confusing.
The ZipCoder here may well be UTF-8; what matters is whether the entry mandates UTF-8 by having its language encoding flag set.
I think we should either clearly explain when this ZipCoder is used (when entries do not mandate UTF-8), or drop the information here and lean on it being explained in `zipCoderFor`.
If we decide to drop it, this could be just:
`// Used when decoding entry names and comments`
If we decide to keep it, then something like:
`// Used when decoding entry names and comments, unless entry flags mandate UTF-8`
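To make the distinction concrete, here is a minimal sketch of the selection rule I have in mind. It is not the actual `zipCoderFor` code: `charsetFor` and `decodeName` are hypothetical helpers, and the only fact taken from the ZIP format is that bit 11 (0x800) of the general purpose bit flag is the language encoding flag.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EntryCharsetSketch {

    // Bit 11 of the general purpose bit flag: the ZIP language encoding flag.
    static final int LANGUAGE_ENCODING_FLAG = 0x800;

    // Hypothetical helper: the entry's flag wins; otherwise fall back to the
    // Charset the ZipFile was constructed with (which may itself be UTF-8).
    static Charset charsetFor(int generalPurposeFlags, Charset zipFileCharset) {
        boolean mandatesUtf8 = (generalPurposeFlags & LANGUAGE_ENCODING_FLAG) != 0;
        return mandatesUtf8 ? StandardCharsets.UTF_8 : zipFileCharset;
    }

    // Hypothetical helper: decode raw entry name bytes with the selected charset.
    static String decodeName(byte[] rawName, int generalPurposeFlags, Charset zipFileCharset) {
        return new String(rawName, charsetFor(generalPurposeFlags, zipFileCharset));
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        System.out.println(charsetFor(0x800, latin1)); // UTF-8
        System.out.println(charsetFor(0, latin1));     // ISO-8859-1
        System.out.println(charsetFor(0, StandardCharsets.UTF_8)); // UTF-8
    }
}
```

The last call is the case the current comment obscures: the fallback coder is UTF-8 even though the entry does not mandate it.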
src/java.base/share/classes/java/util/zip/ZipFile.java line 1145:
> 1143: static record EntryPos(String name, int pos) {}
> 1144:
> 1145: // Implementation note: This class is be thread safe.
Suggestion:
// Implementation note: This class is thread safe.
src/java.base/share/classes/java/util/zip/ZipFile.java line 1432:
> 1430: * where a ZIP file is re-opened after it has been modified).
> 1431: * - The Charset, that was provided when constructing a ZipFile instance,
> 1432: * for reading non-UTF-8 entry names and comments.
I think it would be sufficient to say "The Charset that was provided when constructing the ZipFile instance". Any non-UTF8-ness is better explained elsewhere.
src/java.base/share/classes/java/util/zip/ZipFile.java line 1438:
> 1436: private final BasicFileAttributes attrs;
> 1437: private final File file;
> 1438: // the Charset to be used for processing non-UTF-8 entry names in the ZIP file
Similarly, this could just say "The Charset that was provided when constructing the ZipFile instance".
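As a rough illustration of why the Charset only needs to be described as "the one provided at construction" here: in a key like this it is just one more component of identity. The sketch below is a simplified, hypothetical model, not the real `Key` class; it uses a plain timestamp where the real code holds `BasicFileAttributes`, and it assumes the Charset takes part in key equality.

```java
import java.io.File;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SourceKeySketch {

    // Hypothetical, simplified key. The record's generated equals/hashCode
    // cover all three components, including the Charset.
    record Key(File file, long lastModified, Charset charset) {}

    // Hypothetical stand-in for the cached, shared Source.
    static final class Source {
        final Key key;

        Source(Key key) {
            this.key = key;
        }
    }

    private static final Map<Key, Source> CACHE = new ConcurrentHashMap<>();

    // Returns the cached Source for an equal key, creating it at most once.
    static Source sourceFor(File file, long lastModified, Charset charset) {
        return CACHE.computeIfAbsent(new Key(file, lastModified, charset), Source::new);
    }

    public static void main(String[] args) {
        File zip = new File("example.zip"); // used only as a key, never opened
        Source a = sourceFor(zip, 1L, StandardCharsets.ISO_8859_1);
        Source b = sourceFor(zip, 1L, StandardCharsets.ISO_8859_1);
        Source c = sourceFor(zip, 1L, StandardCharsets.UTF_8);
        System.out.println(a == b); // true: same file, timestamp and charset
        System.out.println(a == c); // false: the charset differs
    }
}
```

Nothing in the key needs to know whether the Charset is UTF-8 or not, which is why I would keep that detail out of these two comments.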
-------------
PR Review: https://git.openjdk.org/jdk/pull/23986#pullrequestreview-2710281436
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r2010124278
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r2010184376
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r2010368552
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r2010375936
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r2010377989