RFR: 8322802: Add testing for ZipFile.getEntry respecting the 'Language encoding' flag [v3]

Tue Jan 2 08:05:47 UTC 2024

On Mon, 1 Jan 2024 14:29:50 GMT, Eirik Bjørsnøs <eirbjo at openjdk.org> wrote:

>> Please review this test-only PR which adds test coverage for `ZipFile.getEntry` under certain charset conditions. 
>> 
>> When `ZipFile.getEntry` is called for an entry which has the `Language encoding flag` general purpose bit flag set,  then `ZipCoder.UTF8` is used unconditionally, even when a different charset was supplied to the `ZipFile` constructor.
>> 
>> It turns out we do not have any testing for this particular case, as can be verified by commenting out the following line of code in `ZipFile.Source.getEntryPos`:
>> 
>> 
>> //ZipCoder zc = zipCoderForPos(pos);
>> ``` 
>> 
>> and then running `make test TEST="test/jdk/java/util/zip"`
>> 
>> The current test verifies that the correct ZipCoder is used by `ZipFile.entries()`, but does not exercise `ZipFile.getEntry` the same way.
>> 
>> Seeing that [JDK-7009069](https://bugs.openjdk.org/browse/JDK-7009069) was (accidentally?) fixed by [JDK-8243469](https://bugs.openjdk.org/browse/JDK-8243469), I think it is worthwhile to add explicit testing for this case to avoid regressions.
>> 
>> While visiting `ZipCoding.java`, I took the opportunity to convert it to JUnit 5. The conversion and modernization of the code is done in the first commit 1384850ed51ec845af06dd6d13616f20f8bbaa6a in this PR, while the second commit 1776b258b0fe8383709ae0c095f2631a4e6237f6 actually adds the code required to verify the `Language encoding flag` condition for `ZipFile.getEntry`.
>> 
>> Testing: Verified that the test indeed fails when `ZipFile.Source.getEntryPos` is updated to use the ZipFile's ZipCoder as suggested above.
>
> Eirik Bjørsnøs has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Add @bug tag for 8322802

This is a test-only change PR and the changes being proposed look good to me.

On a general note - I hadn't paid attention to this ZipFile constructor which accepts a character encoding, before. I read up on the Zip file specific today (https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT) to understand it a bit more. Interestingly, in that spec (if I read it correctly), unless the `0x0008` "extended language encoding" header is set, there isn't anything that says/allows the entry name or entry comment to be anything other than either UTF-8 or "IBM Code Page 437" character encoding:

APPENDIX D - Language Encoding (EFS)
------------------------------------

D.1 The ZIP format has historically supported only the original IBM PC character 
encoding set, commonly referred to as IBM Code Page 437.  This limits storing 
file name characters to only those within the original MS-DOS range of values 
and does not properly support file names in other character encodings, or 
languages. To address this limitation, this specification will support the 
following change. 

D.2 If general purpose bit 11 is unset, the file name and comment SHOULD conform 
to the original ZIP character encoding.  If general purpose bit 11 is set, the 
filename and comment MUST support The Unicode Standard, Version 4.1.0 or 
greater using the character encoding form defined by the UTF-8 storage 
specification.  ....
....

D.3 Applications MAY choose to supplement this file name storage through the use 
of the 0x0008 Extra Field.  Storage for this optional field is currently 
undefined, however it will be used to allow storing extended information 
on source or target encoding that MAY further assist applications with file 
name, or file content encoding tasks.  Please contact PKWARE with any
requirements on how this field SHOULD be used.

D.4 The 0x0008 Extra Field storage MAY be used with either setting for general 
purpose bit 11.  Examples of the intended usage for this field is to store 
whether "modified-UTF-8" (JAVA) is used, or UTF-8-MAC.  Similarly, other 
commonly used character encoding (code page) designations can be indicated 
through this field.  Formalized values for use of the 0x0008 record remain 
undefined at this time.  The definition for the layout of the 0x0008 field
will be published when available.  Use of the 0x0008 Extra Field provides
for storing data within a ZIP file in an encoding other than IBM Code
Page 437 or UTF-8.

The change to allow user/application specific arbritary charsets to the `ZipFile` constructor seems to have been done long back in Java 1.7 days as part of https://bugs.openjdk.org/browse/JDK-4244499.

This is merely an observation and I don't expect any changes to the ZipFile source or any of your proposed tests, in this context.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17207#issuecomment-1873711203