RFR: 8243254: Examine ZipFile slash optimization for non-ASCII compatible charsets
Hi, ZipFile.getEntry does optimizations to check for directory entries by adding a '/' to the encoded byte array. JDK-8242959 improved on this optimization, but a question was raised Jason Zaugg on whether the optimization is always valid[1]. Turns out it isn't, but only for non-ASCII compatible charsets. While JarFiles are always assumed to be UTF-8-encoded or compatible, ZipFiles allow arbitrary encoding. E.g., UTF-16 would encode '/' (2F) as either 2F 00 or 00 2F, which means the hash code would differ and a directory "foo/" potentially not be found when looking up "foo". Further complications arise when/if the directory name ends with a code point that might be encoded so that the final byte is 2F, e.g. \u012F. This patch only enables this low-level optimization when the charset encoding used is known to be ASCII compatible in the sense that 2F will be encoded as single-byte 2F. UTF-8 is compatible in this sense - and since this is the charset exclusively used by JarFile these changes have little to no effect on startup performance in the cases we've been looking at. Webrev: http://cr.openjdk.java.net/~redestad/8243254/open.00/ Bug: https://bugs.openjdk.java.net/browse/JDK-8243254 I've partially repurposed ZipFile/TestZipFile to test some such corner cases. Mainly expanded it to test ZipFiles created with various non- standard encodings, but also slimmed it down so that it can actually run quickly and reliably - as well as enabled it in regular testing. The updated test fails both before and after JDK-8242959, but passes with the proposed changes. Testing: tier1+2 Thanks! /Claes [1] https://twitter.com/retronym/status/1252134723431747584?s=20
Thanks for doing this. (even though I would not have) I'd be more inclined to have separate code for UTF-8 encoded zip files, that everyone should be using now, rather than have special code for ASCII-compatible. Perhaps a single ZipCoder subclass Utf8ZipCoder. I am the author of that test infrastructure boiler plate you see at the bottom of tests. It's not useful to modify that. More useful would be to migrate tests to junit or testng, but that's a lot of work. Are we really willing to maintain tests for UTF-8, ASCII-compatible, and ASCII-incompatible code paths? ASCII-compatibility is also deeply embedded in the original C code, which is still there /* * Returns true if the specified entry's name begins with the string * "META-INF/" irrespective of case. */ static int isMetaName(const char *name, int length) { const char *s; if (length < (int)sizeof("META-INF/") - 1) return 0; for (s = "META-INF/"; *s != '\0'; s++) { char c = *name++; // Avoid toupper; it's locale-dependent if (c >= 'a' && c <= 'z') c += 'A' - 'a'; if (*s != c) return 0; } return 1; } On Tue, Apr 21, 2020 at 6:51 AM Claes Redestad <claes.redestad@oracle.com> wrote:
Hi,
ZipFile.getEntry does optimizations to check for directory entries by adding a '/' to the encoded byte array. JDK-8242959 improved on this optimization, but a question was raised Jason Zaugg on whether the optimization is always valid[1]. Turns out it isn't, but only for non-ASCII compatible charsets.
While JarFiles are always assumed to be UTF-8-encoded or compatible, ZipFiles allow arbitrary encoding. E.g., UTF-16 would encode '/' (2F) as either 2F 00 or 00 2F, which means the hash code would differ and a directory "foo/" potentially not be found when looking up "foo". Further complications arise when/if the directory name ends with a code point that might be encoded so that the final byte is 2F, e.g. \u012F.
This patch only enables this low-level optimization when the charset encoding used is known to be ASCII compatible in the sense that 2F will be encoded as single-byte 2F. UTF-8 is compatible in this sense - and since this is the charset exclusively used by JarFile these changes have little to no effect on startup performance in the cases we've been looking at.
Webrev: http://cr.openjdk.java.net/~redestad/8243254/open.00/ Bug: https://bugs.openjdk.java.net/browse/JDK-8243254
I've partially repurposed ZipFile/TestZipFile to test some such corner cases. Mainly expanded it to test ZipFiles created with various non- standard encodings, but also slimmed it down so that it can actually run quickly and reliably - as well as enabled it in regular testing. The updated test fails both before and after JDK-8242959, but passes with the proposed changes.
Testing: tier1+2
Thanks!
/Claes
[1] https://twitter.com/retronym/status/1252134723431747584?s=20
Hi, On 2020-04-21 19:19, Martin Buchholz wrote:
Thanks for doing this. (even though I would not have)
at least I learned a few things doing it. :-)
I'd be more inclined to have separate code for UTF-8 encoded zip files, that everyone should be using now, rather than have special code for ASCII-compatible. Perhaps a single ZipCoder subclass Utf8ZipCoder.
Yes, I think this is reasonable path forward. Eirik has been sending me suggestions to improve on this in similar ways, too. Since we have a few additional optimizations queued up, where it might be useful or even necessary to split the UTF-8 implementation out into a fast path to keep it both correct and sane, I'd like to hold back on well-meaning improvements from this bug fix and focus on making sure the behavior is correct.
I am the author of that test infrastructure boiler plate you see at the bottom of tests. It's not useful to modify that. More useful would be to migrate tests to junit or testng, but that's a lot of work.
IDE saw unused code. :-) I can revert those changes as they're not important.
Are we really willing to maintain tests for UTF-8, ASCII-compatible, and ASCII-incompatible code paths?
If we aren't then I think we should terminally deprecate all the ZipFile constructors that take a Charset argument. Not a call I can make.
ASCII-compatibility is also deeply embedded in the original C code, which is still there
Yep, the assumption go way back, but as long as it's functionality only relevant to jar files it's fine in both impls. Is there any usage left of the C implementation these days? /Claes
/* * Returns true if the specified entry's name begins with the string * "META-INF/" irrespective of case. */ static int isMetaName(const char *name, int length) { const char *s; if (length < (int)sizeof("META-INF/") - 1) return 0; for (s = "META-INF/"; *s != '\0'; s++) { char c = *name++; // Avoid toupper; it's locale-dependent if (c >= 'a' && c <= 'z') c += 'A' - 'a'; if (*s != c) return 0; } return 1; }
On Tue, Apr 21, 2020 at 6:51 AM Claes Redestad <claes.redestad@oracle.com <mailto:claes.redestad@oracle.com>> wrote:
Hi,
ZipFile.getEntry does optimizations to check for directory entries by adding a '/' to the encoded byte array. JDK-8242959 improved on this optimization, but a question was raised Jason Zaugg on whether the optimization is always valid[1]. Turns out it isn't, but only for non-ASCII compatible charsets.
While JarFiles are always assumed to be UTF-8-encoded or compatible, ZipFiles allow arbitrary encoding. E.g., UTF-16 would encode '/' (2F) as either 2F 00 or 00 2F, which means the hash code would differ and a directory "foo/" potentially not be found when looking up "foo". Further complications arise when/if the directory name ends with a code point that might be encoded so that the final byte is 2F, e.g. \u012F.
This patch only enables this low-level optimization when the charset encoding used is known to be ASCII compatible in the sense that 2F will be encoded as single-byte 2F. UTF-8 is compatible in this sense - and since this is the charset exclusively used by JarFile these changes have little to no effect on startup performance in the cases we've been looking at.
Webrev: http://cr.openjdk.java.net/~redestad/8243254/open.00/ Bug: https://bugs.openjdk.java.net/browse/JDK-8243254
I've partially repurposed ZipFile/TestZipFile to test some such corner cases. Mainly expanded it to test ZipFiles created with various non- standard encodings, but also slimmed it down so that it can actually run quickly and reliably - as well as enabled it in regular testing. The updated test fails both before and after JDK-8242959, but passes with the proposed changes.
Testing: tier1+2
Thanks!
/Claes
[1] https://twitter.com/retronym/status/1252134723431747584?s=20
Are we really willing to maintain tests for UTF-8, ASCII-compatible, and ASCII-incompatible code paths?
If we aren't then I think we should terminally deprecate all the ZipFile constructors that take a Charset argument. Not a call I can make.
I mean more generally. You're proposing to fix zip files where the metadata is UTF-16 encoded. But I suspect we aren't testing such zip files at all, and I suspect it's broken in other ways, that we might discover by expanding our zip tests to fully cover such zip files.
ASCII-compatibility is also deeply embedded in the original C code, which is still there
Yep, the assumption go way back, but as long as it's functionality only relevant to jar files it's fine in both impls.
Is there any usage left of the C implementation these days?
Bootclassloader likely doesn't use java code! So -Xbootclasspath/a is implemented using that C code. --- The number one reason to make a zip test "manual" is that it consumes too many resources to run all the time.
On 2020-04-21 20:30, Martin Buchholz wrote:
> > Are we really willing to maintain tests for UTF-8, ASCII-compatible, and > ASCII-incompatible code paths?
If we aren't then I think we should terminally deprecate all the ZipFile constructors that take a Charset argument. Not a call I can make.
I mean more generally. You're proposing to fix zip files where the metadata is UTF-16 encoded.
But I suspect we aren't testing such zip files at all, and I suspect it's broken in other ways, that we might discover by expanding our zip tests to fully cover such zip files.
AFAICT there's a certain scarcity of bug reports of the "this zip file can't be read by java" variety. I'm not sure what you're suggesting here, though, and while there's surely other dragons in this code I'm sure that's a strong argument against fixing issues as we find them? At least until we decide to deprecate them. I'll go sleep on this now, but I think we should probably go ahead with your suggestion to optimize for UTF-8 - while providing a simpler, slower but more easily checked implementation for every other encoding. Depending on how far one goes down that path, quite a few cleanups can be done to the UTF-8 implementation - which might be cause for additional speed-ups.
> > ASCII-compatibility is also deeply embedded in the original C code, > which is still there
Yep, the assumption go way back, but as long as it's functionality only relevant to jar files it's fine in both impls.
Is there any usage left of the C implementation these days?
Bootclassloader likely doesn't use java code! So -Xbootclasspath/a is implemented using that C code.
That's true. Found the relevant places in hotspot that call into libzip/zip_util. Since we can't prepend the bootclasspath any more it seems like it could be rewritten to call java code rather than link into libzip, though.
--- The number one reason to make a zip test "manual" is that it consumes too many resources to run all the time.
That assumes there's someone actually running the tests from time to time. Several of TestZipFile seem to fail even after overcoming the OOMEs. I guess I should leave a version of it intact, file a bug to investigate failures, then create a new slimmed down test that we can add to automatic test coverage. After slimming down TestZipFile it runs in a matter of seconds on my setups, but if I only focus on the reproducer for this issue it should be faster still. /Claes
Hi, new webrev based on discussions here and offline: http://cr.openjdk.java.net/~redestad/8243254/open.01/ - creates a new testng test TestZipFileEncodings, which derives from TestZipFile rather than repurposing that existing test. There is definitely some overlap, but since TestZipFile is only run manually and this new test runs fine (and quickly) in our testing I think that's fine - only implements the optimization for UTF-8, removing the concept of ASCII-compatible ZipCoders. - a few other cleanups Testing: tier1+2 Thanks! /Claes On 2020-04-21 15:52, Claes Redestad wrote:
Hi,
ZipFile.getEntry does optimizations to check for directory entries by adding a '/' to the encoded byte array. JDK-8242959 improved on this optimization, but a question was raised Jason Zaugg on whether the optimization is always valid[1]. Turns out it isn't, but only for non-ASCII compatible charsets.
While JarFiles are always assumed to be UTF-8-encoded or compatible, ZipFiles allow arbitrary encoding. E.g., UTF-16 would encode '/' (2F) as either 2F 00 or 00 2F, which means the hash code would differ and a directory "foo/" potentially not be found when looking up "foo". Further complications arise when/if the directory name ends with a code point that might be encoded so that the final byte is 2F, e.g. \u012F.
This patch only enables this low-level optimization when the charset encoding used is known to be ASCII compatible in the sense that 2F will be encoded as single-byte 2F. UTF-8 is compatible in this sense - and since this is the charset exclusively used by JarFile these changes have little to no effect on startup performance in the cases we've been looking at.
Webrev: http://cr.openjdk.java.net/~redestad/8243254/open.00/ Bug: https://bugs.openjdk.java.net/browse/JDK-8243254
I've partially repurposed ZipFile/TestZipFile to test some such corner cases. Mainly expanded it to test ZipFiles created with various non- standard encodings, but also slimmed it down so that it can actually run quickly and reliably - as well as enabled it in regular testing. The updated test fails both before and after JDK-8242959, but passes with the proposed changes.
Testing: tier1+2
Thanks!
/Claes
[1] https://twitter.com/retronym/status/1252134723431747584?s=20
Hi Claes, The latest version looks good. Thank you for the patch. Best Lance
On Apr 22, 2020, at 6:26 AM, Claes Redestad <claes.redestad@oracle.com> wrote:
Hi,
new webrev based on discussions here and offline:
http://cr.openjdk.java.net/~redestad/8243254/open.01/
- creates a new testng test TestZipFileEncodings, which derives from TestZipFile rather than repurposing that existing test. There is definitely some overlap, but since TestZipFile is only run manually and this new test runs fine (and quickly) in our testing I think that's fine
- only implements the optimization for UTF-8, removing the concept of ASCII-compatible ZipCoders.
- a few other cleanups
Testing: tier1+2
Thanks!
/Claes
On 2020-04-21 15:52, Claes Redestad wrote:
Hi, ZipFile.getEntry does optimizations to check for directory entries by adding a '/' to the encoded byte array. JDK-8242959 improved on this optimization, but a question was raised Jason Zaugg on whether the optimization is always valid[1]. Turns out it isn't, but only for non-ASCII compatible charsets. While JarFiles are always assumed to be UTF-8-encoded or compatible, ZipFiles allow arbitrary encoding. E.g., UTF-16 would encode '/' (2F) as either 2F 00 or 00 2F, which means the hash code would differ and a directory "foo/" potentially not be found when looking up "foo". Further complications arise when/if the directory name ends with a code point that might be encoded so that the final byte is 2F, e.g. \u012F. This patch only enables this low-level optimization when the charset encoding used is known to be ASCII compatible in the sense that 2F will be encoded as single-byte 2F. UTF-8 is compatible in this sense - and since this is the charset exclusively used by JarFile these changes have little to no effect on startup performance in the cases we've been looking at. Webrev: http://cr.openjdk.java.net/~redestad/8243254/open.00/ Bug: https://bugs.openjdk.java.net/browse/JDK-8243254 I've partially repurposed ZipFile/TestZipFile to test some such corner cases. Mainly expanded it to test ZipFiles created with various non- standard encodings, but also slimmed it down so that it can actually run quickly and reliably - as well as enabled it in regular testing. The updated test fails both before and after JDK-8242959, but passes with the proposed changes. Testing: tier1+2 Thanks! /Claes [1] https://twitter.com/retronym/status/1252134723431747584?s=20
<http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037 Oracle Java Engineering 1 Network Drive Burlington, MA 01803 Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
+1 Naoto On 4/22/20 9:13 AM, Lance Andersen wrote:
Hi Claes,
The latest version looks good.
Thank you for the patch.
Best Lance
On Apr 22, 2020, at 6:26 AM, Claes Redestad <claes.redestad@oracle.com> wrote:
Hi,
new webrev based on discussions here and offline:
http://cr.openjdk.java.net/~redestad/8243254/open.01/
- creates a new testng test TestZipFileEncodings, which derives from TestZipFile rather than repurposing that existing test. There is definitely some overlap, but since TestZipFile is only run manually and this new test runs fine (and quickly) in our testing I think that's fine
- only implements the optimization for UTF-8, removing the concept of ASCII-compatible ZipCoders.
- a few other cleanups
Testing: tier1+2
Thanks!
/Claes
On 2020-04-21 15:52, Claes Redestad wrote:
Hi, ZipFile.getEntry does optimizations to check for directory entries by adding a '/' to the encoded byte array. JDK-8242959 improved on this optimization, but a question was raised Jason Zaugg on whether the optimization is always valid[1]. Turns out it isn't, but only for non-ASCII compatible charsets. While JarFiles are always assumed to be UTF-8-encoded or compatible, ZipFiles allow arbitrary encoding. E.g., UTF-16 would encode '/' (2F) as either 2F 00 or 00 2F, which means the hash code would differ and a directory "foo/" potentially not be found when looking up "foo". Further complications arise when/if the directory name ends with a code point that might be encoded so that the final byte is 2F, e.g. \u012F. This patch only enables this low-level optimization when the charset encoding used is known to be ASCII compatible in the sense that 2F will be encoded as single-byte 2F. UTF-8 is compatible in this sense - and since this is the charset exclusively used by JarFile these changes have little to no effect on startup performance in the cases we've been looking at. Webrev: http://cr.openjdk.java.net/~redestad/8243254/open.00/ Bug: https://bugs.openjdk.java.net/browse/JDK-8243254 I've partially repurposed ZipFile/TestZipFile to test some such corner cases. Mainly expanded it to test ZipFiles created with various non- standard encodings, but also slimmed it down so that it can actually run quickly and reliably - as well as enabled it in regular testing. The updated test fails both before and after JDK-8242959, but passes with the proposed changes. Testing: tier1+2 Thanks! /Claes [1] https://twitter.com/retronym/status/1252134723431747584?s=20
<http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037 Oracle Java Engineering 1 Network Drive Burlington, MA 01803 Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
Lance, Naoto, thanks for reviewing! /Claes On 2020-04-22 18:19, naoto.sato@oracle.com wrote:
+1
Naoto
On 4/22/20 9:13 AM, Lance Andersen wrote:
Hi Claes,
The latest version looks good.
Thank you for the patch.
Best Lance
On Apr 22, 2020, at 6:26 AM, Claes Redestad <claes.redestad@oracle.com> wrote:
Hi,
new webrev based on discussions here and offline:
http://cr.openjdk.java.net/~redestad/8243254/open.01/
- creates a new testng test TestZipFileEncodings, which derives from TestZipFile rather than repurposing that existing test. There is definitely some overlap, but since TestZipFile is only run manually and this new test runs fine (and quickly) in our testing I think that's fine
- only implements the optimization for UTF-8, removing the concept of ASCII-compatible ZipCoders.
- a few other cleanups
Testing: tier1+2
Thanks!
/Claes
On 2020-04-21 15:52, Claes Redestad wrote:
Hi, ZipFile.getEntry does optimizations to check for directory entries by adding a '/' to the encoded byte array. JDK-8242959 improved on this optimization, but a question was raised Jason Zaugg on whether the optimization is always valid[1]. Turns out it isn't, but only for non-ASCII compatible charsets. While JarFiles are always assumed to be UTF-8-encoded or compatible, ZipFiles allow arbitrary encoding. E.g., UTF-16 would encode '/' (2F) as either 2F 00 or 00 2F, which means the hash code would differ and a directory "foo/" potentially not be found when looking up "foo". Further complications arise when/if the directory name ends with a code point that might be encoded so that the final byte is 2F, e.g. \u012F. This patch only enables this low-level optimization when the charset encoding used is known to be ASCII compatible in the sense that 2F will be encoded as single-byte 2F. UTF-8 is compatible in this sense - and since this is the charset exclusively used by JarFile these changes have little to no effect on startup performance in the cases we've been looking at. Webrev: http://cr.openjdk.java.net/~redestad/8243254/open.00/ Bug: https://bugs.openjdk.java.net/browse/JDK-8243254 I've partially repurposed ZipFile/TestZipFile to test some such corner cases. Mainly expanded it to test ZipFiles created with various non- standard encodings, but also slimmed it down so that it can actually run quickly and reliably - as well as enabled it in regular testing. The updated test fails both before and after JDK-8242959, but passes with the proposed changes. Testing: tier1+2 Thanks! /Claes [1] https://twitter.com/retronym/status/1252134723431747584?s=20
<http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif> <http://oracle.com/us/design/oracle-email-sig-198324.gif>Lance Andersen| Principal Member of Technical Staff | +1.781.442.2037 Oracle Java Engineering 1 Network Drive Burlington, MA 01803 Lance.Andersen@oracle.com <mailto:Lance.Andersen@oracle.com>
participants (4)
-
Claes Redestad
-
Lance Andersen
-
Martin Buchholz
-
naoto.sato@oracle.com