RFR: 8243254: Examine ZipFile slash optimization for non-ASCII compatible charsets
Claes Redestad
claes.redestad at oracle.com
Tue Apr 21 18:15:48 UTC 2020
Hi,
On 2020-04-21 19:19, Martin Buchholz wrote:
> Thanks for doing this. (even though I would not have)
at least I learned a few things doing it. :-)
>
> I'd be more inclined to have separate code for UTF-8 encoded zip files,
> that everyone should be using now, rather than have special code for
> ASCII-compatible. Perhaps a single ZipCoder subclass Utf8ZipCoder.
Yes, I think this is reasonable path forward. Eirik has been sending me
suggestions to improve on this in similar ways, too.
Since we have a few additional optimizations queued up, where it might
be useful or even necessary to split the UTF-8 implementation out into a
fast path to keep it both correct and sane, I'd like to hold back on
well-meaning improvements from this bug fix and focus on making sure the
behavior is correct.
>
> I am the author of that test infrastructure boiler plate you see at the
> bottom of tests. It's not useful to modify that. More useful would be
> to migrate tests to junit or testng, but that's a lot of work.
IDE saw unused code. :-) I can revert those changes as they're not
important.
>
> Are we really willing to maintain tests for UTF-8, ASCII-compatible, and
> ASCII-incompatible code paths?
If we aren't then I think we should terminally deprecate all the ZipFile
constructors that take a Charset argument. Not a call I can make.
>
> ASCII-compatibility is also deeply embedded in the original C code,
> which is still there
Yep, the assumption go way back, but as long as it's functionality only
relevant to jar files it's fine in both impls.
Is there any usage left of the C implementation these days?
/Claes
>
> /*
> * Returns true if the specified entry's name begins with the string
> * "META-INF/" irrespective of case.
> */
> static int
> isMetaName(const char *name, int length)
> {
> const char *s;
> if (length < (int)sizeof("META-INF/") - 1)
> return 0;
> for (s = "META-INF/"; *s != '\0'; s++) {
> char c = *name++;
> // Avoid toupper; it's locale-dependent
> if (c >= 'a' && c <= 'z') c += 'A' - 'a';
> if (*s != c)
> return 0;
> }
> return 1;
> }
>
> On Tue, Apr 21, 2020 at 6:51 AM Claes Redestad
> <claes.redestad at oracle.com <mailto:claes.redestad at oracle.com>> wrote:
>
> Hi,
>
> ZipFile.getEntry does optimizations to check for directory entries by
> adding a '/' to the encoded byte array. JDK-8242959 improved on this
> optimization, but a question was raised Jason Zaugg on whether the
> optimization is always valid[1]. Turns out it isn't, but only for
> non-ASCII compatible charsets.
>
> While JarFiles are always assumed to be UTF-8-encoded or compatible,
> ZipFiles allow arbitrary encoding. E.g., UTF-16 would encode '/' (2F) as
> either 2F 00 or 00 2F, which means the hash code would differ and a
> directory "foo/" potentially not be found when looking up "foo". Further
> complications arise when/if the directory name ends with a code point
> that might be encoded so that the final byte is 2F, e.g. \u012F.
>
> This patch only enables this low-level optimization when the charset
> encoding used is known to be ASCII compatible in the sense that 2F will
> be encoded as single-byte 2F. UTF-8 is compatible in this sense - and
> since this is the charset exclusively used by JarFile these changes have
> little to no effect on startup performance in the cases we've been
> looking at.
>
> Webrev: http://cr.openjdk.java.net/~redestad/8243254/open.00/
> Bug: https://bugs.openjdk.java.net/browse/JDK-8243254
>
> I've partially repurposed ZipFile/TestZipFile to test some such corner
> cases. Mainly expanded it to test ZipFiles created with various non-
> standard encodings, but also slimmed it down so that it can actually run
> quickly and reliably - as well as enabled it in regular testing. The
> updated test fails both before and after JDK-8242959, but passes with
> the proposed changes.
>
> Testing: tier1+2
>
> Thanks!
>
> /Claes
>
> [1] https://twitter.com/retronym/status/1252134723431747584?s=20
>
More information about the core-libs-dev
mailing list