RFR: 8243254: Examine ZipFile slash optimization for non-ASCII compatible charsets

Martin Buchholz martinrb at google.com
Tue Apr 21 17:19:38 UTC 2020


Thanks for doing this.  (even though I would not have)

I'd be more inclined to have separate code for UTF-8 encoded zip files,
that everyone should be using now, rather than have special code for
ASCII-compatible.  Perhaps a single ZipCoder subclass Utf8ZipCoder.

I am the author of that test infrastructure boiler plate you see at the
bottom of tests.  It's not useful to modify that.  More useful would be to
migrate tests to junit or testng, but that's a lot of work.

Are we really willing to maintain tests for UTF-8, ASCII-compatible, and
ASCII-incompatible code paths?

ASCII-compatibility is also deeply embedded in the original C code, which
is still there

/*
 * Returns true if the specified entry's name begins with the string
 * "META-INF/" irrespective of case.
 */
static int
isMetaName(const char *name, int length)
{
    const char *s;
    if (length < (int)sizeof("META-INF/") - 1)
        return 0;
    for (s = "META-INF/"; *s != '\0'; s++) {
        char c = *name++;
        // Avoid toupper; it's locale-dependent
        if (c >= 'a' && c <= 'z') c += 'A' - 'a';
        if (*s != c)
            return 0;
    }
    return 1;
}

On Tue, Apr 21, 2020 at 6:51 AM Claes Redestad <claes.redestad at oracle.com>
wrote:

> Hi,
>
> ZipFile.getEntry does optimizations to check for directory entries by
> adding a '/' to the encoded byte array. JDK-8242959 improved on this
> optimization, but a question was raised Jason Zaugg on whether the
> optimization is always valid[1]. Turns out it isn't, but only for
> non-ASCII compatible charsets.
>
> While JarFiles are always assumed to be UTF-8-encoded or compatible,
> ZipFiles allow arbitrary encoding. E.g., UTF-16 would encode '/' (2F) as
> either 2F 00 or 00 2F, which means the hash code would differ and a
> directory "foo/" potentially not be found when looking up "foo". Further
> complications arise when/if the directory name ends with a code point
> that might be encoded so that the final byte is 2F, e.g. \u012F.
>
> This patch only enables this low-level optimization when the charset
> encoding used is known to be ASCII compatible in the sense that 2F will
> be encoded as single-byte 2F. UTF-8 is compatible in this sense - and
> since this is the charset exclusively used by JarFile these changes have
> little to no effect on startup performance in the cases we've been
> looking at.
>
> Webrev: http://cr.openjdk.java.net/~redestad/8243254/open.00/
> Bug:    https://bugs.openjdk.java.net/browse/JDK-8243254
>
> I've partially repurposed ZipFile/TestZipFile to test some such corner
> cases. Mainly expanded it to test ZipFiles created with various non-
> standard encodings, but also slimmed it down so that it can actually run
> quickly and reliably - as well as enabled it in regular testing. The
> updated test fails both before and after JDK-8242959, but passes with
> the proposed changes.
>
> Testing: tier1+2
>
> Thanks!
>
> /Claes
>
> [1] https://twitter.com/retronym/status/1252134723431747584?s=20
>


More information about the core-libs-dev mailing list