Improving ZipFile.getEntryPos double loop queries

Claes Redestad claes.redestad at oracle.com
Thu Apr 16 12:26:17 UTC 2020


Hi Eirik,

On 2020-04-16 13:47, Eirik Bjørsnøs wrote:
> On Thu, Apr 16, 2020 at 12:59 PM Claes Redestad 
> <claes.redestad at oracle.com <mailto:claes.redestad at oracle.com>> wrote:
> 
> 
>     I think this is an interesting idea and a good optimization.
> 
> 
> Looks like we get most of the performance win of Bloom filters while not 
> introducing regressions for hits, footprint or complexity.

Agreed. Footprint neutral, sizeable gain and IMO just shifts existing
complexity around a bit.

> 
>     We'll
>     deliberately cause a few more hash collisions, 
> 
> 
> Will ZIP files commonly include entries for both "META-INF" and "META-INF/"?

True, it should be an either/or for typical jar files.

> 
>     Some issues:
> 
>     - Need to be made to work on empty strings (hashN([], 0, [].length) ->
>     a[off + len - 1] -> a[-1] -> AIOOBE)
> 
> 
> Oops. Should be safe to check for len > 0, right >
>      if(len > 0 && a[off + len -1] == '/')
>          len--;

That should work.

> 
>     - If name actually ends with a / we shouldn't look for name + /,
>     e.g.:
> 
>          (addSlash && name.length == nlen - 1 && name.length > 0 &&
>     name[name.length - 1] != '/' && cen[pos + CENHDR + nlen - 1] == '/')
> 
> 
> Fixed and added some comments for readability.

Thanks!

> 
> Here's an updated patch:
> 
> diff --git a/src/java.base/share/classes/java/util/zip/ZipFile.java 
> b/src/java.base/share/classes/java/util/zip/ZipFile.java
> index 274e8788d1..ba5101e0db 100644
> --- a/src/java.base/share/classes/java/util/zip/ZipFile.java
> +++ b/src/java.base/share/classes/java/util/zip/ZipFile.java
> @@ -1266,6 +1266,11 @@ public class ZipFile implements ZipConstants, 
> Closeable {
>           }
> 
>           private static final int hashN(byte[] a, int off, int len) {
> +            // Performance optimization: getEntryPos assumes that the hash
> +            // of a name remains unchanged when appending a trailing  '/'.
> +            if(len > 0 && a[off + len -1] == '/')
> +                len--;
> +
>               int h = 1;
>               while (len-- > 0) {
>                   h = 31 * h + a[off++];
> @@ -1273,10 +1278,6 @@ public class ZipFile implements ZipConstants, 
> Closeable {
>               return h;
>           }
> 
> -        private static final int hash_append(int hash, byte b) {
> -            return hash * 31 + b;
> -        }
> -
>           private static class End {
>               int  centot;     // 4 bytes
>               long cenlen;     // 4 bytes
> @@ -1493,48 +1494,36 @@ public class ZipFile implements ZipConstants, 
> Closeable {
>               int hsh = hashN(name, 0, name.length);
>               int idx = table[(hsh & 0x7fffffff) % tablelen];
>               /*
> -             * This while loop is an optimization where a double lookup
> -             * for name and name+/ is being performed. The name char
> -             * array has enough room at the end to try again with a
> -             * slash appended if the first table lookup does not succeed.
> +             * Search down the target hash chain for a entry whose
> +             * 32 bit hash matches the hashed name.
>                */
> -            while (true) {
> -                /*
> -                 * Search down the target hash chain for a entry whose
> -                 * 32 bit hash matches the hashed name.
> -                 */
> -                while (idx != ZIP_ENDCHAIN) {
> -                    if (getEntryHash(idx) == hsh) {
> -                        // The CEN name must match the specfied one
> -                        int pos = getEntryPos(idx);
> -                        if (name.length == CENNAM(cen, pos)) {
> -                            boolean matched = true;
> -                            int nameoff = pos + CENHDR;
> -                            for (int i = 0; i < name.length; i++) {
> -                                if (name[i] != cen[nameoff++]) {
> -                                    matched = false;
> -                                    break;
> -                                }
> -                            }
> -                            if (matched) {
> -                                return pos;
> +            while (idx != ZIP_ENDCHAIN) {
> +                if (getEntryHash(idx) == hsh) {
> +                    // The CEN name must match the specfied one
> +                    int pos = getEntryPos(idx);
> +                    final int nlen = CENNAM(cen, pos);
> +                    final int nstart = pos + CENHDR;
> +                    final boolean allowSlash = addSlash       // we 
> should match names with a trailing '/'
> +                            && nlen == name.length + 1        // nlen 
> has one extra byte
> +                            && cen[nstart + nlen - 1] == '/'  // that 
> byte is a '/'
> +                            && name[name.length - 1] != '/';  // the 
> name does not already end with '/'

I'd put a long comment like this on separate lines - before allowSlash.

> +                    if (name.length == nlen || allowSlash) {
> +                        boolean matched = true;
> +                        int nameoff = pos + CENHDR;
> +                        for (int i = 0; i < name.length; i++) {
> +                            if (name[i] != cen[nameoff++]) {
> +                                matched = false;
> +                                break;
>                               }
> -                         }
> +                        }
> +                        if (matched) {
> +                            return pos;
> +                        }
>                       }
> -                    idx = getEntryNext(idx);
> -                }
> -                /* If not addSlash, or slash is already there, we are 
> done */
> -                if (!addSlash  || name.length == 0 || name[name.length 
> - 1] == '/') {
> -                     return -1;
>                   }
> -                /* Add slash and try once more */
> -                name = Arrays.copyOf(name, name.length + 1);
> -                name[name.length - 1] = '/';
> -                hsh = hash_append(hsh, (byte)'/');
> -                //idx = table[hsh % tablelen];
> -                idx = table[(hsh & 0x7fffffff) % tablelen];
> -                addSlash = false;
> +                idx = getEntryNext(idx);
>               }
> +            return -1;
>           }
> 
>           /**
> 

Looks good, modulo some style nits. I'll sponsor.

Thanks!

/Claes


More information about the core-libs-dev mailing list