Improving ZipFile.getEntryPos double loop queries
Claes Redestad
claes.redestad at oracle.com
Thu Apr 16 11:00:51 UTC 2020
Hi,
I think this is an interesting idea and a good optimization. We'll
deliberately cause a few more hash collisions, but since we do half as
many hash table lookups in the normal case (and the streaming/iterators
shouldn't care) then that should be fine.
Some issues:
- Need to be made to work on empty strings (hashN([], 0, [].length) ->
a[off + len - 1] -> a[-1] -> AIOOBE)
- If name actually ends with a / we shouldn't look for name + /,
e.g.:
(addSlash && name.length == nlen - 1 && name.length > 0 &&
name[name.length - 1] != '/' && cen[pos + CENHDR + nlen - 1] == '/')
/Claes
On 2020-04-16 09:35, Eirik Bjørsnøs wrote:
> Hi,
>
> ZipEntry.getEntryPos currently has a double loop which retries search after
> a failed lookup by appending a '/' to the name/hash. This means that any
> miss needs to query the hash table twice.
>
> The following patch updates hashN to truncate any '/' at the end of an
> entry name. This ensures that the hash of META-INF and META-INF/ will
> always collide and the double query loop is no longer needed.
>
> I'm seeing ~35% time saving on misses and a slight improvement on hits.
> Removing the double loop also improves readability.
>
> diff --git a/src/java.base/share/classes/java/util/zip/ZipFile.java
> b/src/java.base/share/classes/java/util/zip/ZipFile.java
> index 274e8788d1..fd6a4e276c 100644
> --- a/src/java.base/share/classes/java/util/zip/ZipFile.java
> +++ b/src/java.base/share/classes/java/util/zip/ZipFile.java
> @@ -1266,6 +1266,9 @@ public class ZipFile implements ZipConstants,
> Closeable {
> }
>
> private static final int hashN(byte[] a, int off, int len) {
> + if(a[off + len -1] == '/') {
> + len--;
> + }
> int h = 1;
> while (len-- > 0) {
> h = 31 * h + a[off++];
> @@ -1493,48 +1496,31 @@ public class ZipFile implements ZipConstants,
> Closeable {
> int hsh = hashN(name, 0, name.length);
> int idx = table[(hsh & 0x7fffffff) % tablelen];
> /*
> - * This while loop is an optimization where a double lookup
> - * for name and name+/ is being performed. The name char
> - * array has enough room at the end to try again with a
> - * slash appended if the first table lookup does not succeed.
> + * Search down the target hash chain for a entry whose
> + * 32 bit hash matches the hashed name.
> */
> - while (true) {
> - /*
> - * Search down the target hash chain for a entry whose
> - * 32 bit hash matches the hashed name.
> - */
> - while (idx != ZIP_ENDCHAIN) {
> - if (getEntryHash(idx) == hsh) {
> - // The CEN name must match the specfied one
> - int pos = getEntryPos(idx);
> - if (name.length == CENNAM(cen, pos)) {
> - boolean matched = true;
> - int nameoff = pos + CENHDR;
> - for (int i = 0; i < name.length; i++) {
> - if (name[i] != cen[nameoff++]) {
> - matched = false;
> - break;
> - }
> - }
> - if (matched) {
> - return pos;
> + while (idx != ZIP_ENDCHAIN) {
> + if (getEntryHash(idx) == hsh) {
> + // The CEN name must match the specfied one
> + int pos = getEntryPos(idx);
> + final int nlen = CENNAM(cen, pos);
> + if (name.length == nlen || (addSlash && name.length ==
> nlen - 1 && cen[pos + CENHDR + nlen - 1] == '/')) {
> + boolean matched = true;
> + int nameoff = pos + CENHDR;
> + for (int i = 0; i < name.length; i++) {
> + if (name[i] != cen[nameoff++]) {
> + matched = false;
> + break;
> }
> - }
> + }
> + if (matched) {
> + return pos;
> + }
> }
> - idx = getEntryNext(idx);
> - }
> - /* If not addSlash, or slash is already there, we are done
> */
> - if (!addSlash || name.length == 0 || name[name.length -
> 1] == '/') {
> - return -1;
> }
> - /* Add slash and try once more */
> - name = Arrays.copyOf(name, name.length + 1);
> - name[name.length - 1] = '/';
> - hsh = hash_append(hsh, (byte)'/');
> - //idx = table[hsh % tablelen];
> - idx = table[(hsh & 0x7fffffff) % tablelen];
> - addSlash = false;
> + idx = getEntryNext(idx);
> }
> + return -1;
> }
>
> /**
>
More information about the core-libs-dev
mailing list