RFR: 8243469: Lazily encode name in ZipFile.getEntryPos

Mon Apr 27 09:49:31 UTC 2020

On Sun, Apr 26, 2020 at 11:34 PM Claes Redestad
<claes.redestad at oracle.com> wrote:
>
> Hi again,
>
> On 2020-04-24 21:22, Claes Redestad wrote:
> >> It seems that 'getEntryHitUncached' is getting slightly slower with
> >> your change while all the other variants get significantly faster. I
> >> don't think that's a problem, but do you have an explanation why
> >> that's the case?
> >
> > I've noticed it swing a bit either way, and have been asking myself the
> > same thing. After a little analysis I think it's actually a bug in my
> > microbenchmark: I'm always looking up the same entry, and thus hitting
> > the same bucket in the hash table. If that one has a collision, we'll do
> > a few extra passes. If not, we won't. This might be reflected as a
> > significant swing in either direction.
> >
> > I'm going to try rewriting it to consider more (if not all) entries in
> > the zip file. That should mean the cost averages out a bit.
>
> After I improved my micro to root out sources of variance, the
> performance issue for hits persisted.
>
> Luckily Eirik had a brilliant idea: Why not decode the bytes in the
> cen to a String and compare that, rather than the other way around?
> To some surprise it turns out this gives us about a 1.2x speedup for
> getEntryHit and getEntryHitUncached over open.00 - and comfortably
> just ahead of the baseline on getEntryHitUncached[1]. It also leads to
> slightly cleaner code[2].
>
> Webrev: http://cr.openjdk.java.net/~redestad/8243469/open.01/
>
> The speed-up appears to come from String.equals, which is intrinsified
> and significantly faster than the replaced loop. I profiled allocation
> per operation and it stays the same (EA removes the String).
>
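
To make the idea quoted above concrete, here is a minimal sketch of the two
comparison strategies. It is not the code from the webrev: the class and
method names (CenNameCompare, matchesEncoded, matchesDecoded) are
hypothetical, and UTF-8 is assumed as the entry-name charset.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the ZipFile implementation.
final class CenNameCompare {

    // Earlier approach: encode the lookup name and compare it byte by byte
    // against the name bytes stored in the central directory (CEN).
    static boolean matchesEncoded(byte[] cen, int off, int len, String name) {
        byte[] encoded = name.getBytes(StandardCharsets.UTF_8);
        if (encoded.length != len) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            if (cen[off + i] != encoded[i]) {
                return false;
            }
        }
        return true;
    }

    // Approach discussed above: decode the CEN bytes to a String and let
    // String.equals do the comparison.
    static boolean matchesDecoded(byte[] cen, int off, int len, String name) {
        String decoded = new String(cen, off, len, StandardCharsets.UTF_8);
        return decoded.equals(name);
    }
}
```

Both methods give the same answer for well-formed UTF-8 names; the sketch
shows only the shape of the change and says nothing about its performance.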

Great! Another nice improvement. The changes look good to me.
Just two minor remarks:

In ZipCoder.normalizedHashDecode() you've changed the line:

if (limit > 0 && decoded[limit - 1] != '/') {

to:

if (limit > pos && decoded[limit - 1] != '/') {

which at first was a little confusing to me. But in the end it turns out
that this is semantically the same, because the
CharsetDecoder.decode() method called before is guaranteed to return a
"newly-allocated character buffer" and its "position will be zero and
its limit will follow the last character written". This also explains
why you don't have to take the CharBuffer's "arrayOffset()" into
account if you use the CharBuffer's backing array (because it will
always be 0 for newly created buffers). So maybe you can put in some
comments to make it less confusing for the unsuspecting reader:

CharBuffer cb = decoder().decode(ByteBuffer.wrap(a, off, end - off));
// 'cb' is a newly allocated CharBuffer with 'pos == 0'
int pos = cb.position();
int limit = cb.limit();
char[] decoded;
if (cb.hasArray()) {
    // 'cb.arrayOffset()' is zero for newly allocated CharBuffers
    decoded = cb.array();
} else {
    decoded = new char[limit - pos];
    cb.get(decoded);
}

I think you can also remove the "else" branch (and maybe replace it
with an assertion) because newly allocated CharBuffers are guaranteed
to be backed by an array with array offset zero (see
https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/nio/CharBuffer.html#allocate(int)
).
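
As a rough sketch of what the assertion variant could look like, the
following standalone class checks the buffer properties up front and then
reads the backing array directly. The class name DecodedBufferSketch, the
explicit check (instead of an `assert`), the UTF-8 decoder setup and the
31 * h + c hash loop are my assumptions for illustration. Only the trailing
'/' condition is taken from the change under review.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical standalone sketch; not the ZipCoder source.
final class DecodedBufferSketch {

    static int normalizedHash(byte[] a, int off, int end) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer cb;
        try {
            cb = decoder.decode(ByteBuffer.wrap(a, off, end - off));
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("malformed input", e);
        }
        // Expected for a newly allocated heap CharBuffer: array-backed,
        // array offset zero, position zero. This replaces the else branch.
        if (!cb.hasArray() || cb.arrayOffset() != 0 || cb.position() != 0) {
            throw new IllegalStateException("unexpected CharBuffer layout");
        }
        int limit = cb.limit();
        // The backing array may be longer than 'limit'; read only [0, limit).
        char[] decoded = cb.array();
        int h = 0;
        for (int i = 0; i < limit; i++) {
            h = 31 * h + decoded[i];   // assumed String.hashCode-style hash
        }
        // Normalize: hash as if the name ended with '/'.
        if (limit > 0 && decoded[limit - 1] != '/') {
            h = 31 * h + '/';
        }
        return h;
    }
}
```

With this sketch, the bytes of "foo" and "foo/" hash to the same value,
which is what the trailing-slash normalization is meant to achieve.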

ZipCoder.get() seems to be the only remaining if block without braces.
Maybe you'll want to fix that while you're at it?

    public static ZipCoder get(Charset charset) {
        if (charset == UTF_8.INSTANCE)
            return UTF8;
        return new ZipCoder(charset);
    }

Thumbs up from my side. There's no need for a new webrev.

Best regards,
Volker

> Testing: tier1-4
>
> Thanks!
>
> /Claes
>
> [1]
> Baseline:
> Benchmark                            (size)  Mode  Cnt    Score   Error  Units
> ZipFileGetEntry.getEntryHit             512  avgt   15  126.264 ± 5.297  ns/op
> ZipFileGetEntry.getEntryHit            1024  avgt   15  130.823 ± 7.212  ns/op
> ZipFileGetEntry.getEntryHitUncached     512  avgt   15  152.149 ± 4.978  ns/op
> ZipFileGetEntry.getEntryHitUncached    1024  avgt   15  151.527 ± 4.054  ns/op
>
> open.01:
> Benchmark                            (size)  Mode  Cnt    Score   Error  Units
> ZipFileGetEntry.getEntryHit             512  avgt   15   84.450 ± 5.474  ns/op
> ZipFileGetEntry.getEntryHit            1024  avgt   15   85.224 ± 3.776  ns/op
> ZipFileGetEntry.getEntryHitUncached     512  avgt   15  140.448 ± 4.667  ns/op
> ZipFileGetEntry.getEntryHitUncached    1024  avgt   15  145.046 ± 7.363  ns/op
>
> [2] I stopped short of taking the cleanup a step further by decoding to
> String even in initCEN, which sadly isn't performance neutral:
>
> http://cr.openjdk.java.net/~redestad/8243469/open.01.init_decode/
>
> Something for the future to consider, maybe.