RFR: 8243469: Lazily encode name in ZipFile.getEntryPos

Mon Apr 27 10:11:38 UTC 2020

On 2020-04-27 11:49, Volker Simonis wrote:
> On Sun, Apr 26, 2020 at 11:34 PM Claes Redestad
> <claes.redestad at oracle.com> wrote:
>>
>> Hi again,
>>
>> On 2020-04-24 21:22, Claes Redestad wrote:
>>>> It seems that 'getEntryHitUncached' is getting slightly slower with
>>>> your change while all the other variants get significantly faster. I
>>>> don't think that's a problem, but do you have an explanation why
>>>> that's the case?
>>>
>>> I've noticed it swing a bit either way, and have been asking myself the
>>> same thing. After a little analysis I think it's actually a bug in my
>>> microbenchmark: I'm always looking up the same entry, and thus hitting
>>> the same bucket in the hash table. If that one has a collision, we'll do
>>> a few extra passes. If not, we won't. This might be reflected as a
>>> significant swing in either direction.
>>>
>>> I'm going to try rewriting it to consider more (if not all) entries in
>>> the zip file. That should mean the cost averages out a bit.
>>
>> after I improved my micro to root out sources of variance, the
>> performance issue for hits persisted.
>>
>> Luckily Eirik had a brilliant idea: Why not decode the bytes in the
>> cen to a String and compare that, rather than the other way around?
>> To some surprise it turns out this gives us about a ~1.2x speedup for
>> getEntryHit and getEntryHitUncached over open.00 - and comfortably
>> just ahead of the baseline on getEntryHitUncached[1]. It also leads to
>> slightly cleaner code[2].
>>
>> Webrev: http://cr.openjdk.java.net/~redestad/8243469/open.01/
>>
>> The speed-up appears to come from String.equals, which is intrinsified
>> and significantly faster than the replaced loop. I profiled allocation
>> per operation and it stays the same (EA removes the String).
>>
> 
> Great! Another nice improvement. The changes look good to me.

Thanks!

> Following just two minor remarks:
> 
> In ZipCoder.normalizedHashDecode() you've changed the line:
> 
> if (limit > 0 && decoded[limit - 1] != '/') {
> 
> to:
> 
> if (limit > pos && decoded[limit - 1] != '/') {
> 
> which was first a little confusing to me. But in the end it turns out
> that this is semantically the same, because the
> CharsetDecoder.decode() method called before is guaranteed to return a
> "newly-allocated character buffer" and its "position will be zero and
> its limit will follow the last character written". This also explains
> why you don't have to take the CharBuffer's "arrayOffset()" into
> account if you use the CharBuffer's backing array (because it will
> always be 0 for newly created buffers). So maybe you can put in some
> comments to make it less confusing for the ingenuous reader:
> 
> CharBuffer cb = decoder().decode(ByteBuffer.wrap(a, off, end - off));
> // 'cb' is a newly allocated CharBuffer with 'pos == 0'
> int pos = cb.position();
> int limit = cb.limit();
> char[] decoded;
> if (cb.hasArray()) {
>      // 'cb.arrayOffset()' is zero for newly allocated CharBuffers
>      decoded = cb.array();
> } else {
>      decoded = new char[limit - pos];
>      cb.get(decoded);
> }
> 
> I think you can also remove the "else" branch (and maybe replace it
> with an assertion) because newly allocated CharBuffers are guaranteed
> to be backed by an array with array offset zero (see
> https://docs.oracle.com/en/java/javase/14/docs/api/java.base/java/nio/CharBuffer.html#allocate(int)
> ).

Yes, it does seem the specification is pretty strong here and we can
assume that pos == 0, arrayOffset == 0 and cb.hasArray() == true.

I'll simplify to:

             // 'cb' is a newly allocated CharBuffer with 'pos == 0',
             // 'arrayOffset == 0', backed by an array.
             CharBuffer cb = decoder().decode(ByteBuffer.wrap(a, off, 
end - off));
             int limit = cb.limit();
             char[] decoded = cb.array();
             for (int i = 0; i < limit; i++) {
                 h = 31 * h + decoded[i];
             }
             if (limit > 0 && decoded[limit - 1] != '/') {
                 h = 31 * h + '/';
             }

An assert seems like overkill.

> 
> Zipcoder.get() seems to be the only remaining if block without braces.
> Maybe you'll wnat to fix that once your on it?
> 
>      public static ZipCoder get(Charset charset) {
>          if (charset == UTF_8.INSTANCE)
>              return UTF8;
>          return new ZipCoder(charset);

Sure.

/Claes

> 
> Thumbs up from my side. There's no need for a new webrev from my side.
> 
> Best regards,
> Volker
> 
>> Testing: tier1-4
>>
>> Thanks!
>>
>> /Claes
>>
>> [1]
>> Baseline:
>> Benchmark                           (size) Mode Cnt   Score   Error Units
>> ZipFileGetEntry.getEntryHit              512  avgt   15  126.264 ± 5.297
>>    ns/op
>> ZipFileGetEntry.getEntryHit             1024  avgt   15  130.823 ± 7.212
>>    ns/op
>> ZipFileGetEntry.getEntryHitUncached      512  avgt   15  152.149 ± 4.978
>>    ns/op
>> ZipFileGetEntry.getEntryHitUncached     1024  avgt   15  151.527 ± 4.054
>>    ns/op
>>
>> open.01:
>> Benchmark                             (size)  Mode  Cnt    Score   Error
>>    Units
>> ZipFileGetEntry.getEntryHit              512  avgt   15   84.450 ± 5.474
>>    ns/op
>> ZipFileGetEntry.getEntryHit             1024  avgt   15   85.224 ± 3.776
>>    ns/op
>> ZipFileGetEntry.getEntryHitUncached      512  avgt   15  140.448 ± 4.667
>>    ns/op
>> ZipFileGetEntry.getEntryHitUncached     1024  avgt   15  145.046 ± 7.363
>>
>> [2] I stopped short of taking the cleanup a step further by decoding to
>> String even in initCEN, which sadly isn't performance neutral:
>>
>> http://cr.openjdk.java.net/~redestad/8243469/open.01.init_decode/
>>
>> Something for the future to consider, maybe.