RFR: JDK-8300808: Accelerate Base64 on x86 for AVX2 [v6]

Fri Jan 27 22:24:23 UTC 2023

On Fri, 27 Jan 2023 18:31:50 GMT, Scott Gibbons <duke at openjdk.org> wrote:

>> Added code for Base64 acceleration (encode and decode) which will accelerate ~4x for AVX2 platforms.
>> 
>> Encode performance:
>> **Old:**
>> 
>> Benchmark                      (maxNumBytes)   Mode  Cnt     Score   Error   Units
>> Base64Encode.testBase64Encode           1024  thrpt    3  4309.439 ± 2.632  ops/ms
>> 
>> 
>> **New:**
>> 
>> Benchmark                      (maxNumBytes)   Mode  Cnt      Score     Error   Units
>> Base64Encode.testBase64Encode           1024  thrpt    3  24211.397 ± 102.026  ops/ms
>> 
>> 
>> Decode performance:
>> **Old:**
>> 
>> Benchmark                      (errorIndex)  (lineSize)  (maxNumBytes)   Mode  Cnt     Score    Error   Units
>> Base64Decode.testBase64Decode           144           4           1024  thrpt    3  3961.768 ± 93.409  ops/ms
>> 
>> **New:**
>> Benchmark                      (errorIndex)  (lineSize)  (maxNumBytes)   Mode  Cnt      Score    Error   Units
>> Base64Decode.testBase64Decode           144           4           1024  thrpt    3  14738.051 ± 24.383  ops/ms
>
> Scott Gibbons has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 13 additional commits since the last revision:
> 
>  - Merge branch 'openjdk:master' into Base64-AVX2
>  - Merge branch 'openjdk:master' into Base64-AVX2
>  - Merge branch 'openjdk:master' into Base64-AVX2
>  - Merge branch 'Base64-AVX2' of https://github.com/asgibbons/jdk into Base64-AVX2
>  - Merge branch 'openjdk:master' into Base64-AVX2
>  - Address review comment
>  - Remove whitespace
>  - Fix wrong register usage
>  - Working version of Base64 decode with AVX2 (4x perf improvement). No URL support
>  - Merge branch 'Base64-AVX2' of https://github.com/asgibbons/jdk into Base64-AVX2
>  - ... and 3 more: https://git.openjdk.org/jdk/compare/50dc1196...3e66f7be

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2159:

> 2157: }
> 2158: 
> 2159: address StubGenerator::base64_AVX2_tables_addr() {

A casual reader might wonder why there's 3 other avx2-tables and then this, so for readability it'd be nice to add "decode" to the name of this new table.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2643:

> 2641:     // Handle isURL / MIME?!?!  r12 is used for length calculation (from out)
> 2642:     //
> 2643:     // rbx is out, r12 is saved out, rdx is size, rsi is src

It seems that on windows `r12` is in use, see line 2323. GHA seem to be having some trouble finishing Windows testing on time - could there be some issue here?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2647:

> 2645:     ////////////////////////////////////////////
> 2646: 
> 2647:     // *************** NEEDS TO BE FIXED ************

While it's fine to leave implementation of `getMimeDecoder` and `getUrlDecoder` for a follow-up, I think these comments needs to be cleaned up. E.g. a follow-up RFE filed and referenced.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2651:

> 2649:     __ jcc(Assembler::notZero, L_tailProc);
> 2650: 
> 2651:     __ cmpl(length, 44);

Perform `length` checks first to avoid unnecessary branches on small inputs?

Ideal might be to move this length check up just before the `_cmpl(length, 128)` in the AVX-512 block, so that if `AVX=3` short inputs branch directly to the scalar tail procedure without jumping around. This might also apply to the encode stub, though that's pre-existing.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 2673:

> 2671:     __ vmovdqu(xmm13, Address(r13, 0xc8));
> 2672:     __ vmovdqu(xmm12, Address(r13, 0x08));
> 2673:     __ jmp(L_enterLoop);

This got me curious (sorry!) and maybe there's something going on here that I'm not getting... But why have you split the loop apart and jump into the middle of it? It'd seem not doing this would be more straightforward, with no difference semantically and one less jump? (align32 should just translate to a narrow nop instruction, if anything)

-------------

PR: https://git.openjdk.org/jdk/pull/12126