RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v5]

Sat Jun 19 00:49:38 UTC 2021

On Fri, 18 Jun 2021 22:12:11 GMT, Scott Gibbons <github.com+6704669+asgibbons at openjdk.org> wrote:

>> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. Also allows for performance improvement for non-AVX-512 enabled platforms. Due to the nature of MIME-encoded inputs, modify the intrinsic signature to accept an additional parameter (isMIME) for fast-path MIME decoding.
>> 
>> A change was made to the signature of DecodeBlock in Base64.java to provide the intrinsic information as to whether MIME decoding was being done.  This allows for the intrinsic to bypass the expensive setup of zmm registers from AVX tables, knowing there may be invalid Base64 characters every 76 characters or so.  A change was also made here removing the restriction that the intrinsic must return an even multiple of 3 bytes decoded.  This implementation handles the pad characters at the end of the string and will return the actual number of characters decoded.
>> 
>> The AVX portion of this code will decode in blocks of 256 bytes per loop iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The non-AVX code is an assembly-optimized version of the java DecodeBlock and behaves identically.
>> 
>> Running the Base64Decode benchmark, this change increases decode performance by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers are given in the table below.
>> 
>> **Base Score** is without intrinsic support, **Optimized Score** is using this intrinsic, and **Gain** is **Base** / **Optimized**.
>> 
>> 
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
>> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
>> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
>> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
>> testBase64MIMEDecode size 20000 | 70484.09 | 64940.77 | 1.09
>> testBase64MIMEDecode size 50000 | 191732.34 | 158158.95 | 1.21
>> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
>> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
>> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
>> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
>> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
>> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
>> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
>> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
>> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
>> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
>> testBase64WithErrorInputsDecode size   20000 | 1398.44 | 1138.17 | 1.23
>> testBase64WithErrorInputsDecode size   50000 | 1409.41 | 1114.16 | 1.26
>
> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Added comments.  Streamlined flow for decode.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6004:

> 6002:       __ BIND(L_continue);
> 6003: 
> 6004:       __ vpxor(errorvec, errorvec, errorvec, Assembler::AVX_512bit);

Why clearing errorvec is needed here?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6023:

> 6021:       __ evmovdquq(tmp16_op3, pack16_op, Assembler::AVX_512bit);
> 6022:       __ evmovdquq(tmp16_op2, pack16_op, Assembler::AVX_512bit);
> 6023:       __ evmovdquq(tmp16_op1, pack16_op, Assembler::AVX_512bit);

Why do we need 3 additional copies of pack16_op?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6026:

> 6024:       __ evmovdquq(tmp32_op3, pack32_op, Assembler::AVX_512bit);
> 6025:       __ evmovdquq(tmp32_op2, pack32_op, Assembler::AVX_512bit);
> 6026:       __ evmovdquq(tmp32_op1, pack32_op, Assembler::AVX_512bit);

Why do we need 3 additional copies of pack32_op?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6051:

> 6049:       __ vpternlogd(t0, 0xfe, input1, input2, Assembler::AVX_512bit);
> 6050: 
> 6051:       __ vpternlogd(t1, 0xfe, translated0, translated1, Assembler::AVX_512bit);

The comment here could be something like below for easy understanding:
// OR all of the inputs and translations together ...
// Here t0 has input0 and t1 has input3

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6055:

> 6053:       __ vpternlogd(t2, 0xfe, translated3, t0, Assembler::AVX_512bit);
> 6054:       __ evmovdquq(errorvec, t0, Assembler::AVX_512bit);
> 6055:       __ vpternlogd(errorvec, 0xfe, t1, t2, Assembler::AVX_512bit);

Could this be simplified as:
__ vpternlogd(t0, 0xfe, translated2, translated3, Assembler::AVX_512bit);
__ vpor(errorvec, t0, t1, Assembler::AVX_512bit);

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6060:

> 6058:       __ evpmovb2m(k3, errorvec, Assembler::AVX_512bit);
> 6059:       __ kortestql(k3, k3);
> 6060:       __ vpxor(errorvec, errorvec, errorvec, Assembler::AVX_512bit);

Why clearing errorvec is needed here? Seems to be not necessary for the 256 byte processing loop.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6069:

> 6067:       __ vpmaddubsw(merge_ab_bc0, translated0, tmp16_op3, Assembler::AVX_512bit);
> 6068:       __ vpmaddubsw(merge_ab_bc1, translated1, tmp16_op2, Assembler::AVX_512bit);
> 6069:       __ vpmaddubsw(merge_ab_bc2, translated2, tmp16_op1, Assembler::AVX_512bit);

Could we not use pack16_op directly here instead of its copies tmp16_*?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6078:

> 6076:       __ vpmaddwd(merged0, merge_ab_bc0, tmp32_op2, Assembler::AVX_512bit);
> 6077:       __ vpmaddwd(merged1, merge_ab_bc1, tmp32_op1, Assembler::AVX_512bit);
> 6078:       __ vpmaddwd(merged2, merge_ab_bc2, tmp32_op3, Assembler::AVX_512bit);

Could we not use pack32_op directly here instead of its copies tmp32_*?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6086:

> 6084:       __ evpermt2b(arr01, join01, merged1, Assembler::AVX_512bit);
> 6085:       __ evpermt2b(arr12, join12, merged2, Assembler::AVX_512bit);
> 6086:       __ evpermt2b(arr23, join23, merged3, Assembler::AVX_512bit);

arr01 is same as merged0, arr12 is same as merged1, arr23 is same as merged2.
So the above will be easy to understand if coded as below:
__ evpermt2b(merged0, join01, merged1, Assembler::AVX_512bit);
__ evpermt2b(merged1, join12, merged2, Assembler::AVX_512bit);
__ evpermt2b(merged2, join23, merged3, Assembler::AVX_512bit);

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6091:

> 6089:       __ evmovdquq(Address(dest, dp, Address::times_1, 0x00), arr01, Assembler::AVX_512bit);
> 6090:       __ evmovdquq(Address(dest, dp, Address::times_1, 0x40), arr12, Assembler::AVX_512bit);
> 6091:       __ evmovdquq(Address(dest, dp, Address::times_1, 0x80), arr23, Assembler::AVX_512bit);

Here you can directly used the merged0, merged1 and merged2 inplace of arr01, arr12 and arr23 respectively.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6122:

> 6120:       __ evpermt2b(translated0, input0, lookup_hi, Assembler::AVX_512bit);
> 6121: 
> 6122:       __ vpternlogd(errorvec, 0xfe, translated0, input0, Assembler::AVX_512bit);

This could be a simple vpor:
__ vpor(errorvec, translated0, input0, Assembler::AVX_512bit);

-------------

PR: https://git.openjdk.java.net/jdk/pull/4368