RFR: 8268276: Base64 Decoding optimization for x86 using AVX-512 [v3]

Wed Jun 9 00:59:19 UTC 2021

On Tue, 8 Jun 2021 00:30:38 GMT, Scott Gibbons <github.com+6704669+asgibbons at openjdk.org> wrote:

>> Add the Base64 Decode intrinsic for x86 to utilize AVX-512 for acceleration. Also allows for performance improvement for non-AVX-512 enabled platforms. Due to the nature of MIME-encoded inputs, modify the intrinsic signature to accept an additional parameter (isMIME) for fast-path MIME decoding.
>> 
>> A change was made to the signature of DecodeBlock in Base64.java to provide the intrinsic information as to whether MIME decoding was being done.  This allows for the intrinsic to bypass the expensive setup of zmm registers from AVX tables, knowing there may be invalid Base64 characters every 76 characters or so.  A change was also made here removing the restriction that the intrinsic must return an even multiple of 3 bytes decoded.  This implementation handles the pad characters at the end of the string and will return the actual number of characters decoded.
>> 
>> The AVX portion of this code will decode in blocks of 256 bytes per loop iteration, then in chunks of 64 bytes, followed by end fixup decoding.  The non-AVX code is an assembly-optimized version of the java DecodeBlock and behaves identically.
>> 
>> Running the Base64Decode benchmark, this change increases decode performance by an average of 2.6x with a maximum 19.7x for buffers > ~20k.  The numbers are given in the table below.
>> 
>> **Base Score** is without intrinsic support, **Optimized Score** is using this intrinsic, and **Gain** is **Base** / **Optimized**.
>> 
>> 
>> Benchmark Name | Base Score | Optimized Score | Gain
>> -- | -- | -- | --
>> testBase64Decode size 1 | 15.36 | 15.32 | 1.00
>> testBase64Decode size 3 | 17.00 | 16.72 | 1.02
>> testBase64Decode size 7 | 20.60 | 18.82 | 1.09
>> testBase64Decode size 32 | 34.21 | 26.77 | 1.28
>> testBase64Decode size 64 | 54.43 | 38.35 | 1.42
>> testBase64Decode size 80 | 66.40 | 48.34 | 1.37
>> testBase64Decode size 96 | 73.16 | 52.90 | 1.38
>> testBase64Decode size 112 | 84.93 | 51.82 | 1.64
>> testBase64Decode size 512 | 288.81 | 32.04 | 9.01
>> testBase64Decode size 1000 | 560.48 | 40.79 | 13.74
>> testBase64Decode size 20000 | 9530.28 | 483.37 | 19.72
>> testBase64Decode size 50000 | 24552.24 | 1735.07 | 14.15
>> testBase64MIMEDecode size 1 | 22.87 | 21.36 | 1.07
>> testBase64MIMEDecode size 3 | 27.79 | 25.32 | 1.10
>> testBase64MIMEDecode size 7 | 44.74 | 43.81 | 1.02
>> testBase64MIMEDecode size 32 | 142.69 | 129.56 | 1.10
>> testBase64MIMEDecode size 64 | 256.90 | 243.80 | 1.05
>> testBase64MIMEDecode size 80 | 311.60 | 310.80 | 1.00
>> testBase64MIMEDecode size 96 | 364.00 | 346.66 | 1.05
>> testBase64MIMEDecode size 112 | 472.88 | 394.78 | 1.20
>> testBase64MIMEDecode size 512 | 1814.96 | 1671.28 | 1.09
>> testBase64MIMEDecode size 1000 | 3623.50 | 3227.61 | 1.12
>> testBase64MIMEDecode size 20000 | 70484.09 | 64940.77 | 1.09
>> testBase64MIMEDecode size 50000 | 191732.34 | 158158.95 | 1.21
>> testBase64WithErrorInputsDecode size 1 | 1531.02 | 1185.19 | 1.29
>> testBase64WithErrorInputsDecode size 3 | 1306.59 | 1170.99 | 1.12
>> testBase64WithErrorInputsDecode size 7 | 1238.11 | 1176.62 | 1.05
>> testBase64WithErrorInputsDecode size 32 | 1346.46 | 1138.47 | 1.18
>> testBase64WithErrorInputsDecode size 64 | 1195.28 | 1172.52 | 1.02
>> testBase64WithErrorInputsDecode size 80 | 1469.00 | 1180.94 | 1.24
>> testBase64WithErrorInputsDecode size 96 | 1434.48 | 1167.74 | 1.23
>> testBase64WithErrorInputsDecode size 112 | 1440.06 | 1162.56 | 1.24
>> testBase64WithErrorInputsDecode size 512 | 1362.79 | 1193.42 | 1.14
>> testBase64WithErrorInputsDecode size 1000 | 1426.07 | 1194.44 | 1.19
>> testBase64WithErrorInputsDecode size   20000 | 1398.44 | 1138.17 | 1.23
>> testBase64WithErrorInputsDecode size   50000 | 1409.41 | 1114.16 | 1.26
>
> Scott Gibbons has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Fixing review comments.  Adding notes about isMIME parameter for other architectures; clarifying decodeBlock comments.

@asgibbons Thanks a lot for contributing this. The performance gain is impressive. I have some minor comments. Please take a look.

src/hotspot/cpu/x86/assembler_x86.cpp line 4555:

> 4553: void Assembler::evpmaddubsw(XMMRegister dst, XMMRegister src1, XMMRegister src2, int vector_len) {
> 4554:   assert(VM_Version::supports_avx512bw(), "");
> 4555:   InstructionAttr attributes(vector_len, /* rex_w */ false, /* legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ true, /* uses_vl */ true);

This instruction is also supported on AVX platforms. The assert check could be as follows:
  assert(vector_len == AVX_128bit? VM_Version::supports_avx() :
             vector_len == AVX_256bit? VM_Version::supports_avx2() :
             vector_len == AVX_512bit? VM_Version::supports_avx512bw() : 0, "");
Accordingly the instruction could be named as vpmaddubsw.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5688:

> 5686:   address base64_vbmi_lookup_lo_addr() {
> 5687:     __ align(64, (unsigned long) __ pc());
> 5688:     StubCodeMark mark(this, "StubRoutines", "lookup_lo");

It will be good to add base64 to the StubCodeMark name for this and all the tables.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5983:

> 5981:     // calculate length from offsets
> 5982:     __ movq(length, end_offset);
> 5983:     __ subq(length, start_offset);

These are 32bit, so movl, subl instead of movq, subq. Similar for all length relates instructions below.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 5987:

> 5985: 
> 5986:     // If AVX512 VBMI not supported, just compile non-AVX code
> 5987:     if(VM_Version::supports_avx512_vbmi()) {

Need to also check for VM_Version::supports_avx512bw() support.
Could you please check if VM_Version::supports_avx512dq is needed as well?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6134:

> 6132:       __ subq(length, 64);
> 6133:       __ addq(source, 64);
> 6134:       __ addq(dest, 48);

All address related instructions here and below could use addptr, subptr etc.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6273:

> 6271: 
> 6272:     __ shrq(length, 2);    // Multiple of 4 bytes only - length is # 4-byte chunks
> 6273:     __ cmpq(length, 0);

Should these be shrl, cmpl?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6278:

> 6276:     // Set up src and dst pointers properly
> 6277:     __ addq(source, start_offset);     // Initial offset
> 6278:     __ addq(dest, dp);

The convention is to use addptr for pointers.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6284:

> 6282:     __ shll(isURL, 8);    // index into decode table based on isURL
> 6283:     __ lea(decode_table, ExternalAddress(StubRoutines::x86::base64_decoding_table_addr()));
> 6284:     __ addq(decode_table, isURL);

addptr here.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6297:

> 6295:     __ orl(byte1, byte4);
> 6296: 
> 6297:     __ incrementq(source, 4);

addptr here.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6317:

> 6315:     __ load_signed_byte(byte4, Address(source, RegisterOrConstant(), Address::times_1, 3));
> 6316:     __ load_signed_byte(byte3, Address(decode_table, byte3, Address::times_1, 0));
> 6317:     __ load_signed_byte(byte4, Address(decode_table, byte4, Address::times_1, 0));

You could use Address(base, offset) form directly here and other places: e.g. Address (source, 1) instead of Address(source, RegisterOrConstant(), Address::times_1, 1).

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 6329:

> 6327:     __ subq(dest, rax);      // Number of bytes converted
> 6328:     __ movq(rax, dest);
> 6329:     __ pop(rbx);

subptr, movptr here.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 7627:

> 7625:       StubRoutines::x86::_right_shift_mask = base64_right_shift_mask_addr();
> 7626:       StubRoutines::_base64_encodeBlock = generate_base64_encodeBlock();
> 7627:       if (VM_Version::supports_avx512_vbmi()) {

Need to add avx512bw check here also.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 7628:

> 7626:       StubRoutines::_base64_encodeBlock = generate_base64_encodeBlock();
> 7627:       if (VM_Version::supports_avx512_vbmi()) {
> 7628:         StubRoutines::x86::_lookup_lo = base64_vbmi_lookup_lo_addr();

It would be good to add base64 to these names.

-------------

PR: https://git.openjdk.java.net/jdk/pull/4368