RFR: 8325991: Accelerate Poly1305 on x86_64 using AVX2 instructions [v3]

Jatin Bhateja jbhateja at openjdk.org
Tue Feb 20 00:11:57 UTC 2024


On Thu, 15 Feb 2024 20:09:06 GMT, Srinivas Vamsi Parasa <duke at openjdk.org> wrote:

>> The goal of this PR is to accelerate the Poly1305 algorithm using AVX2 instructions (including IFMA) for x86_64 CPUs.
>> 
>> This implementation is directly based on the AVX2 Poly1305 hash computation as implemented in Intel(R) Multi-Buffer Crypto for IPsec Library (url: https://github.com/intel/intel-ipsec-mb/blob/main/lib/avx2_t3/poly_fma_avx2.asm)
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
> 
>   change overloaded C to use COEFF

src/hotspot/cpu/x86/assembler_x86.cpp line 5150:

> 5148: 
> 5149: void Assembler::vpmadd52luq(XMMRegister dst, XMMRegister src1, Address src2, bool merge, int vector_len) {
> 5150:   assert(VM_Version::supports_avxifma(), "");

Please add an assertion that vector_len is either 128-bit or 256-bit.
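
A minimal standalone sketch of the kind of check meant here. The enum values mirror the assembler's vector-length encoding as I recall it (AVX_128bit = 0, AVX_256bit = 1, AVX_512bit = 2); the helper names is_valid_avx_ifma_vector_len and check_vpmadd52_vector_len are hypothetical and not part of HotSpot. In the assembler itself this would presumably be a single assert on vector_len next to the existing supports_avxifma() check.

```cpp
#include <cassert>

// Assumed to mirror the x86 assembler's vector-length encoding.
enum AvxVectorLen {
  AVX_128bit = 0,
  AVX_256bit = 1,
  AVX_512bit = 2
};

// Hypothetical helper: the VEX-encoded AVX-IFMA forms exist only for
// 128-bit and 256-bit vectors, so anything else is rejected.
inline bool is_valid_avx_ifma_vector_len(int vector_len) {
  return vector_len == AVX_128bit || vector_len == AVX_256bit;
}

// Hypothetical stand-in for the start of an emitter routine: fail fast in
// debug builds when the caller passes an unsupported vector length.
inline void check_vpmadd52_vector_len(int vector_len) {
  assert(is_valid_avx_ifma_vector_len(vector_len) &&
         "AVX-IFMA (VEX) requires a 128-bit or 256-bit vector length");
}
```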

src/hotspot/cpu/x86/assembler_x86.cpp line 5152:

> 5150:   assert(VM_Version::supports_avxifma(), "");
> 5151:   InstructionMark im(this);
> 5152:   InstructionAttr attributes(vector_len, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ false, /* uses_vl */ true);

uses_vl should be false; it's a VEX-encoded instruction.

src/hotspot/cpu/x86/assembler_x86.cpp line 5155:

> 5153:   if (merge) {
> 5154:     attributes.reset_is_clear_context();
> 5155:   }

As of now, merge semantics are only applicable to AVX512 instructions accepting an opmask register.
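
For reference, a small scalar model of what merge versus zeroing masking means for an opmask-controlled instruction. This is only an illustration of the semantics; masked_add, Vec and kLanes are hypothetical names, and a plain lane-wise add stands in for the real operation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 4;  // four 64-bit lanes, as in a 256-bit vector
using Vec = std::array<std::uint64_t, kLanes>;

// Hypothetical model of a masked lane-wise add.
//   mask bit i set   -> lane i receives a[i] + b[i]
//   mask bit i clear -> lane i keeps dst[i] (merge) or becomes 0 (zeroing)
inline Vec masked_add(const Vec& dst, const Vec& a, const Vec& b,
                      unsigned mask, bool merge) {
  assert(mask < (1u << kLanes));  // only the low kLanes mask bits are meaningful
  Vec out{};
  for (std::size_t i = 0; i < kLanes; ++i) {
    if (mask & (1u << i)) {
      out[i] = a[i] + b[i];
    } else {
      out[i] = merge ? dst[i] : 0;
    }
  }
  return out;
}
```

Without an opmask operand every lane is written, so there is nothing for a merge flag to select between.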

src/hotspot/cpu/x86/assembler_x86.cpp line 5167:

> 5165: 
> 5166: void Assembler::vpmadd52luq(XMMRegister dst, XMMRegister src1, XMMRegister src2, bool merge, int vector_len) {
> 5167:   assert(VM_Version::supports_avxifma(), "");

Please add the vector length assertion here as well, same as above.

src/hotspot/cpu/x86/assembler_x86.cpp line 5221:

> 5219:     attributes.reset_is_clear_context();
> 5220:   }
> 5221: 

The above comments also apply to this routine.

src/hotspot/cpu/x86/stubGenerator_x86_64_poly.cpp line 1367:

> 1365: 
> 1366:   // VECTOR LOOP: process 4 * 16-byte message blocks at a time
> 1367:   __ bind(L_process256Loop);

Please add appropriate alignment at the beginning of the vector loop.
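
Aligning the loop head means padding the code stream so that the bound label lands on a fixed boundary (for example 16 or 32 bytes). The sketch below only illustrates the padding arithmetic; padding_to_align is a hypothetical helper, and in the stub generator this would be done through the macro assembler's existing alignment support just before the bind.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of padding bytes (e.g. NOPs) needed so that
// code currently at `offset` starts on an `alignment`-byte boundary.
// `alignment` must be a non-zero power of two.
inline std::size_t padding_to_align(std::size_t offset, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  const std::size_t mask = alignment - 1;
  // (alignment - offset % alignment) % alignment, written with bit masks.
  return (alignment - (offset & mask)) & mask;
}
```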

src/hotspot/cpu/x86/vm_version_x86.hpp line 302:

> 300:     uint32_t value;
> 301:   };
> 302: 

We can avoid creating the additional structures SefCpuid7Ecx1Ebx, SefCpuid7Ecx1Ecx and SefCpuid7Ecx1Edx. They occupy space on the stack, and none of their bits are used.

src/hotspot/cpu/x86/vm_version_x86.hpp line 482:

> 480:     SefCpuid7Ecx1Ebx sef_cpuid7_ecx1_ebx;
> 481:     SefCpuid7Ecx1Ecx sef_cpuid7_ecx1_ecx;
> 482:     SefCpuid7Ecx1Edx sef_cpuid7_ecx1_edx;

Please remove SefCpuid7Ecx1Ebx, SefCpuid7Ecx1Ecx and SefCpuid7Ecx1Edx field definitions.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1493768592
PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1493768466
PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1493768841
PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1493769528
PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1493770005
PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1495089745
PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1493760215
PR Review Comment: https://git.openjdk.org/jdk/pull/17881#discussion_r1493760487

