RFR: 8323116: [REDO] Computational test more than 2x slower when AVX instructions are used [v4]

Thu Apr 4 23:45:09 UTC 2024

On Wed, 3 Apr 2024 21:17:22 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:

>> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   fix L2F cvtsi2ssq
>
> src/hotspot/cpu/x86/assembler_x86.cpp line 2034:
> 
>> 2032:   InstructionAttr attributes(AVX_128bit, /* rex_w */ VM_Version::supports_evex(), /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ false);
>> 2033:   attributes.set_rex_vex_w_reverted();
>> 2034:   int encode = simd_prefix_and_encode(dst, src, src, VEX_SIMD_F2, VEX_OPCODE_0F, &attributes);
> 
> Can you explain this change?

Similar to #18089, the purpose of this change is to remove the slowdown caused by a false dependency. For example, with the current `(dst, dst, src)` encoding of `VCVTSD2SS xmm1, xmm2, xmm3/m64`, the instruction converts one double-precision floating-point value in xmm3/m64 to one single-precision floating-point value and **merges it with the high bits of xmm2**. This merge with the high bits of xmm2 causes a false dependency, because xmm1 and xmm2 are the same register in the `(dst, dst, src)` encoding.

We remove the false dependency by (1) removing the m64 source from the `VCVTSD2SS` instruction encoding in the .ad file, and (2) loading the `m64` source into `src` with `vmovsd src, m64`, which explicitly zeroes the high bits of `src`, and then calling `VCVTSD2SS dst, src, src`. Thus `dst[0:63]` now gets the result of the convert operation from `src[0:63]`, and since `src[64:127]` is already zeroed out, it is copied into `dst[64:127]` without a false dependency.
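
To illustrate the merge semantics described above, here is a minimal C++ sketch. It is a toy software model, not HotSpot code: the `Xmm` struct and the helpers `load_sd_zero_upper`, `vcvtsd2ss_model` and `low_float` are hypothetical names introduced only for this example. It assumes a little-endian host and only models which input bits reach the output; it does not measure the performance effect.

```cpp
#include <cstdint>
#include <cstring>

// Toy model of a 128-bit XMM register as four 32-bit lanes
// (lane 0 holds bits 0..31). Hypothetical type, for illustration only.
struct Xmm {
    uint32_t lane[4];
};

// Models a scalar double load from memory (as vmovsd does):
// the low 64 bits receive the value, the upper 64 bits are zeroed.
inline Xmm load_sd_zero_upper(double value) {
    Xmm r{};  // all lanes start at zero
    std::memcpy(r.lane, &value, sizeof value);  // fills lanes 0 and 1
    return r;
}

// Models the three-operand VCVTSD2SS dst, src1, src2:
// the low lane gets the double in src2's low 64 bits converted to float,
// and every other bit of the result is copied from src1.
inline Xmm vcvtsd2ss_model(const Xmm& src1, const Xmm& src2) {
    double d;
    std::memcpy(&d, src2.lane, sizeof d);
    float f = static_cast<float>(d);
    Xmm dst = src1;  // the merge: the result depends on src1's contents
    std::memcpy(dst.lane, &f, sizeof f);
    return dst;
}

// Reads the low lane of a register as a float.
inline float low_float(const Xmm& r) {
    float f;
    std::memcpy(&f, r.lane, sizeof f);
    return f;
}
```

In this model, `vcvtsd2ss_model(dst, src)` corresponds to the old `(dst, dst, src)` encoding, whose result carries over whatever was previously in `dst`, while `vcvtsd2ss_model(src, src)` corresponds to the new `(dst, src, src)` encoding, whose result is a function of `src` alone.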

Thanks,
Vamsi

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18503#discussion_r1552592854