RFR: 8323116: [REDO] Computational test more than 2x slower when AVX instructions are used [v4]

Fri Apr 5 02:34:10 UTC 2024

On Fri, 5 Apr 2024 00:09:00 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:

>> Similar to #18089, the purpose of this change is to remove the slowdown due to false dependency. For example, using the current `(dst, dst, src)` encoding in the case of `VCVTSD2SS xmm1, xmm2, xmm3/m64`, the instruction converts one double precision floating-point value in xmm3/m64 to one single precision floating-point value and **merge with high bits in xmm2**. This merge with high bits of xmm2 causes a false dependency as xmm1 and xmm2 are the same in `(dst, dst, src)` encoding.
>> 
>> We are removing the false dependency by (1) removing the m64 source in the VCVTSD2SS instruction encoding in the .ad file, and (2) loading the `m64` source into `src` before calling `VCVTSD2SS`, explicitly zeroing out the high bits in `src` using `vmovsd src, m64`, and then calling `VCVTSD2SS dst, src, src`. Thus `dst[0:63]` now gets the result of the convert operation from `src[0:63]` and, since `src[64:127]` is already zeroed out, it's put in `dst[64:127]` without a false dependency.
>> 
>> Thanks,
>> Vamsi
>
> Thank you for explaining.

> Similar to #18089, the purpose of this change is to remove the slowdown due to false dependency. For example, using the current `(dst, dst, src)` encoding in the case of `VCVTSD2SS xmm1, xmm2, xmm3/m64`, the instruction converts one double precision floating-point value in xmm3/m64 to one single precision floating-point value and **merge with high bits in xmm2**. This merge with high bits of xmm2 causes a false dependency as xmm1 and xmm2 are the same in `(dst, dst, src)` encoding.
> 
> We are removing the false dependency by (1) removing the m64 source in the VCVTSD2SS instruction encoding in the .ad file, and (2) loading the `m64` source into `src` before calling `VCVTSD2SS`, explicitly zeroing out the high bits in `src` using `vmovsd src, m64`, and then calling `VCVTSD2SS dst, src, src`. Thus `dst[0:63]` now gets the result of the convert operation from `src[0:63]` and, since `src[64:127]` is already

Hi Vamsi,
This is a downcast from a double precision to a single precision value, so only the lower 32 bits of the destination hold the actual conversion result. Bits 127:32 are copied from the first (non-destructive) source operand. For the VEX-encoded instruction, the bits above bit 127 are zeroed out, whereas the legacy SSE-encoded variant leaves all destination bits above bit 31 unmodified.

VCVTSD2SS (VEX.128 Encoded Version)
DEST[31:0] := Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC2[63:0]);
DEST[127:32] := SRC1[127:32]
DEST[MAXVL-1:128] := 0
CVTSD2SS (128-bit Legacy SSE Version)
DEST[31:0] := Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[63:0]);
(* DEST[MAXVL-1:32] Unmodified *)
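
As a rough illustration of the VEX.128 pseudocode above, here is a small Python model (not HotSpot code; the helper names `f64_to_lanes`, `f32_bits` and `vcvtsd2ss` are made up for this sketch). It treats a 128-bit register as four 32-bit lanes and ignores rounding modes, exceptions and bits above 127. It only shows what ends up in each lane when the same register is used for both sources, as in `VCVTSD2SS dst, src, src`.

```python
import struct

def f64_to_lanes(value):
    """Return the two 32-bit lanes [low, high] of a double's bit pattern."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    return [bits & 0xFFFFFFFF, bits >> 32]

def f32_bits(value):
    """Return the 32-bit pattern of value rounded to single precision."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def vcvtsd2ss(src1, src2):
    """Model of VEX.128 VCVTSD2SS on 4 x 32-bit lanes (lane 0 = bits 31:0)."""
    # SRC2[63:0] is reassembled from lanes 0 and 1 and read as a double.
    raw = (src2[1] << 32) | src2[0]
    value = struct.unpack("<d", struct.pack("<Q", raw))[0]
    # DEST[31:0] := converted value; DEST[127:32] := SRC1[127:32]
    return [f32_bits(value)] + src1[1:4]

# Model of `vmovsd src, m64`: load one double, bits 127:64 are zeroed.
src = f64_to_lanes(1.5) + [0, 0]

# Model of `VCVTSD2SS dst, src, src`.
dst = vcvtsd2ss(src, src)
print([hex(lane) for lane in dst])
```

In this model lane 0 holds the float result, lanes 2 and 3 are zero, but lane 1 (bits 63:32) still carries the upper half of the original double, because it is copied from the first source rather than written by the conversion. Whether that matters depends on whether anything downstream reads those bits.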

Your change can lead to incorrectness:

https://github.com/openjdk/jdk/blob/0b01144ecec1283adaaaf1a7f53d075a56f030ae/src/hotspot/cpu/x86/assembler_x86.cpp#L11764

> zeroed out, it's put in `dst[64:127]` without a false dependency.
> 
> Thanks, Vamsi

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18503#discussion_r1552692245