RFR: 8338126 : C2 SuperWord: VectorCastF2HF / vcvtps2ph produces wrong results for vector length 2

Mon Oct 14 18:38:12 UTC 2024

On Mon, 14 Oct 2024 12:23:25 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> When Float.floatToFloat16 is vectorized with a 2-element vector width due to dependencies, we incorrectly generate a 4-element vcvtps2ph with a memory destination, which stores 8 bytes instead of the desired 4 bytes.  This PR fixes the issue by limiting the memory version of the match rule to vectors of 4 elements and above.
>> A regression test case is also added.
>> 
>> Best Regards,
>> Sandhya
>
> src/hotspot/cpu/x86/x86.ad line 3679:
> 
>> 3677: 
>> 3678: instruct vconvF2HF_mem_reg(memory mem, vec src) %{
>> 3679:   predicate(Matcher::vector_length_in_bytes(n->in(3)->in(1)) >= 16);
> 
> Ok, and so what alternative path is the matcher now going to take if we have `vector_length = 8`?

@eme64 The matcher now loads the vector from memory into a register, uses the register/register version for the conversion, and then stores the result to memory with a separate instruction.  Please see the before and after code snippets below.

Generated code for a 2-element float vector to float16 vector conversion:
Before:
      vmovq  0x10(%rdx,%rbx,4),%xmm2             ; load 8 bytes from memory into xmm2 (correct)
      vcvtps2ph $0x4,%xmm2,0x10(%rsi,%rbx,2)     ; convert to float16 and store 8 bytes to memory (incorrect, should be 4 bytes)

After:
      vmovq  0x10(%rdx,%rbx,4),%xmm15            ; load 8 bytes from memory into xmm15 (correct)
      vcvtps2ph $0x4,%xmm15,%xmm0                ; convert to float16 into register (correct)
      vmovd  %xmm0,0x10(%rsi,%rbx,2)             ; store 4 bytes to memory (correct)
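
To make the difference in store width concrete, here is a small Java model of the two sequences above. It is only a sketch and not HotSpot code: the class and method names (`StoreWidthSketch`, `loadTwoFloats`, `storeFourLanes`, `storeTwoLanes`, `run`) and the `0x1111` sentinel are made up for illustration, and it assumes JDK 20 or later for `Float.floatToFloat16`. It models vmovq as filling the two low lanes and zeroing the upper two, so the 8-byte store writes two extra float16 zeros past the intended elements.

```java
public class StoreWidthSketch {
    // Models vmovq: 2 floats go into the low lanes of a 4-lane register,
    // the upper lanes are zeroed.
    static float[] loadTwoFloats(float[] src, int i) {
        return new float[] { src[i], src[i + 1], 0.0f, 0.0f };
    }

    // Models the memory-destination form: all 4 lanes are converted
    // and 4 shorts (8 bytes) are written.
    static void storeFourLanes(float[] reg, short[] dst, int i) {
        for (int lane = 0; lane < 4; lane++) {
            dst[i + lane] = Float.floatToFloat16(reg[lane]);
        }
    }

    // Models the register conversion followed by vmovd: only the low
    // 2 shorts (4 bytes) reach memory.
    static void storeTwoLanes(float[] reg, short[] dst, int i) {
        for (int lane = 0; lane < 2; lane++) {
            dst[i + lane] = Float.floatToFloat16(reg[lane]);
        }
    }

    // Converts one 2-element vector at index 0 into a destination that is
    // pre-filled with a sentinel, using either the wide or the narrow store.
    static short[] run(boolean wideStore) {
        float[] src = { 1.0f, 2.0f };
        short[] dst = { 0x1111, 0x1111, 0x1111, 0x1111 };
        float[] reg = loadTwoFloats(src, 0);
        if (wideStore) {
            storeFourLanes(reg, dst, 0);
        } else {
            storeTwoLanes(reg, dst, 0);
        }
        return dst;
    }

    public static void main(String[] args) {
        System.out.println("8-byte store: " + java.util.Arrays.toString(run(true)));
        System.out.println("4-byte store: " + java.util.Arrays.toString(run(false)));
    }
}
```

In this model the wide store overwrites the two sentinel values that follow the converted pair, while the narrow store leaves them untouched, which is the behavior the new predicate is meant to guarantee for 2-element vectors.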

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/21480#discussion_r1799938212