RFR: 8361582: AArch64: Some ConH values cannot be replicated with SVE [v7]

Tue Aug 19 08:21:54 UTC 2025

On Mon, 18 Aug 2025 13:08:06 GMT, Andrew Haley <aph at openjdk.org> wrote:

> There's something that I still do not understand.
> 
> In your tests I see this:
> 
> ```
>     // For vectorizable loops containing FP16 operations with an FP16 constant as one of the inputs, the IR
>     // node `(dst (Replicate con))` is generated to broadcast the constant into all lanes of an SVE register.
>     // On SVE-capable hardware with vector length > 16B, if the FP16 immediate is a signed value within the
>     // range [-128, 127] or a signed multiple of 256 in the range [-32768, 32512] for element widths of
>     // 16 bits or higher then the backend should generate the "replicateHF_imm_gt128b" machnode.
> ```
> 
> Why is this restricted to special constants? You should be able to do this with any value by generating `mov rtemp, #n; dup zn.h, rtemp`. There's no need to generate `mov rtemp, #n; fmov stemp, rtemp; dup zn.h, stemp`

This test does not exercise the `mov`/`fmov` instructions, only the `dup` instructions. The current code still generates `dup zn.h, #imm` for valid immediates and `dup zn.h, hn` for invalid immediates, and that is what the JTREG test case checks (I only optimized `loadConH`, not the `replicateHF*` backend nodes).

For the binary FP16 add loop in the test case, the compiler generates the `loadConH` node, which emits `mov rtemp, #n; fmov stemp, rtemp` (as we discussed earlier). That value is consumed by a few scalar iterations of the loop, which expect the input in an FPR; this is why the `fmov` is needed. When the vectorized code for the loop is eventually emitted, a `dup` instruction is generated (either `dup zn.h, #imm` or `dup zn.h, hn`), and that is what this JTREG test checks. I think it is better to keep separate `dup` forms for valid and invalid immediates, because there could be cases where the immediate is valid and `loadConH` does not need to be generated at all (for example, a pure vector loop with no scalar iterations), in which case it makes sense to emit `dup zn.h, #imm` instead.

To be clear, here is the disassembly for the invalid case:

  0x0000e1e1a462c410:   mov     w8, #0x40b                      // #1035
  0x0000e1e1a462c414:   fmov    s16, w8
....
  0x0000e1e1a462c44c:   ldrsh   w15, [x14, #16]
  0x0000e1e1a462c450:   mov     v17.h[0], w15
  0x0000e1e1a462c454:   fadd    h18, h17, h16
  0x0000e1e1a462c458:   smov    x29, v18.h[0]
....
  0x0000e1e1a462c4e4:   mov     z17.h, p7/m, h16
....
  0x0000e1e1a462c500:   ld1h    {z18.h}, p7/z, [x15]
  0x0000e1e1a462c504:   fadd    z18.h, z18.h, z17.h
  0x0000e1e1a462c508:   add     x13, x16, x13
  0x0000e1e1a462c50c:   add     x15, x13, #0x10
  0x0000e1e1a462c510:   st1h    {z18.h}, p7, [x15]
  0x0000e1e1a462c514:   add     x15, x14, #0x30
  0x0000e1e1a462c518:   ld1h    {z18.h}, p7/z, [x15]
  0x0000e1e1a462c51c:   fadd    z18.h, z18.h, z17.h
  0x0000e1e1a462c520:   add     x15, x13, #0x30
  0x0000e1e1a462c524:   st1h    {z18.h}, p7, [x15]
.....

And for the valid case:

  0x0000ff120d02bf28:   orr     w8, wzr, #0x400
  0x0000ff120d02bf2c:   fmov    s17, w8
...
  0x0000ff120d02bf6c:   ldrsh   w14, [x13, #16]
  0x0000ff120d02bf70:   mov     v16.h[0], w14
  0x0000ff120d02bf74:   fadd    h18, h16, h17
  0x0000ff120d02bf78:   smov    x13, v18.h[0]
  0x0000ff120d02bf7c:   add     x10, x18, x10
  0x0000ff120d02bf80:   strh    w13, [x10, #16]
....
  0x0000ff120d02bfa4:   mov     z16.h, #1024
....
  0x0000ff120d02bff0:   ld1h    {z18.h}, p7/z, [x13]
  0x0000ff120d02bff4:   fadd    z18.h, z18.h, z16.h
  0x0000ff120d02bff8:   add     x11, x18, x11
  0x0000ff120d02bffc:   add     x13, x11, #0x10
  0x0000ff120d02c000:   st1h    {z18.h}, p7, [x13]
  0x0000ff120d02c004:   add     x13, x12, #0x30
  0x0000ff120d02c008:   ld1h    {z18.h}, p7/z, [x13]
  0x0000ff120d02c00c:   fadd    z18.h, z18.h, z16.h
  0x0000ff120d02c010:   add     x13, x11, #0x30
  0x0000ff120d02c014:   st1h    {z18.h}, p7, [x13]
....
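
As a rough illustration of why the two listings above differ, the range rule from the quoted test comment can be written as a small predicate. This is only a sketch, not the actual backend code: the class name `SveDupImmediate` and the method `isValidDupImm16` are hypothetical, and the check simply restates "signed value in [-128, 127], or signed multiple of 256 in [-32768, 32512]" for a 16-bit lane.

```java
// Hypothetical helper, for illustration only; it is not part of the JDK sources.
final class SveDupImmediate {

    // Returns true if the 16-bit pattern can be broadcast with `dup zn.h, #imm`,
    // assuming the rule stated in the test comment: a signed 8-bit value,
    // or a signed multiple of 256 in the range [-32768, 32512].
    static boolean isValidDupImm16(short bits) {
        int v = bits; // sign-extend the raw FP16 bit pattern
        if (v >= -128 && v <= 127) {
            return true; // fits the unshifted signed 8-bit immediate
        }
        // Otherwise it must be a signed 8-bit immediate shifted left by 8.
        return v % 256 == 0 && v >= -32768 && v <= 32512;
    }

    public static void main(String[] args) {
        // 0x400 (1024) is the constant in the valid-case listing.
        System.out.println(isValidDupImm16((short) 0x400)); // true
        // 0x40b (1035) is the constant in the invalid-case listing.
        System.out.println(isValidDupImm16((short) 0x40b)); // false
    }
}
```

With this rule, 0x400 qualifies for the immediate form (`mov z16.h, #1024` above), while 0x40b does not and is broadcast from the FPR instead (`mov z17.h, p7/m, h16`).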

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26589#issuecomment-3199646701