RFR: 8361582: AArch64: Some ConH values cannot be replicated with SVE [v7]
Bhavana Kilambi
bkilambi at openjdk.org
Tue Aug 19 08:21:54 UTC 2025
On Mon, 18 Aug 2025 13:08:06 GMT, Andrew Haley <aph at openjdk.org> wrote:
> There's something that I still do not understand.
>
> In your tests I see this:
>
> ```
> // For vectorizable loops containing FP16 operations with an FP16 constant as one of the inputs, the IR
> // node `(dst (Replicate con))` is generated to broadcast the constant into all lanes of an SVE register.
> // On SVE-capable hardware with vector length > 16B, if the FP16 immediate is a signed value within the
> // range [-128, 127] or a signed multiple of 256 in the range [-32768, 32512] for element widths of
> // 16 bits or higher then the backend should generate the "replicateHF_imm_gt128b" machnode.
> ```
>
> Why is this restricted to special constants? You should be able to do this with any value by generating `mov rtemp, #n; dup zn.h, rtemp`. There's no need to generate `mov rtemp, #n; fmov stemp, rtemp; dup zn.h, stemp`
This test does not exercise the `mov`/`fmov` instructions, only the `dup` instructions. The current code still generates `dup zn.h, #imm` for valid immediates and `dup zn.h, hn` for invalid immediates, and that is what the JTREG test case checks (I only optimized `loadConH`, not the `replicateHF*` backend nodes).
For the binary FP16 add loop in the test case, the compiler generates a `loadConH` node, which emits `mov rtemp, #n; fmov stemp, rtemp` (as we discussed earlier). Its result is consumed by a few scalar iterations of the loop, which expect the input in an FPR; that is why the `fmov` is needed. When the vectorized code for the loop is eventually emitted, a `dup` instruction is generated (either `dup zn.h, #imm` or `dup zn.h, hn`), and that is what this JTREG test checks. I feel it is better to keep the `dup` forms separate for valid and invalid immediates, because there could be cases where the immediate is valid and `loadConH` does not need to be generated at all (for example, a pure vector loop with no scalar iterations). In that case it would make sense to emit `dup zn.h, #imm` directly.
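As a rough illustration of the valid/invalid split, here is a minimal Java sketch of the immediate rule quoted from the test comment above. The class and method names are hypothetical and this is not the HotSpot implementation. It also assumes the rule is applied to the raw 16-bit pattern of the FP16 constant read as a signed integer, which matches the disassembly below (`#1024` for 0x400).

```java
public class SveDupImmCheck {
    // Hypothetical helper, not a HotSpot function. Assumption: the rule from
    // the test comment is applied to the raw FP16 bit pattern interpreted as
    // a signed 16-bit integer.
    static boolean fitsSveDupImmediate(short fp16Bits) {
        int v = fp16Bits; // sign-extends the 16-bit pattern
        // Case 1: signed 8-bit value.
        if (v >= -128 && v <= 127) {
            return true;
        }
        // Case 2: signed multiple of 256 in [-32768, 32512].
        return (v & 0xFF) == 0 && v >= -32768 && v <= 32512;
    }

    public static void main(String[] args) {
        // 0x400 (1024) is a multiple of 256: expected to use dup zn.h, #imm.
        System.out.println(fitsSveDupImmediate((short) 0x400)); // prints true
        // 0x40b (1035) fits neither case: expected to use dup zn.h, hn.
        System.out.println(fitsSveDupImmediate((short) 0x40b)); // prints false
    }
}
```

The two constants in `main` are the ones that appear in the disassembly for the valid and invalid cases.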
To be clear, here is the disassembly for the invalid case (constant 0x40b):
0x0000e1e1a462c410: mov w8, #0x40b // #1035
0x0000e1e1a462c414: fmov s16, w8
....
0x0000e1e1a462c44c: ldrsh w15, [x14, #16]
0x0000e1e1a462c450: mov v17.h[0], w15
0x0000e1e1a462c454: fadd h18, h17, h16
0x0000e1e1a462c458: smov x29, v18.h[0]
....
0x0000e1e1a462c4e4: mov z17.h, p7/m, h16
....
0x0000e1e1a462c500: ld1h {z18.h}, p7/z, [x15]
0x0000e1e1a462c504: fadd z18.h, z18.h, z17.h
0x0000e1e1a462c508: add x13, x16, x13
0x0000e1e1a462c50c: add x15, x13, #0x10
0x0000e1e1a462c510: st1h {z18.h}, p7, [x15]
0x0000e1e1a462c514: add x15, x14, #0x30
0x0000e1e1a462c518: ld1h {z18.h}, p7/z, [x15]
0x0000e1e1a462c51c: fadd z18.h, z18.h, z17.h
0x0000e1e1a462c520: add x15, x13, #0x30
0x0000e1e1a462c524: st1h {z18.h}, p7, [x15]
.....
For the valid case (constant 0x400):
0x0000ff120d02bf28: orr w8, wzr, #0x400
0x0000ff120d02bf2c: fmov s17, w8
...
0x0000ff120d02bf6c: ldrsh w14, [x13, #16]
0x0000ff120d02bf70: mov v16.h[0], w14
0x0000ff120d02bf74: fadd h18, h16, h17
0x0000ff120d02bf78: smov x13, v18.h[0]
0x0000ff120d02bf7c: add x10, x18, x10
0x0000ff120d02bf80: strh w13, [x10, #16]
....
0x0000ff120d02bfa4: mov z16.h, #1024
....
0x0000ff120d02bff0: ld1h {z18.h}, p7/z, [x13]
0x0000ff120d02bff4: fadd z18.h, z18.h, z16.h
0x0000ff120d02bff8: add x11, x18, x11
0x0000ff120d02bffc: add x13, x11, #0x10
0x0000ff120d02c000: st1h {z18.h}, p7, [x13]
0x0000ff120d02c004: add x13, x12, #0x30
0x0000ff120d02c008: ld1h {z18.h}, p7/z, [x13]
0x0000ff120d02c00c: fadd z18.h, z18.h, z16.h
0x0000ff120d02c010: add x13, x11, #0x30
0x0000ff120d02c014: st1h {z18.h}, p7, [x13]
....
-------------
PR Comment: https://git.openjdk.org/jdk/pull/26589#issuecomment-3199646701