RFR: 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction [v4]

Tue Feb 24 03:15:14 UTC 2026

> When optimizing some VectorMask-related APIs, we found an optimization opportunity related to the `cpy (immediate, zeroing)` instruction [1]. Implementing the functionality of this instruction with the `cpy (immediate, merging)` instruction [2] gives better performance.
> 
> Currently the `cpy (imm, zeroing)` instruction is used in code generated for `VectorStoreMaskNode` and `VectorReinterpretNode`. This optimization can therefore benefit any Vector API operation that generates these two IR nodes, such as `VectorMask.intoArray()` and `VectorMask.toLong()`.
> 
> Microbenchmarks show this change brings a performance uplift ranging from **11%** to **33%**, depending on the specific operation and data type.
> 
> The specific changes in this PR:
> 1. Implement the functionality of the `cpy (imm, zeroing)` instruction with a `movi` + `cpy (imm, merging)` instruction pair in the assembler:
> 
> cpy  z17.d, p1/z, #1 =>
> 
> movi v17.2d, #0       // this instruction is zero cost
> cpy  z17.d, p1/m, #1
> 
> 
> 2. Add a new option `PreferSVEMergingModeCPY` to indicate whether to apply this optimization or not.
> - This option belongs to the Arch product category.
> - The default value is true on Neoverse-V1/V2 where the improvement has been confirmed, false on others.
> - When its value is true, the change is applied.
> 
> 3. Add a jtreg test to verify the behavior of this option.
> 
> This PR was tested on aarch64 and x86 machines with different configurations, and all tests passed.
> 
> JMH benchmarks:
> 
> On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
> 
> Benchmark	        	Unit	size	Before		Error	After		Error	Uplift
> byteIndexInRange		ops/ms	7.00	471816.15	1125.96	473237.77	1593.92	1.00
> byteIndexInRange		ops/ms	256.00	149654.21	416.57	149259.95	116.59	1.00
> byteIndexInRange		ops/ms	259.00	177850.31	991.13	179785.19	1110.07	1.01
> byteIndexInRange		ops/ms	512.00	133393.26	167.26	133484.61	281.83	1.00
> doubleIndexInRange		ops/ms	7.00	302176.39	12848.8	299813.02	37.76	0.99
> doubleIndexInRange		ops/ms	256.00	47831.93	56.70	46708.70	56.11	0.98
> doubleIndexInRange		ops/ms	259.00	11550.02	27.95	15333.50	10.40	1.33
> doubleIndexInRange		ops/ms	512.00	23687.76	61.65	23996.08	69.52	1.01
> floatIndexInRange		ops/ms	7.00	412195.79	124.71	411770.23	78.73	1.00
> floatIndexInRange		ops/ms	256.00	84479.98	70.69	84237.31	70.15	1.00
> floatIndexInRange		ops/ms	259.00	22585.65	80.07	28296.21	7.98	1.25
> floatIndexInRange		ops/ms	512.00	46902.99	51.60	46686.68	66.01	1.00
> intIndexInRange			ops/ms	7.00	413411.70	50.59	420684.66	253.55	1.02
> intIndexInRange			ops/...

Eric Fang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits:

 - Revert renaming sve_cpy as sve_cpy_optimized
 - Merge branch 'master' into JDK-8374349-sve-cpy-opt
 - Refine the code comments
 - Move the implementation into C2_MacroAssembler
 - 8374349: [VectorAPI]: AArch64: Prefer merging mode SVE CPY instruction

   When optimizing some VectorMask-related APIs, we found an optimization
   opportunity related to the `cpy (immediate, zeroing)` instruction [1].
   Implementing the functionality of this instruction with the
   `cpy (immediate, merging)` instruction [2] gives better performance.

   Currently the `cpy (imm, zeroing)` instruction is used in code generated
   for `VectorStoreMaskNode` and `VectorReinterpretNode`. This optimization
   can therefore benefit any Vector API operation that generates these two
   IR nodes, such as `VectorMask.intoArray()` and `VectorMask.toLong()`.

   Microbenchmarks show this change brings a performance uplift ranging from
   **11%** to **33%**, depending on the specific operation and data type.

   The specific changes in this PR:
   1. Implement the functionality of the `cpy (imm, zeroing)` instruction
   with a `movi` + `cpy (imm, merging)` instruction pair in the assembler:
   ```
   cpy  z17.d, p1/z, #1 =>

   movi v17.2d, #0       // this instruction is zero cost
   cpy  z17.d, p1/m, #1
   ```

   2. Add a new option `PreferSVEMergingModeCPY` to indicate whether to
   apply this optimization or not.
   - This option belongs to the Arch product category.
   - The default value is true on Neoverse-V1/V2 where the improvement
     has been confirmed, false on others.
   - When its value is true, the change is applied.

   3. Add a jtreg test to verify the behavior of this option.
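
The rewrite in item 1 relies on the zeroing and merging forms of `cpy` differing only in what happens to inactive lanes. The following is a minimal sketch of that equivalence in plain Java, not the HotSpot implementation: the class `CpyModel` and its method names are hypothetical, a vector register is modeled as a `long[]` of lanes, and the predicate as a `boolean[]`.

```java
import java.util.Arrays;

public class CpyModel {
    // Model of `cpy zd, pg/z, #imm`: active lanes get imm, inactive lanes become 0.
    static long[] cpyZeroing(boolean[] pred, long imm) {
        long[] dst = new long[pred.length];
        for (int i = 0; i < pred.length; i++) {
            dst[i] = pred[i] ? imm : 0L;
        }
        return dst;
    }

    // Model of `cpy zd, pg/m, #imm`: active lanes get imm, inactive lanes keep
    // whatever value the destination already held.
    static void cpyMerging(long[] dst, boolean[] pred, long imm) {
        for (int i = 0; i < pred.length; i++) {
            if (pred[i]) {
                dst[i] = imm;
            }
        }
    }

    // Model of the replacement sequence: clear the destination first (the `movi #0`),
    // then apply the merging copy. The previous contents of dst must not leak through.
    static long[] moviThenCpyMerging(long[] oldDst, boolean[] pred, long imm) {
        long[] dst = oldDst.clone();
        Arrays.fill(dst, 0L);      // movi v.2d, #0
        cpyMerging(dst, pred, imm); // cpy z.d, p/m, #imm
        return dst;
    }

    public static void main(String[] args) {
        boolean[] p1 = {true, false};      // two 64-bit lanes, first one active
        long[] stale = {0xDEADL, 0xBEEFL}; // arbitrary old register contents
        long[] a = cpyZeroing(p1, 1L);
        long[] b = moviThenCpyMerging(stale, p1, 1L);
        System.out.println(Arrays.equals(a, b));
    }
}
```

Without the clearing step, the merging form alone would leave the stale value in the inactive lane, which is why the sequence needs the extra `movi`.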

   This PR was tested on aarch64 and x86 machines with different
   configurations, and all tests passed.

   JMH benchmarks:

   On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
   ```
   Benchmark	        	Unit	size	Before		Error	After		Error	Uplift
   byteIndexInRange		ops/ms	7.00	471816.15	1125.96	473237.77	1593.92	1.00
   byteIndexInRange		ops/ms	256.00	149654.21	416.57	149259.95	116.59	1.00
   byteIndexInRange		ops/ms	259.00	177850.31	991.13	179785.19	1110.07	1.01
   byteIndexInRange		ops/ms	512.00	133393.26	167.26	133484.61	281.83	1.00
   doubleIndexInRange		ops/ms	7.00	302176.39	12848.8	299813.02	37.76	0.99
   doubleIndexInRange		ops/ms	256.00	47831.93	56.70	46708.70	56.11	0.98
   doubleIndexInRange		ops/ms	259.00	11550.02	27.95	15333.50	10.40	1.33
   doubleIndexInRange		ops/ms	512.00	23687.76	61.65	23996.08	69.52	1.01
   floatIndexInRange		ops/ms	7.00	412195.79	124.71	411770.23	78.73	1.00
   floatIndexInRange		ops/ms	256.00	84479.98	70.69	84237.31	70.15	1.00
   floatIndexInRange		ops/ms	259.00	22585.65	80.07	28296.21	7.98	1.25
   floatIndexInRange		ops/ms	512.00	46902.99	51.60	46686.68	66.01	1.00
   intIndexInRange			ops/ms	7.00	413411.70	50.59	420684.66	253.55	1.02
   intIndexInRange			ops/ms	256.00	84652.41	191.45	86758.74	193.66	1.02
   intIndexInRange			ops/ms	259.00	61825.20	291.71	62037.58	2355.43	1.00
   intIndexInRange			ops/ms	512.00	46754.89	149.72	46972.06	40.13	1.00
   longIndexInRange		ops/ms	7.00	329385.10	3292.7	318538.75	11103.9	0.97
   longIndexInRange		ops/ms	256.00	46910.36	53.41	46927.82	138.29	1.00
   longIndexInRange		ops/ms	259.00	33126.45	3210.07	32245.59	1347.58	0.97
   longIndexInRange		ops/ms	512.00	23931.64	215.55	23805.65	312.39	0.99
   shortIndexInRange		ops/ms	7.00	479265.67	1055.89	468452.89	433.15	0.98
   shortIndexInRange		ops/ms	256.00	138657.38	317.72	138695.29	505.69	1.00
   shortIndexInRange		ops/ms	259.00	113353.87	913.13	108912.75	1125.60	0.96
   shortIndexInRange		ops/ms	512.00	84652.74	171.37	84447.01	91.99	1.00
   ```

   On an AWS Graviton3 (Neoverse-V1) machine with 128-bit SVE1:
   ```
   Benchmark	        	Unit	size	Before		Error	After		Error	Uplift
   byteIndexInRange		ops/ms	7.00	320073.86	669.91	318557.87	1285.42	1.00
   byteIndexInRange		ops/ms	256.00	119246.71	43.13	120658.01	28.27	1.01
   byteIndexInRange		ops/ms	259.00	137664.23	12001.6	150378.59	70.41	1.09
   byteIndexInRange		ops/ms	512.00	97187.13	18.60	95356.43	78.60	0.98
   doubleIndexInRange		ops/ms	7.00	291076.68	603.08	287383.75	518.59	0.99
   doubleIndexInRange		ops/ms	256.00	57473.11	123.34	61559.58	687.21	1.07
   doubleIndexInRange		ops/ms	259.00	19396.73	40.03	22046.65	8.66	1.14
   doubleIndexInRange		ops/ms	512.00	33619.28	33.58	34715.40	157.72	1.03
   floatIndexInRange		ops/ms	7.00	317295.18	627.76	303857.78	465.78	0.96
   floatIndexInRange		ops/ms	256.00	91734.27	183.61	91851.31	394.35	1.00
   floatIndexInRange		ops/ms	259.00	38103.12	129.44	42237.38	92.17	1.11
   floatIndexInRange		ops/ms	512.00	57219.58	366.00	57769.07	264.71	1.01
   intIndexInRange			ops/ms	7.00	317063.25	830.81	304289.56	541.12	0.96
   intIndexInRange			ops/ms	256.00	91535.60	315.36	98143.40	142.44	1.07
   intIndexInRange			ops/ms	259.00	73827.89	472.28	73781.80	21.53	1.00
   intIndexInRange			ops/ms	512.00	57552.09	20.19	62348.87	37.45	1.08
   longIndexInRange		ops/ms	7.00	301886.14	381.89	301636.82	184.80	1.00
   longIndexInRange		ops/ms	256.00	62246.77	69.29	62093.75	88.72	1.00
   longIndexInRange		ops/ms	259.00	40642.36	861.47	41566.43	256.04	1.02
   longIndexInRange		ops/ms	512.00	34850.70	154.39	34884.42	149.17	1.00
   shortIndexInRange		ops/ms	7.00	318133.03	593.20	313469.12	528.73	0.99
   shortIndexInRange		ops/ms	256.00	105019.58	21.38	105014.90	21.81	1.00
   shortIndexInRange		ops/ms	259.00	116235.93	1985.27	118697.74	48.41	1.02
   shortIndexInRange		ops/ms	512.00	91981.84	166.84	91874.82	78.28	1.00
   ```
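
The Uplift column in the tables above appears to be the ratio After / Before, rounded to two decimals (throughput in ops/ms, so values above 1.00 are improvements). A small sketch that recomputes a few rows under that assumption; the class `Uplift` and its helpers are hypothetical and only the Before/After figures come from the tables.

```java
import java.util.Locale;

public class Uplift {
    // Throughput ratio: values above 1.0 mean the patched build is faster.
    static double uplift(double before, double after) {
        return after / before;
    }

    // Format with two decimals, matching the precision used in the tables.
    static String fmt(double value) {
        return String.format(Locale.ROOT, "%.2f", value);
    }

    public static void main(String[] args) {
        // Neoverse-V2, doubleIndexInRange, size 259
        System.out.println(fmt(uplift(11550.02, 15333.50)));
        // Neoverse-V2, floatIndexInRange, size 259
        System.out.println(fmt(uplift(22585.65, 28296.21)));
        // Neoverse-V1, doubleIndexInRange, size 259
        System.out.println(fmt(uplift(19396.73, 22046.65)));
    }
}
```

The three rows chosen here are the largest gains on each machine and correspond to the 33%, 25%, and 14% figures in the tables.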

   [1] https://developer.arm.com/documentation/ddi0602/2025-06/SVE-Instructions/CPY--immediate--zeroing---Copy-signed-integer-immediate-to-vector-elements--zeroing--?lang=en
   [2] https://developer.arm.com/documentation/ddi0602/2025-12/SVE-Instructions/CPY--immediate--merging---Copy-signed-integer-immediate-to-vector-elements--merging--?lang=en

-------------

Changes: https://git.openjdk.org/jdk/pull/29359/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=29359&range=03
  Stats: 35 lines in 4 files changed: 31 ins; 0 del; 4 mod
  Patch: https://git.openjdk.org/jdk/pull/29359.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/29359/head:pull/29359

PR: https://git.openjdk.org/jdk/pull/29359