RFR: 8372980: [VectorAPI] AArch64: Add intrinsic support for unsigned min/max reduction operations [v2]

Tue Jan 20 10:01:31 UTC 2026

> This patch adds intrinsic support for UMIN and UMAX reduction operations in the Vector API on AArch64, enabling direct hardware instruction mapping for better performance.
> 
> Changes:
> --------
> 
> 1. C2 mid-end:
>    - Added UMinReductionVNode and UMaxReductionVNode
> 
> 2. AArch64 Backend:
>    - Added uminp/umaxp/sve_uminv/sve_umaxv instructions
>    - Updated match rules for all vector sizes and element types
>    - Both NEON and SVE implementations are supported
> 
> 3. Test:
>    - Added UMIN_REDUCTION_V and UMAX_REDUCTION_V to IRNode.java
>    - Added assembly tests in aarch64-asmtest.py for new instructions
>    - Added a JTReg test file VectorUMinMaxReductionTest.java
> 
> Different configurations were tested on aarch64 and x86 machines, and all tests passed.
> 
> Test results of JMH benchmarks from the panama-vector project:
> --------
> 
> On an Nvidia Grace machine with 128-bit SVE (Uplift = After / Before):
> 
> Benchmark                       Unit    Before  Error   After           Error   Uplift
> Byte128Vector.UMAXLanes         ops/ms  411.60  42.18   25226.51        33.92   61.29
> Byte128Vector.UMAXMaskedLanes   ops/ms  558.56  85.12   25182.90        28.74   45.09
> Byte128Vector.UMINLanes         ops/ms  645.58  780.76  28396.29        103.11  43.99
> Byte128Vector.UMINMaskedLanes   ops/ms  621.09  718.27  26122.62        42.68   42.06
> Byte64Vector.UMAXLanes          ops/ms  296.33  34.44   14357.74        15.95   48.45
> Byte64Vector.UMAXMaskedLanes    ops/ms  376.54  44.01   14269.24        21.41   37.90
> Byte64Vector.UMINLanes          ops/ms  373.45  426.51  15425.36        66.20   41.31
> Byte64Vector.UMINMaskedLanes    ops/ms  353.32  346.87  14201.37        13.79   40.19
> Int128Vector.UMAXLanes          ops/ms  174.79  192.51  9906.07         286.93  56.67
> Int128Vector.UMAXMaskedLanes    ops/ms  157.23  206.68  10246.77        11.44   65.17
> Int64Vector.UMAXLanes           ops/ms  95.30   126.49  4719.30         98.57   49.52
> Int64Vector.UMAXMaskedLanes     ops/ms  88.19   87.44   4693.18         19.76   53.22
> Long128Vector.UMAXLanes         ops/ms  80.62   97.82   5064.01         35.52   62.82
> Long128Vector.UMAXMaskedLanes   ops/ms  78.15   102.91  5028.24         8.74    64.34
> Long64Vector.UMAXLanes          ops/ms  47.56   62.01   46.76           52.28   0.98
> Long64Vector.UMAXMaskedLanes    ops/ms  45.44   46.76   45.79           42.91   1.01
> Short128Vector.UMAXLanes        ops/ms  316.65  410.30  14814.82        23.65   46.79
> Short128Vector.UMAXMaskedLanes  ops/ms  308.90  351.78  15155.26        31.03   49.06
> Sh...

Eric Fang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits:

 - Rebase commit 56d7b52
 - Merge branch 'master' into JDK-8372980-umin-umax-intrinsic
 - 8372980: [VectorAPI] AArch64: Add intrinsic support for unsigned min/max reduction operations

   This patch adds intrinsic support for UMIN and UMAX reduction operations
   in the Vector API on AArch64, enabling direct hardware instruction mapping
   for better performance.

   Changes:
   --------

   1. C2 mid-end:
      - Added UMinReductionVNode and UMaxReductionVNode

   2. AArch64 Backend:
      - Added uminp/umaxp/sve_uminv/sve_umaxv instructions
      - Updated match rules for all vector sizes and element types
      - Both NEON and SVE implementations are supported

   3. Test:
      - Added UMIN_REDUCTION_V and UMAX_REDUCTION_V to IRNode.java
      - Added assembly tests in aarch64-asmtest.py for new instructions
      - Added a JTReg test file VectorUMinMaxReductionTest.java

   Different configurations were tested on aarch64 and x86 machines, and
   all tests passed.

   Test results of JMH benchmarks from the panama-vector project:
   --------

   On an Nvidia Grace machine with 128-bit SVE (Uplift = After / Before):
   ```
   Benchmark			Unit	Before	Error	After		Error	Uplift
   Byte128Vector.UMAXLanes		ops/ms	411.60	42.18	25226.51	33.92	61.29
   Byte128Vector.UMAXMaskedLanes	ops/ms	558.56	85.12	25182.90	28.74	45.09
   Byte128Vector.UMINLanes		ops/ms	645.58	780.76	28396.29	103.11	43.99
   Byte128Vector.UMINMaskedLanes	ops/ms	621.09	718.27	26122.62	42.68	42.06
   Byte64Vector.UMAXLanes		ops/ms	296.33	34.44	14357.74	15.95	48.45
   Byte64Vector.UMAXMaskedLanes	ops/ms	376.54	44.01	14269.24	21.41	37.90
   Byte64Vector.UMINLanes		ops/ms	373.45	426.51	15425.36	66.20	41.31
   Byte64Vector.UMINMaskedLanes	ops/ms	353.32	346.87	14201.37	13.79	40.19
   Int128Vector.UMAXLanes		ops/ms	174.79	192.51	9906.07		286.93	56.67
   Int128Vector.UMAXMaskedLanes	ops/ms	157.23	206.68	10246.77	11.44	65.17
   Int64Vector.UMAXLanes		ops/ms	95.30	126.49	4719.30		98.57	49.52
   Int64Vector.UMAXMaskedLanes	ops/ms	88.19	87.44	4693.18		19.76	53.22
   Long128Vector.UMAXLanes		ops/ms	80.62	97.82	5064.01		35.52	62.82
   Long128Vector.UMAXMaskedLanes	ops/ms	78.15	102.91	5028.24		8.74	64.34
   Long64Vector.UMAXLanes		ops/ms	47.56	62.01	46.76		52.28	0.98
   Long64Vector.UMAXMaskedLanes	ops/ms	45.44	46.76	45.79		42.91	1.01
   Short128Vector.UMAXLanes	ops/ms	316.65	410.30	14814.82	23.65	46.79
   Short128Vector.UMAXMaskedLanes	ops/ms	308.90	351.78	15155.26	31.03	49.06
   Short64Vector.UMAXLanes		ops/ms	190.38	245.09	8022.46		14.30	42.14
   Short64Vector.UMAXMaskedLanes	ops/ms	195.54	36.15	7930.28		11.88	40.56
   ```

   On an Nvidia Grace machine with 128-bit NEON:
   ```
   Benchmark			Unit	Before	Error	After		Error	Uplift
   Byte128Vector.UMAXLanes		ops/ms	414.69	42.52	25257.61	25.91	60.91
   Byte128Vector.UMAXMaskedLanes	ops/ms	552.00	56.61	23063.14	304.45	41.78
   Byte128Vector.UMINLanes		ops/ms	634.98	849.04	28444.37	180.80	44.80
   Byte128Vector.UMINMaskedLanes	ops/ms	612.88	735.18	26127.07	27.99	42.63
   Byte64Vector.UMAXLanes		ops/ms	291.53	32.19	13893.62	28.09	47.66
   Byte64Vector.UMAXMaskedLanes	ops/ms	363.34	48.17	13290.59	12.53	36.58
   Byte64Vector.UMINLanes		ops/ms	368.70	433.60	15416.90	15.80	41.81
   Byte64Vector.UMINMaskedLanes	ops/ms	350.46	371.05	14524.29	121.63	41.44
   Int128Vector.UMAXLanes		ops/ms	177.67	201.38	10182.82	20.21	57.31
   Int128Vector.UMAXMaskedLanes	ops/ms	155.25	187.88	9194.13		393.35	59.22
   Int64Vector.UMAXLanes		ops/ms	93.93	115.02	5106.79		4.54	54.37
   Int64Vector.UMAXMaskedLanes	ops/ms	87.01	88.50	4405.87		8.06	50.63
   Long128Vector.UMAXLanes		ops/ms	80.32	98.50	3229.80		40.53	40.21
   Long128Vector.UMAXMaskedLanes	ops/ms	77.65	103.25	3161.50		4.45	40.72
   Long64Vector.UMAXLanes		ops/ms	47.72	65.38	46.41		50.38	0.97
   Long64Vector.UMAXMaskedLanes	ops/ms	45.26	47.46	45.13		47.23	1.00
   Short128Vector.UMAXLanes	ops/ms	316.09	429.34	14748.07	14.78	46.66
   Short128Vector.UMAXMaskedLanes	ops/ms	307.70	342.54	14359.11	44.99	46.67
   Short64Vector.UMAXLanes		ops/ms	187.67	253.01	8180.63		178.65	43.59
   Short64Vector.UMAXMaskedLanes	ops/ms	191.10	33.51	7949.19		108.65	41.60
   ```
 - 8372978: [VectorAPI] Fix incorrect identity values in UMIN/UMAX reductions

   The original implementation of UMIN/UMAX reductions in JDK-8346174
   used incorrect identity values in the Java implementation and test code.

   Problem:
   --------
   UMIN was using MAX_OR_INF (signed maximum value) as the identity:
     - Byte.MAX_VALUE (127) instead of max unsigned byte (255)
     - Short.MAX_VALUE (32767) instead of max unsigned short (65535)
     - Integer.MAX_VALUE instead of max unsigned int (-1)
     - Long.MAX_VALUE instead of max unsigned long (-1)

   UMAX was using MIN_OR_INF (signed minimum value) as the identity:
     - Byte.MIN_VALUE (-128) instead of 0
     - Short.MIN_VALUE (-32768) instead of 0
     - Integer.MIN_VALUE instead of 0
     - Long.MIN_VALUE instead of 0

   This caused incorrect results. For example, with byte lanes:
     UMAX([42,42,...,42]) returned 128 (Byte.MIN_VALUE read as unsigned)
     instead of 42

   Solution:
   ---------
   Use correct unsigned identity values:
     - UMIN: ($type$)-1 (maximum unsigned value)
     - UMAX: ($type$)0 (minimum unsigned value)
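
   The effect of the identity value can be reproduced with a scalar model
   of the reduction. The following is only an illustrative sketch, not the
   code from X-Vector.java.template: the class and method names
   (UnsignedReductionIdentity, umaxReduce, uminReduce) are hypothetical,
   and the fold is a plain loop over byte lanes using
   Byte.compareUnsigned.
   ```java
   import java.util.Arrays;

   // Hypothetical scalar model of an unsigned min/max reduction over byte lanes.
   class UnsignedReductionIdentity {
       // Fold the lanes with an unsigned "max", starting from the given identity.
       static byte umaxReduce(byte identity, byte[] lanes) {
           byte acc = identity;
           for (byte lane : lanes) {
               acc = (Byte.compareUnsigned(lane, acc) > 0) ? lane : acc;
           }
           return acc;
       }

       // Fold the lanes with an unsigned "min", starting from the given identity.
       static byte uminReduce(byte identity, byte[] lanes) {
           byte acc = identity;
           for (byte lane : lanes) {
               acc = (Byte.compareUnsigned(lane, acc) < 0) ? lane : acc;
           }
           return acc;
       }

       public static void main(String[] args) {
           byte[] small = new byte[16];
           Arrays.fill(small, (byte) 42);
           byte[] large = new byte[16];
           Arrays.fill(large, (byte) 200); // bit pattern 0xC8, 200 when unsigned

           // Signed identities: the identity itself wins the unsigned compare.
           System.out.println(Byte.toUnsignedInt(umaxReduce(Byte.MIN_VALUE, small)));
           System.out.println(Byte.toUnsignedInt(uminReduce(Byte.MAX_VALUE, large)));

           // Unsigned identities: the lane values are returned unchanged.
           System.out.println(Byte.toUnsignedInt(umaxReduce((byte) 0, small)));
           System.out.println(Byte.toUnsignedInt(uminReduce((byte) -1, large)));
       }
   }
   ```
   In this model the signed identities leak into the result whenever every
   lane compares below (UMAX) or above (UMIN) the identity as an unsigned
   value, which matches the UMAX example above.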

   Changes:
   --------
   - X-Vector.java.template: Fixed identity values in reductionOperations
   - gen-template.sh: Fixed identity values for test code generation
   - templates/Unit-header.template: Updated copyright year to 2025
   - Regenerated all Vector classes and test files

   Testing:
   --------
   All types (byte/short/int/long) now return correct results in both
   interpreter mode (-Xint) and compiled mode.

-------------

Changes: https://git.openjdk.org/jdk/pull/28693/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=28693&range=01
  Stats: 1101 lines in 12 files changed: 685 ins; 16 del; 400 mod
  Patch: https://git.openjdk.org/jdk/pull/28693.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/28693/head:pull/28693

PR: https://git.openjdk.org/jdk/pull/28693