RFR: 8372980: [VectorAPI] AArch64: Add intrinsic support for unsigned min/max reduction operations [v2]

Tue Jan 20 19:28:04 UTC 2026

On Tue, 20 Jan 2026 10:01:31 GMT, Eric Fang <erfang at openjdk.org> wrote:

>> This patch adds intrinsic support for UMIN and UMAX reduction operations in the Vector API on AArch64, enabling direct hardware instruction mapping for better performance.
>> 
>> Changes:
>> --------
>> 
>> 1. C2 mid-end:
>>    - Added UMinReductionVNode and UMaxReductionVNode
>> 
>> 2. AArch64 Backend:
>>    - Added uminp/umaxp/sve_uminv/sve_umaxv instructions
>>    - Updated match rules for all vector sizes and element types
>>    - Both NEON and SVE implementations are supported
>> 
>> 3. Test:
>>    - Added UMIN_REDUCTION_V and UMAX_REDUCTION_V to IRNode.java
>>    - Added assembly tests in aarch64-asmtest.py for new instructions
>>    - Added a JTReg test file VectorUMinMaxReductionTest.java
>> 
>> Different configurations were tested on aarch64 and x86 machines, and all tests passed.
>> 
>> Test results of JMH benchmarks from the panama-vector project:
>> --------
>> 
>> On an NVIDIA Grace machine with 128-bit SVE:
>> 
>> Benchmark                       Unit    Before  Error   After           Error   Uplift
>> Byte128Vector.UMAXLanes         ops/ms  411.60  42.18   25226.51        33.92   61.29
>> Byte128Vector.UMAXMaskedLanes   ops/ms  558.56  85.12   25182.90        28.74   45.09
>> Byte128Vector.UMINLanes         ops/ms  645.58  780.76  28396.29        103.11  43.99
>> Byte128Vector.UMINMaskedLanes   ops/ms  621.09  718.27  26122.62        42.68   42.06
>> Byte64Vector.UMAXLanes          ops/ms  296.33  34.44   14357.74        15.95   48.45
>> Byte64Vector.UMAXMaskedLanes    ops/ms  376.54  44.01   14269.24        21.41   37.90
>> Byte64Vector.UMINLanes          ops/ms  373.45  426.51  15425.36        66.20   41.31
>> Byte64Vector.UMINMaskedLanes    ops/ms  353.32  346.87  14201.37        13.79   40.19
>> Int128Vector.UMAXLanes          ops/ms  174.79  192.51  9906.07         286.93  56.67
>> Int128Vector.UMAXMaskedLanes    ops/ms  157.23  206.68  10246.77        11.44   65.17
>> Int64Vector.UMAXLanes           ops/ms  95.30   126.49  4719.30         98.57   49.52
>> Int64Vector.UMAXMaskedLanes     ops/ms  88.19   87.44   4693.18         19.76   53.22
>> Long128Vector.UMAXLanes         ops/ms  80.62   97.82   5064.01         35.52   62.82
>> Long128Vector.UMAXMaskedLanes   ops/ms  78.15   102.91  5028.24         8.74    64.34
>> Long64Vector.UMAXLanes          ops/ms  47.56   62.01   46.76           52.28   0.98
>> Long64Vector.UMAXMaskedLanes    ops/ms  45.44   46.76   45.79           42.91   1.01
>> Short128Vector.UMAXLanes        ops/ms  316.65  410.30  14814.82        23.65   46.79
>> ...
>
> Eric Fang has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits:
> 
>  - Rebase commit 56d7b52
>  - Merge branch 'master' into JDK-8372980-umin-umax-intrinsic
>  - 8372980: [VectorAPI] AArch64: Add intrinsic support for unsigned min/max reduction operations

I'm sorry, I _completely_ overthought that one. All you need are definitions for `min[vp]` and `max[vp]` in `C2_MacroAssembler`.

Like so:

`void minv(bool is_unsigned, ...) { if (is_unsigned) { uminv(...); } else { sminv(...); } }`

No need to mess with class `Assembler`.
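
To make the shape of that suggestion concrete, here is a minimal standalone sketch of the dispatch pattern. It is not HotSpot code: `MockAssembler`, `MockC2MacroAssembler`, the zero-operand signatures, and the helper `emit_reductions` are all hypothetical stand-ins, and the mock only records mnemonics instead of encoding instructions. The real methods would take the register and arrangement operands elided as `...` above.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for class Assembler: each method records the
// mnemonic it would emit. The real methods take operands; these do not.
class MockAssembler {
public:
  std::vector<std::string> emitted;

  void sminv() { emitted.push_back("sminv"); }
  void uminv() { emitted.push_back("uminv"); }
  void smaxv() { emitted.push_back("smaxv"); }
  void umaxv() { emitted.push_back("umaxv"); }
  void sminp() { emitted.push_back("sminp"); }
  void uminp() { emitted.push_back("uminp"); }
  void smaxp() { emitted.push_back("smaxp"); }
  void umaxp() { emitted.push_back("umaxp"); }
};

// Hypothetical stand-in for C2_MacroAssembler: the signed/unsigned choice
// lives in thin wrappers here, so the assembler class itself is untouched.
class MockC2MacroAssembler : public MockAssembler {
public:
  void minv(bool is_unsigned) { if (is_unsigned) { uminv(); } else { sminv(); } }
  void maxv(bool is_unsigned) { if (is_unsigned) { umaxv(); } else { smaxv(); } }
  void minp(bool is_unsigned) { if (is_unsigned) { uminp(); } else { sminp(); } }
  void maxp(bool is_unsigned) { if (is_unsigned) { umaxp(); } else { smaxp(); } }
};

// Illustration helper (an assumption, not part of the patch): calls all four
// wrappers and returns the mnemonics they selected, in call order.
std::vector<std::string> emit_reductions(bool is_unsigned) {
  MockC2MacroAssembler masm;
  masm.minv(is_unsigned);
  masm.maxv(is_unsigned);
  masm.minp(is_unsigned);
  masm.maxp(is_unsigned);
  return masm.emitted;
}
```

Callers in the reduction code would then pass a single `is_unsigned` flag instead of branching on the instruction variant at every call site.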

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28693#issuecomment-3774562567