RFR: 8372980: [VectorAPI] AArch64: Add intrinsic support for unsigned min/max reduction operations [v4]
Paul Sandoz
psandoz at openjdk.org
Tue Jan 27 18:41:13 UTC 2026
On Mon, 26 Jan 2026 09:26:35 GMT, Eric Fang <erfang at openjdk.org> wrote:
>> This patch adds intrinsic support for UMIN and UMAX reduction operations in the Vector API on AArch64, enabling direct hardware instruction mapping for better performance.
>>
>> Changes:
>> --------
>>
>> 1. C2 mid-end:
>> - Added UMinReductionVNode and UMaxReductionVNode
>>
>> 2. AArch64 Backend:
>> - Added uminp/umaxp/sve_uminv/sve_umaxv instructions
>> - Updated match rules for all vector sizes and element types
>> - Both NEON and SVE implementations are supported
>>
>> 3. Test:
>> - Added UMIN_REDUCTION_V and UMAX_REDUCTION_V to IRNode.java
>> - Added assembly tests in aarch64-asmtest.py for new instructions
>> - Added a JTReg test file VectorUMinMaxReductionTest.java
>>
>> Different configurations were tested on aarch64 and x86 machines, and all tests passed.
>>
>> Test results of JMH benchmarks from the panama-vector project:
>> --------
>>
>> On a Nvidia Grace machine with 128-bit SVE:
>>
>> Benchmark                       Unit    Before  Error   After     Error   Uplift
>> Byte128Vector.UMAXLanes         ops/ms  411.60  42.18   25226.51  33.92   61.29
>> Byte128Vector.UMAXMaskedLanes   ops/ms  558.56  85.12   25182.90  28.74   45.09
>> Byte128Vector.UMINLanes         ops/ms  645.58  780.76  28396.29  103.11  43.99
>> Byte128Vector.UMINMaskedLanes   ops/ms  621.09  718.27  26122.62  42.68   42.06
>> Byte64Vector.UMAXLanes          ops/ms  296.33  34.44   14357.74  15.95   48.45
>> Byte64Vector.UMAXMaskedLanes    ops/ms  376.54  44.01   14269.24  21.41   37.90
>> Byte64Vector.UMINLanes          ops/ms  373.45  426.51  15425.36  66.20   41.31
>> Byte64Vector.UMINMaskedLanes    ops/ms  353.32  346.87  14201.37  13.79   40.19
>> Int128Vector.UMAXLanes          ops/ms  174.79  192.51  9906.07   286.93  56.67
>> Int128Vector.UMAXMaskedLanes    ops/ms  157.23  206.68  10246.77  11.44   65.17
>> Int64Vector.UMAXLanes           ops/ms  95.30   126.49  4719.30   98.57   49.52
>> Int64Vector.UMAXMaskedLanes     ops/ms  88.19   87.44   4693.18   19.76   53.22
>> Long128Vector.UMAXLanes         ops/ms  80.62   97.82   5064.01   35.52   62.82
>> Long128Vector.UMAXMaskedLanes   ops/ms  78.15   102.91  5028.24   8.74    64.34
>> Long64Vector.UMAXLanes          ops/ms  47.56   62.01   46.76     52.28   0.98
>> Long64Vector.UMAXMaskedLanes    ops/ms  45.44   46.76   45.79     42.91   1.01
>> Short128Vector.UMAXLanes        ops/ms  316.65  410.30  14814.82  23.65   46.79
>> ...
>
> Eric Fang has updated the pull request incrementally with one additional commit since the last revision:
>
> Move helper functions into c2_MacroAssembler_aarch64.hpp
Code generally flows, though not often, from jdk/master to panama-vector/vectorIntrinsics, since most of the development work now happens in the mainline (exceptions are the float16 and Valhalla alignment work, which are large efforts).
I am very reluctant to include all the auto-generated micro benchmarks in mainline. There is a huge number of them, and I am not certain they provide as much value as they did now that we have the IR test framework. In many cases, given the simplicity of what they measure, they were designed to ensure C2 generates the right instructions. The IR test framework is better at determining that, by testing that the right IR nodes are generated, and those tests run as part of the existing HotSpot test suite.
The IR test framework is, of course, no general substitute for performance tests. A better focus for Vector API performance tests is, I think, Emanuel's work [here](https://github.com/openjdk/jdk/pull/28639/) and use cases/algorithms that can be implemented concisely.
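For context on what the new reduction nodes compute: a UMIN/UMAX reduction compares lanes as unsigned values, which a user would reach via something like `vector.reduceLanes(VectorOperators.UMIN)`. A minimal scalar model of those semantics in plain Java (names `uminReduce`/`umaxReduce` are illustrative, not from the patch):

```java
public class UnsignedMinMaxReduction {
    // Scalar model of a UMIN reduction: lanes compare as unsigned values,
    // so 0xFFFFFFFF (-1 as a signed int) is the largest value, not the smallest.
    static int uminReduce(int[] lanes) {
        int acc = -1; // all-ones: the unsigned maximum, identity for UMIN
        for (int v : lanes) {
            acc = Integer.compareUnsigned(v, acc) < 0 ? v : acc;
        }
        return acc;
    }

    // Scalar model of a UMAX reduction.
    static int umaxReduce(int[] lanes) {
        int acc = 0; // zero: the unsigned minimum, identity for UMAX
        for (int v : lanes) {
            acc = Integer.compareUnsigned(v, acc) > 0 ? v : acc;
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] lanes = { 3, -1, 7, 0 }; // -1 is 0xFFFFFFFF when viewed unsigned
        System.out.println(uminReduce(lanes)); // prints 0
        System.out.println(umaxReduce(lanes)); // prints -1 (0xFFFFFFFF unsigned)
    }
}
```

A signed MIN/MAX reduction over the same lanes would instead return -1 and 7, which is exactly the distinction the new UMinReductionVNode/UMaxReductionVNode carry into the backend.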
-------------
PR Comment: https://git.openjdk.org/jdk/pull/28693#issuecomment-3806851359
More information about the core-libs-dev
mailing list