RFR: 8372980: [VectorAPI] AArch64: Add intrinsic support for unsigned min/max reduction operations [v4]
Paul Sandoz
psandoz at openjdk.org
Tue Jan 27 18:41:13 UTC 2026
On Mon, 26 Jan 2026 09:26:35 GMT, Eric Fang <erfang at openjdk.org> wrote:
>> This patch adds intrinsic support for UMIN and UMAX reduction operations in the Vector API on AArch64, enabling direct hardware instruction mapping for better performance.
>>
>> Changes:
>> --------
>>
>> 1. C2 mid-end:
>> - Added UMinReductionVNode and UMaxReductionVNode
>>
>> 2. AArch64 Backend:
>> - Added uminp/umaxp/sve_uminv/sve_umaxv instructions
>> - Updated match rules for all vector sizes and element types
>> - Both NEON and SVE implementations are supported
>>
>> 3. Test:
>> - Added UMIN_REDUCTION_V and UMAX_REDUCTION_V to IRNode.java
>> - Added assembly tests in aarch64-asmtest.py for new instructions
>> - Added a JTReg test file VectorUMinMaxReductionTest.java
>>
>> Different configurations were tested on aarch64 and x86 machines, and all tests passed.
>>
>> Test results of JMH benchmarks from the panama-vector project:
>> --------
>>
>> On a Nvidia Grace machine with 128-bit SVE:
>>
>> Benchmark                       Unit    Before  Error   After     Error   Uplift
>> Byte128Vector.UMAXLanes         ops/ms  411.60  42.18   25226.51  33.92   61.29
>> Byte128Vector.UMAXMaskedLanes   ops/ms  558.56  85.12   25182.90  28.74   45.09
>> Byte128Vector.UMINLanes         ops/ms  645.58  780.76  28396.29  103.11  43.99
>> Byte128Vector.UMINMaskedLanes   ops/ms  621.09  718.27  26122.62  42.68   42.06
>> Byte64Vector.UMAXLanes          ops/ms  296.33  34.44   14357.74  15.95   48.45
>> Byte64Vector.UMAXMaskedLanes    ops/ms  376.54  44.01   14269.24  21.41   37.90
>> Byte64Vector.UMINLanes          ops/ms  373.45  426.51  15425.36  66.20   41.31
>> Byte64Vector.UMINMaskedLanes    ops/ms  353.32  346.87  14201.37  13.79   40.19
>> Int128Vector.UMAXLanes          ops/ms  174.79  192.51  9906.07   286.93  56.67
>> Int128Vector.UMAXMaskedLanes    ops/ms  157.23  206.68  10246.77  11.44   65.17
>> Int64Vector.UMAXLanes           ops/ms  95.30   126.49  4719.30   98.57   49.52
>> Int64Vector.UMAXMaskedLanes     ops/ms  88.19   87.44   4693.18   19.76   53.22
>> Long128Vector.UMAXLanes         ops/ms  80.62   97.82   5064.01   35.52   62.82
>> Long128Vector.UMAXMaskedLanes   ops/ms  78.15   102.91  5028.24   8.74    64.34
>> Long64Vector.UMAXLanes          ops/ms  47.56   62.01   46.76     52.28   0.98
>> Long64Vector.UMAXMaskedLanes    ops/ms  45.44   46.76   45.79     42.91   1.01
>> Short128Vector.UMAXLanes        ops/ms  316.65  410.30  14814.82  23.65   46.79
>> ...
>
> Eric Fang has updated the pull request incrementally with one additional commit since the last revision:
>
> Move helper functions into c2_MacroAssembler_aarch64.hpp
Code generally flows, though not often, from jdk/master to panama-vector/vectorIntrinsics, since most of the development work now happens in the mainline (exceptions are the float16 and Valhalla alignment work, which are large efforts).
I am very reluctant to include all the auto-generated micro benchmarks in mainline. There is a huge number of them, and I am not certain they provide as much value as they did now that we have the IR test framework. In many cases, given the simplicity of what they measure, they were designed to ensure C2 generates the right instructions. The IR test framework is better at determining that, by testing that the right IR nodes are generated, and those tests run as part of the existing HotSpot test suite.
The IR test framework is, of course, no general substitute for performance tests. A better focus for Vector API performance tests is, I think, Emanuel's work [here](https://github.com/openjdk/jdk/pull/28639/) and use cases/algorithms that can be implemented concisely.
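For context on what the new reduction nodes compute: a UMIN/UMAX reduction compares lanes as unsigned values, which a user would reach via something like `vector.reduceLanes(VectorOperators.UMIN)`. A minimal scalar model of those semantics in plain Java (names `uminReduce`/`umaxReduce` are illustrative, not from the patch):

```java
public class UnsignedMinMaxReduction {
    // Scalar model of a UMIN reduction: lanes compare as unsigned values,
    // so 0xFFFFFFFF (-1 as a signed int) is the largest value, not the smallest.
    static int uminReduce(int[] lanes) {
        int acc = -1; // all-ones: the unsigned maximum, identity for UMIN
        for (int v : lanes) {
            acc = Integer.compareUnsigned(v, acc) < 0 ? v : acc;
        }
        return acc;
    }

    // Scalar model of a UMAX reduction.
    static int umaxReduce(int[] lanes) {
        int acc = 0; // zero: the unsigned minimum, identity for UMAX
        for (int v : lanes) {
            acc = Integer.compareUnsigned(v, acc) > 0 ? v : acc;
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] lanes = { 3, -1, 7, 0 }; // -1 is 0xFFFFFFFF when viewed unsigned
        System.out.println(uminReduce(lanes)); // prints 0
        System.out.println(umaxReduce(lanes)); // prints -1 (0xFFFFFFFF unsigned)
    }
}
```

A signed MIN/MAX reduction over the same lanes would instead return -1 and 7, which is exactly the distinction the new UMinReductionVNode/UMaxReductionVNode carry into the backend.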
-------------
PR Comment: https://git.openjdk.org/jdk/pull/28693#issuecomment-3806851359
More information about the core-libs-dev
mailing list