RFR: 8372980: [VectorAPI] AArch64: Add intrinsic support for unsigned min/max reduction operations [v4]
Eric Fang
erfang at openjdk.org
Wed Feb 4 02:47:59 UTC 2026
On Thu, 29 Jan 2026 10:27:56 GMT, Andrew Haley <aph at openjdk.org> wrote:
>>> The IR test framework is better at determining that by testing the right IR nodes are generated - and they get run as part of the existing HotSpot test suite.
>>
>> But as a reviewer I'm not looking at the IR at all, but at the performance.
>
>> Hi @theRealAph @PaulSandoz , thanks for your insight! How to synchronize the JMH micro benchmarks between Panama and the mainline may be a more general issue that requires further investigation, design, and resources. As for how to move this PR forward, my idea is to write a new micro benchmark in this PR to demonstrate the optimization effect of this patch. Would that be acceptable to you?
>
> Sure.
Hi @theRealAph, I have added a JMH benchmark file to measure the performance impact of this PR. The test results are as follows:
On an NVIDIA Grace machine with 128-bit SVE2:
Benchmark | Unit | Before | Before Error | After | After Error | Uplift (After/Before)
-- | -- | -- | -- | -- | -- | --
byteUMaxReduction | ops/ms | 460.81 | 0.56 | 21998.62 | 22.65 | 47.74
byteUMaxReductionMasked | ops/ms | 663.51 | 19.76 | 22687.57 | 24.96 | 34.19
byteUMinReduction | ops/ms | 461.01 | 0.59 | 21965.89 | 29.29 | 47.65
byteUMinReductionMasked | ops/ms | 686.28 | 51.44 | 21234.19 | 8.21 | 30.94
intUMaxReduction | ops/ms | 263.48 | 1.05 | 11133.25 | 8.73 | 42.25
intUMaxReductionMasked | ops/ms | 248.97 | 2.80 | 10790.52 | 4.17 | 43.34
intUMinReduction | ops/ms | 264.20 | 1.18 | 11134.88 | 4.75 | 42.15
intUMinReductionMasked | ops/ms | 243.88 | 1.89 | 10797.43 | 3.21 | 44.27
longUMaxReduction | ops/ms | 133.90 | 0.45 | 5239.43 | 3.34 | 39.13
longUMaxReductionMasked | ops/ms | 125.91 | 1.09 | 5218.63 | 6.26 | 41.45
longUMinReduction | ops/ms | 132.30 | 1.22 | 5233.60 | 5.37 | 39.56
longUMinReductionMasked | ops/ms | 126.78 | 0.98 | 5215.54 | 6.51 | 41.14
shortUMaxReduction | ops/ms | 345.13 | 0.63 | 9763.47 | 9.52 | 28.29
shortUMaxReductionMasked | ops/ms | 440.71 | 21.09 | 10595.24 | 4.12 | 24.04
shortUMinReduction | ops/ms | 345.65 | 0.62 | 10138.08 | 3.29 | 29.33
shortUMinReductionMasked | ops/ms | 414.75 | 26.63 | 10252.82 | 4.92 | 24.72
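For reference, the Uplift column in these tables matches the throughput ratio After / Before. A minimal sketch that recomputes one row, assuming that definition; the class and method names are hypothetical and not part of the PR:

```java
import java.util.Locale;

// Hypothetical helper for illustration only; it is not part of the PR.
final class UpliftSketch {
    // Assumption: uplift is the throughput ratio After / Before (both in ops/ms).
    static double uplift(double beforeOpsPerMs, double afterOpsPerMs) {
        return afterOpsPerMs / beforeOpsPerMs;
    }

    public static void main(String[] args) {
        // byteUMaxReduction on the 128-bit SVE2 run: 460.81 -> 21998.62 ops/ms
        double u = uplift(460.81, 21998.62);
        System.out.println(String.format(Locale.ROOT, "%.2f", u)); // prints 47.74
    }
}
```

Since the score is a throughput (ops/ms), a larger ratio means a larger speedup.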
On an NVIDIA Grace machine with 128-bit NEON:
Benchmark | Unit | Before | Before Error | After | After Error | Uplift (After/Before)
-- | -- | -- | -- | -- | -- | --
byteUMaxReduction | ops/ms | 452.94 | 0.79 | 21597.12 | 76.16 | 47.68
byteUMaxReductionMasked | ops/ms | 654.80 | 65.55 | 21776.97 | 21.80 | 33.26
byteUMinReduction | ops/ms | 452.64 | 0.52 | 21682.22 | 3.50 | 47.90
byteUMinReductionMasked | ops/ms | 603.68 | 35.37 | 20726.86 | 24.27 | 34.33
intUMaxReduction | ops/ms | 260.04 | 1.12 | 10936.03 | 3.31 | 42.05
intUMaxReductionMasked | ops/ms | 243.53 | 1.96 | 10066.68 | 2.17 | 41.34
intUMinReduction | ops/ms | 256.95 | 1.90 | 10934.16 | 5.39 | 42.55
intUMinReductionMasked | ops/ms | 241.82 | 2.14 | 10316.15 | 5.80 | 42.66
longUMaxReduction | ops/ms | 132.05 | 0.34 | 3191.26 | 1.10 | 24.17
longUMaxReductionMasked | ops/ms | 124.59 | 1.01 | 3119.95 | 0.94 | 25.04
longUMinReduction | ops/ms | 131.50 | 0.32 | 3188.86 | 0.99 | 24.25
longUMinReductionMasked | ops/ms | 125.59 | 1.12 | 3118.61 | 0.84 | 24.83
shortUMaxReduction | ops/ms | 343.67 | 0.57 | 9584.07 | 6.83 | 27.89
shortUMaxReductionMasked | ops/ms | 401.15 | 25.37 | 9858.90 | 2.44 | 24.58
shortUMinReduction | ops/ms | 344.17 | 0.79 | 9944.45 | 2.61 | 28.89
shortUMinReductionMasked | ops/ms | 404.24 | 20.85 | 9887.54 | 9.25 | 24.46
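For readers less familiar with the operations being measured, here is a minimal scalar sketch of what an unsigned max/min reduction over `int` lanes computes. It is only an illustration of the semantics, not the benchmark code added in the PR, and the class and method names are hypothetical:

```java
// Hypothetical scalar reference for illustration; not the PR's benchmark code.
final class UnsignedReductionSketch {
    // Unsigned max: compare lanes as unsigned 32-bit values.
    static int umax(int[] lanes) {
        int result = 0; // 0 is the smallest unsigned value, so it is the identity for umax
        for (int v : lanes) {
            if (Integer.compareUnsigned(v, result) > 0) {
                result = v;
            }
        }
        return result;
    }

    // Unsigned min: same idea with the comparison reversed.
    static int umin(int[] lanes) {
        int result = -1; // 0xFFFFFFFF is the largest unsigned value, the identity for umin
        for (int v : lanes) {
            if (Integer.compareUnsigned(v, result) < 0) {
                result = v;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] lanes = {3, -1, 7}; // -1 is 0xFFFFFFFF, i.e. 4294967295 when read as unsigned
        System.out.println(Integer.toUnsignedString(umax(lanes))); // prints 4294967295
        System.out.println(Integer.toUnsignedString(umin(lanes))); // prints 3
    }
}
```

A signed max over the same input would return 7, which is why the unsigned variants need their own comparison.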
We see a similar performance uplift on AWS Graviton3 (Neoverse V1) and Graviton4 (Neoverse V2) machines. Would you mind taking another look? Thanks a lot!
-------------
PR Comment: https://git.openjdk.org/jdk/pull/28693#issuecomment-3844925369