RFR: 8373344: Add support for min/max reduction operations for Float16 [v2]
Yi Wu
duke at openjdk.org
Wed Jan 7 17:38:11 UTC 2026
On Tue, 6 Jan 2026 07:28:59 GMT, Xiaohong Gong <xgong at openjdk.org> wrote:
>> Yi Wu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
>>
>> - Replace assert with verify
>> - Add IRNode constant and code refactor
>> - Merge remote-tracking branch 'origin/master' into yiwu-8373344
>> - 8373344: Add support for FP16 min/max reduction operations
>>
>> This patch adds mid-end support for vectorized min/max reduction
>> operations for half floats. It also includes backend AArch64 support
>> for these operations.
>> Both floating point min/max reductions don’t require strict order,
>> because they are associative.
>>
>> It will generate NEON fminv/fmaxv reduction instructions when
>> max vector length is 8B or 16B. On SVE supporting machines
>> with vector lengths > 16B, it will generate the SVE fminv/fmaxv
>> instructions.
>> The patch also adds support for partial min/max reductions on
>> SVE machines using fminv/fmaxv.
>>
>> Ratio of throughput(ops/ms) > 1 indicates the performance with
>> this patch is better than the mainline.
>>
>> Neoverse N1 (UseSVE = 0, max vector length = 16B):
>> Benchmark         vectorDim  Mode   Cnt  8B    16B
>> ReductionMaxFP16  256        thrpt  9    3.69  6.44
>> ReductionMaxFP16  512        thrpt  9    3.71  7.62
>> ReductionMaxFP16  1024       thrpt  9    4.16  8.64
>> ReductionMaxFP16  2048       thrpt  9    4.44  9.12
>> ReductionMinFP16  256        thrpt  9    3.69  6.43
>> ReductionMinFP16  512        thrpt  9    3.70  7.62
>> ReductionMinFP16  1024       thrpt  9    4.16  8.64
>> ReductionMinFP16  2048       thrpt  9    4.44  9.10
>>
>> Neoverse V1 (UseSVE = 1, max vector length = 32B):
>> Benchmark         vectorDim  Mode   Cnt  8B    16B   32B
>> ReductionMaxFP16  256        thrpt  9    3.96  8.62  8.02
>> ReductionMaxFP16  512        thrpt  9    3.54  9.25  11.71
>> ReductionMaxFP16  1024       thrpt  9    3.77  8.71  14.07
>> ReductionMaxFP16  2048       thrpt  9    3.88  8.44  14.69
>> ReductionMinFP16  256        thrpt  9    3.96  8.61  8.03
>> ReductionMinFP16  512        thrpt  9    3.54  9.28  11.69
>> ReductionMinFP16  1024       thrpt  9    3.76  8.70  14.12
>> ReductionMinFP16  2048       thrpt  9    3.87  8.45  14.70
>>
>> Neoverse V2 (UseSVE = 2, max vector length = 16B)...
>
> src/hotspot/cpu/aarch64/aarch64_vector.ad line 381:
>
>> 379: case Op_XorReductionV:
>> 380: case Op_MinReductionVHF:
>> 381: case Op_MaxReductionVHF:
>
> We can use the NEON instructions if the vector size <= 16B as well for partial cases. Did you test the performance with NEON instead of using predicated SVE instructions?
You mean move it down, next to `Op_AddReductionVI` and `Op_AddReductionVL`, so that it uses `return !VM_Version::use_neon_for_vector(length_in_bytes);`?
It doesn't seem to make much of a difference.
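
To make the suggested change concrete, here is a minimal standalone sketch of the two dispatch variants. It is not the actual code in `aarch64_vector.ad`: the `Opcode` enum, `needs_partial_before`, `needs_partial_after` and the local `use_neon_for_vector` are hypothetical stand-ins, and the 16-byte threshold is an assumption taken from the review comment ("vector size <= 16B").

```cpp
// Hypothetical stand-ins for the HotSpot opcodes discussed in this thread.
enum class Opcode {
  AddReductionVI,
  AddReductionVL,
  XorReductionV,
  MinReductionVHF,
  MaxReductionVHF
};

// Assumption: NEON is preferred whenever the vector fits in 16 bytes.
bool use_neon_for_vector(int length_in_bytes) {
  return length_in_bytes <= 16;
}

// Variant reviewed above: FP16 min/max reductions sit with the cases that
// always request the partial (predicated SVE) path.
bool needs_partial_before(Opcode op, int length_in_bytes) {
  switch (op) {
    case Opcode::XorReductionV:
    case Opcode::MinReductionVHF:
    case Opcode::MaxReductionVHF:
      return true;
    case Opcode::AddReductionVI:
    case Opcode::AddReductionVL:
      return !use_neon_for_vector(length_in_bytes);
    default:
      return false;
  }
}

// Suggested variant: FP16 min/max reductions are moved down so that vectors
// of 16 bytes or less fall back to NEON instead of predicated SVE.
bool needs_partial_after(Opcode op, int length_in_bytes) {
  switch (op) {
    case Opcode::XorReductionV:
      return true;
    case Opcode::AddReductionVI:
    case Opcode::AddReductionVL:
    case Opcode::MinReductionVHF:
    case Opcode::MaxReductionVHF:
      return !use_neon_for_vector(length_in_bytes);
    default:
      return false;
  }
}
```

Under these assumptions the two variants only differ for 8B and 16B vectors; a 32B vector takes the partial path in both.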
Neoverse V1 (UseSVE = 1, max vector length = 32B)
Benchmark         vectorDim  Mode   Cnt  8B(old)  8B(new)  chg2/chg1  16B(old)  16B(new)  chg2/chg1  32B(old)  32B(new)  chg2/chg1
ReductionMaxFP16  256        thrpt  9    3.96     3.96     1.00       8.63      8.62      1.00       8.02      8.02      1.00
ReductionMaxFP16  512        thrpt  9    3.54     3.54     1.00       9.25      9.25      1.00       11.71     11.71     1.00
ReductionMaxFP16  1024       thrpt  9    3.77     3.77     1.00       8.70      8.71      1.00       14.12     14.07     1.00
ReductionMaxFP16  2048       thrpt  9    3.88     3.88     1.00       8.45      8.44      1.00       14.69     14.69     1.00
ReductionMinFP16  256        thrpt  9    3.96     3.96     1.00       8.62      8.61      1.00       8.02      8.03      1.00
ReductionMinFP16  512        thrpt  9    3.55     3.54     1.00       9.26      9.28      1.00       11.72     11.69     1.00
ReductionMinFP16  1024       thrpt  9    3.76     3.76     1.00       8.69      8.70      1.00       14.10     14.12     1.00
ReductionMinFP16  2048       thrpt  9    3.87     3.87     1.00       8.44      8.45      1.00       14.76     14.70     1.00

Neoverse V2 (UseSVE = 2, max vector length = 16B)
Benchmark         vectorDim  Mode   Cnt  8B(old)  8B(new)  chg2/chg1  16B(old)  16B(new)  chg2/chg1
ReductionMaxFP16  256        thrpt  9    4.77     4.78     1.00       10.00     10.00     1.00
ReductionMaxFP16  512        thrpt  9    3.75     3.74     1.00       11.32     11.33     1.00
ReductionMaxFP16  1024       thrpt  9    3.87     3.86     1.00       9.59      9.59      1.00
ReductionMaxFP16  2048       thrpt  9    3.94     3.94     1.00       8.72      8.71      1.00
ReductionMinFP16  256        thrpt  9    4.77     4.78     1.00       9.97      10.00     1.00
ReductionMinFP16  512        thrpt  9    3.77     3.74     0.99       11.35     11.29     0.99
ReductionMinFP16  1024       thrpt  9    3.86     3.86     1.00       9.56      9.58      1.00
ReductionMinFP16  2048       thrpt  9    3.94     3.94     1.00       8.71      8.71      1.00
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/28828#discussion_r2669419647