RFR: 8366444: Add support for add/mul reduction operations for Float16
Bhavana Kilambi
bkilambi at openjdk.org
Thu Dec 11 12:09:30 UTC 2025
On Thu, 2 Oct 2025 13:21:32 GMT, Marc Chevalier <mchevalier at openjdk.org> wrote:
>> This patch adds mid-end support for vectorized add/mul reduction operations for half floats. It also includes backend aarch64 support for these operations. Only vectorization support through autovectorization is added as VectorAPI currently does not support Float16 vector species.
>>
>> Autovectorized add and mul reductions must both be strictly ordered. Each reduction is implemented as follows on the different aarch64 targets:
>>
>> **For AddReduction :**
>> On Neon-only targets (UseSVE = 0): Generates scalarized additions using the scalar `fadd` instruction for both 8B and 16B vector lengths. This is because Neon does not provide a direct instruction for computing a strictly ordered floating-point add reduction.
>>
>> On SVE targets (UseSVE > 0): Generates the `fadda` instruction which computes add reduction for floating point in strict order.
>>
>> **For MulReduction :**
>> Neither Neon nor SVE provides a direct instruction for computing a strictly ordered floating-point multiply reduction. For vector lengths of 8B and 16B, a sequence of scalar `fmul` instructions is generated; multiply reduction for vector lengths > 16B is not supported.
>>
>> Below is the performance of the two newly added microbenchmarks in `Float16OperationsBenchmark.java` tested on three different aarch64 machines and with varying `MaxVectorSize` -
>>
>> Note: On all machines, the score (ops/ms) is compared with the master branch without this patch which generates a sequence of loads (`ldrsh`) to load the FP16 value into an FPR and a scalar `fadd/fmul` to add/multiply the loaded value to the running sum/product. The ratios given below are the ratios between the throughput with this patch and the throughput without this patch.
>> Ratio > 1 indicates the performance with this patch is better than the master branch.
>>
>> **N1 (UseSVE = 0, max vector length = 16B):**
>>
>> Benchmark         vectorDim  Mode   Cnt  8B    16B
>> ReductionAddFP16  256        thrpt  9    1.41  1.40
>> ReductionAddFP16  512        thrpt  9    1.41  1.41
>> ReductionAddFP16  1024       thrpt  9    1.43  1.40
>> ReductionAddFP16  2048       thrpt  9    1.43  1.40
>> ReductionMulFP16  256        thrpt  9    1.22  1.22
>> ReductionMulFP16  512        thrpt  9    1.21  1.23
>> ReductionMulFP16  1024       thrpt  9    1.21  1.22
>> ReductionMulFP16  2048       thrpt  9    1.20  1.22
>>
>>
>> On N1, the scalarized sequence of `fadd/fmul` are gener...
>
> I see now the flags are not trivial:
>
> -XX:+UnlockDiagnosticVMOptions -XX:-TieredCompilation -XX:+StressArrayCopyMacroNode -XX:+StressLCM -XX:+StressGCM -XX:+StressIGVN -XX:+StressCCP -XX:+StressMacroExpansion -XX:+StressMethodHandleLinkerInlining -XX:+StressCompiledExceptionHandlers -XX:VerifyConstraintCasts=1 -XX:+StressLoopPeeling
>
> That is a lot of stress flags. It's likely that many runs will be needed to reproduce.
>
> The machine is a VM.Standard.A1.Flex shape, as described in
> https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm
>
> Backtrace at the failure:
>
> Current CompileTask:
> C2:1523 346 % b compiler.vectorization.TestFloat16VectorOperations::vectorAddReductionFloat16 @ 4 (39 bytes)
>
> Stack: [0x0000ffff84799000,0x0000ffff84997000], sp=0x0000ffff849920d0, free space=2020k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> V [libjvm.so+0x7da724] C2_MacroAssembler::neon_reduce_add_fp16(FloatRegister, FloatRegister, FloatRegister, unsigned int, FloatRegister)+0x2b4 (c2_MacroAssembler_aarch64.cpp:1930)
> V [libjvm.so+0x154492c] PhaseOutput::scratch_emit_size(Node const*)+0x2ec (output.cpp:3171)
> V [libjvm.so+0x153d4a4] PhaseOutput::shorten_branches(unsigned int*)+0x2e4 (output.cpp:528)
> V [libjvm.so+0x154dcdc] PhaseOutput::Output()+0x95c (output.cpp:328)
> V [libjvm.so+0x9be070] Compile::Code_Gen()+0x7f0 (compile.cpp:3127)
> V [libjvm.so+0x9c21c0] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1774 (compile.cpp:894)
> V [libjvm.so+0x7eec64] C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x2e0 (c2compiler.cpp:147)
> V [libjvm.so+0x9d0f8c] CompileBroker::invoke_compiler_on_method(CompileTask*)+0xb08 (compileBroker.cpp:2345)
> V [libjvm.so+0x9d1eb8] CompileBroker::compiler_thread_loop()+0x638 (compileBroker.cpp:1989)
> V [libjvm.so+0xed25a8] JavaThread::thread_main_inner()+0x108 (javaThread.cpp:775)
> V [libjvm.so+0x18466dc] Thread::call_run()+0xac (thread.cpp:243)
> V [libjvm.so+0x152349c] thread_native_entry(Thread*)+0x12c (os_linux.cpp:895)
> C [libc.so.6+0x80b50] start_thread+0x300
>
>
> I've attached the replay file in the JBS issue, if it can help.
Hi @marc-chevalier, apologies for the delay in responding to your review comments.
I have been looking into the JTREG test failures you reported for my patch. They do not appear to be caused by the patch itself: I can reproduce the same error on the master branch for the other tests in `compiler/vectorization/TestFloat16VectorOperations.java`, on both AArch64 and x86_64 machines. From a quick look, in some of the failing tests autovectorization does happen, but the IR rule still fails because it expects vector nodes of a specific shape. Adding `IRNode.VECTOR_SIZE_ANY` resolved those failures, but some other tests still fail because autovectorization does not happen in them. I think this should be looked at separately, since these failures also exist on the master branch and are not caused by this patch.
Would you suggest I create a separate ticket for this task?
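
As background on the strict-ordering requirement in the quoted description, here is a minimal sketch of why a floating-point reduction cannot simply be reassociated. It is illustrative only and is not code from the patch: it uses `float` as a stand-in for Float16, and the class and method names (`StrictOrderReduction`, `orderedSum`, `pairwiseSum`) are hypothetical.

```java
// Illustrative sketch, not part of the patch. Uses float as a stand-in
// for Float16; all names here are hypothetical.
final class StrictOrderReduction {

    // Strictly ordered reduction: (((0 + a[0]) + a[1]) + a[2]) + ...
    // This is the order a scalar Java loop specifies.
    static float orderedSum(float[] a) {
        float acc = 0.0f;
        for (float v : a) {
            acc += v;
        }
        return acc;
    }

    // Pairwise (tree) reduction over a[lo, hi): a reassociated order that
    // a non-strict vector reduction could use.
    static float pairwiseSum(float[] a, int lo, int hi) {
        if (hi - lo == 1) {
            return a[lo];
        }
        int mid = (lo + hi) >>> 1;
        return pairwiseSum(a, lo, mid) + pairwiseSum(a, mid, hi);
    }

    public static void main(String[] args) {
        // 1e8f has a spacing of 8 between adjacent floats, so adding 1.0f
        // to it (or to -1e8f) is lost to rounding.
        float[] a = {1e8f, 1.0f, -1e8f, 1.0f};
        System.out.println("ordered  = " + orderedSum(a));
        System.out.println("pairwise = " + pairwiseSum(a, 0, a.length));
    }
}
```

The two orders give different results for the same input, which is why the vectorized reduction has to preserve the scalar evaluation order.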
-------------
PR Comment: https://git.openjdk.org/jdk/pull/27526#issuecomment-3641187476
More information about the core-libs-dev
mailing list