[lworld+fp16] RFR: 8308363: Initial compiler support for FP16 scalar operations. [v7]
Sandhya Viswanathan
sviswanathan at openjdk.org
Tue Sep 12 23:03:08 UTC 2023
On Mon, 11 Sep 2023 08:43:07 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Starting with the 4th Generation Xeon, Intel has extensively extended its ISA to support 16-bit scalar and vector floating-point operations based on the IEEE 754 binary16 format.
>>
>> We plan to support this in multiple stages, spanning the Java-side definition of the Float16 type, scalar operations, and finally SLP vectorization support.
>>
>> This patch adds minimal Java and Compiler side support for one API Float16.add.
>>
>> **Summary of changes:**
>> - Minimal implementation of Float16 primitive class supporting one operation (Float16.add)
>> - X86 AVX512-FP16 feature detection at VM startup.
>> - C2 IR and Inline expander changes for Float16.add API.
>> - FP16 constant folding handling.
>> - Backend support: instruction selection patterns and assembler support.
>> - New IR framework and functional tests.
>>
>> **Implementation details:**
>>
>> 1/ The newly defined Float16 class encapsulates a short value holding an IEEE 754 binary16 encoded value.
>>
>> 2/ Float16 is a primitive class which in future will be aligned with other enhanced primitive wrapper classes proposed by [JEP-402.](https://openjdk.org/jeps/402)
>>
>> 3/ Float16 will support all the operations supported by the corresponding Float class.
>>
>> 4/ The Java implementation of each API will internally perform the floating-point operation at FP32 granularity.
>>
>> 5/ An API which can be directly mapped to an Intel AVX512-FP16 instruction will be a candidate for intrinsification by the C2 compiler.
>>
>> 6/ With Valhalla, the C2 compiler always creates an InlineType IR node for a value class instance.
>> The total number of inputs of an InlineType node matches the number of non-static fields. In this case the node will have one input of short type (TypeInt::SHORT).
>>
>> 7/ Since all the scalar AVX512-FP16 instructions operate on floating-point registers while the Float16 backing storage is held in a general-purpose register, we need to introduce appropriate conversion IR nodes that move a 16-bit value from a GPR to an XMM register and vice versa.
>> 
>>
>> 8/ The current plan is to introduce a new IR node for each operation as a subclass of its corresponding single-precision IR node. This will allow leveraging the idealization routines (Ideal/Identity/Value) of the parent operation.
>>
>> 9/ All the single/double precision IR nodes carry a Type::FLOAT/DOUBLE ideal type. This represents the entire FP32/64 value range and is different from integral types, which expli...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> Auto-vectorizer support for Float16.sum operation.
Very good work in general. I only have a couple of comments in the code; please take a look. There might also be a further optimization opportunity on the path from Op_ConvF2HF to ReinterpretS2HF: ConvF2HF performs the conversion from XMM to XMM register and then moves the result from the XMM register to a GPR, after which ReinterpretS2HF moves it from the GPR back to an XMM register. This unnecessary XMM->GPR and GPR->XMM movement could be optimized out.
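As an aside, point 4 of the quoted description (each API performs the operation at FP32 granularity) can be sketched in plain Java. This is a minimal illustration, not the actual Valhalla Float16 class; it assumes the JDK 20+ `Float.float16ToFloat`/`Float.floatToFloat16` conversions, and the class name is hypothetical:

```java
// Sketch of an FP16 add performed at FP32 granularity: widen both binary16
// operands to float, add, then round the result back to binary16.
public class Float16AddSketch {
    // 'a' and 'b' are IEEE 754 binary16 encodings held in shorts.
    static short add(short a, short b) {
        float sum = Float.float16ToFloat(a) + Float.float16ToFloat(b);
        return Float.floatToFloat16(sum); // rounds back to binary16
    }

    public static void main(String[] args) {
        short one = Float.floatToFloat16(1.0f); // 0x3C00
        short two = Float.floatToFloat16(2.0f); // 0x4000
        System.out.println(Float.float16ToFloat(add(one, two))); // prints 3.0
    }
}
```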
make/common/JavaCompilation.gmk line 277:
> 275:
> 276: $1_FLAGS += -g -Xlint:all $$($1_TARGET_RELEASE) $$(PARANOIA_FLAGS) $$(JAVA_WARNINGS_ARE_ERRORS)
> 277: $1_FLAGS += $$($1_JAVAC_FLAGS) -XDenablePrimitiveClasses
Do we need this change now that we have special handling in the VM for Float16?
src/hotspot/cpu/x86/assembler_x86.cpp line 7332:
> 7330: void Assembler::evaddph(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) {
> 7331: assert(VM_Version::supports_avx512_fp16(), "requires AVX512-FP16");
> 7332: InstructionAttr attributes(vector_len, false, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ false);
uses_vl should be true here. I am also wondering if we could create a generic method like fp16varithop() containing most of the common boilerplate code and call it from the individual instructions like add, sub, etc. We could create this generic method either now or when we add further fp16 instructions.
src/hotspot/cpu/x86/x86.ad line 10180:
> 10178: ins_encode %{
> 10179: __ vmovw($dst$$Register, $src$$XMMRegister);
> 10180: __ movswl($dst$$Register, $dst$$Register);
Could we do without movswl here?
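For context on the movswl question: it sign-extends the 16-bit value to 32 bits in the GPR, which only makes a difference when the binary16 sign bit is set. A minimal Java illustration of sign- versus zero-extension of such a pattern (class name hypothetical):

```java
// Java's short-to-int widening sign-extends, analogous to movswl;
// masking with 0xFFFF zero-extends, analogous to movzwl.
public class WideningSketch {
    public static void main(String[] args) {
        short negTwoFp16 = (short) 0xC000;      // binary16 encoding of -2.0 (sign bit set)
        int signExtended = negTwoFp16;          // like movswl: 0xFFFFC000 == -16384
        int zeroExtended = negTwoFp16 & 0xFFFF; // like movzwl: 0x0000C000 == 49152
        System.out.println(signExtended); // prints -16384
        System.out.println(zeroExtended); // prints 49152
    }
}
```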
src/hotspot/share/classfile/vmIntrinsics.hpp line 201:
> 199: /* Float16 intrinsics, similar to what we have in Math. */ \
> 200: do_intrinsic(_sum_float16, java_lang_Float16, sum_name, floa16_float16_signature, F_S) \
> 201: do_name(sum_name, "sum") \
could this be called add instead of sum?
-------------
PR Review: https://git.openjdk.org/valhalla/pull/848#pullrequestreview-1622919981
PR Review Comment: https://git.openjdk.org/valhalla/pull/848#discussion_r1323594103
PR Review Comment: https://git.openjdk.org/valhalla/pull/848#discussion_r1323436092
PR Review Comment: https://git.openjdk.org/valhalla/pull/848#discussion_r1323681084
PR Review Comment: https://git.openjdk.org/valhalla/pull/848#discussion_r1323589401