RFR: 8308966 Add intrinsic for float/double modulo for x86 AVX2 and AVX512
Sandhya Viswanathan
sviswanathan at openjdk.org
Tue May 30 22:48:06 UTC 2023
On Tue, 30 May 2023 17:07:01 GMT, Scott Gibbons <sgibbons at openjdk.org> wrote:
> Add an intrinsic for x86 AVX and AVX512 fmod. This addresses both a performance regression and acceleration of the floating point remainder operation (fmod / frem). Also addresses dmod / drem.
>
> Performance has increased an average of ~4x as indicated by the benchmark included with [JDK-8302191](https://bugs.openjdk.org/browse/JDK-8302191).
>
> Old:
> gcc-12.2.1-4.fc36.x86_64
> 3db352d003c5996a5f86f0f465adf86326f7e1fe openjdk21 + fix
> JVM version: 21-internal
> Iteration 0 regression case Took : 89 noMod case took: 39 noPower case took: 68
> Iteration 1 regression case Took : 86 noMod case took: 39 noPower case took: 67
> Iteration 2 regression case Took : 41 noMod case took: 39 noPower case took: 70
> Iteration 3 regression case Took : 41 noMod case took: 39 noPower case took: 69
> Iteration 4 regression case Took : 40 noMod case took: 39 noPower case took: 44
> Iteration 5 regression case Took : 47 noMod case took: 39 noPower case took: 40
> Iteration 6 regression case Took : 41 noMod case took: 39 noPower case took: 40
> Iteration 7 regression case Took : 40 noMod case took: 39 noPower case took: 40
> Iteration 8 regression case Took : 41 noMod case took: 38 noPower case took: 41
> Iteration 9 regression case Took : 40 noMod case took: 39 noPower case took: 40
> New:
> JVM version: 21-internal (float)
> Iteration 0 regression case Took : 24 noMod case took: 11 noPower case took: 42
> Iteration 1 regression case Took : 35 noMod case took: 22 noPower case took: 27
> Iteration 2 regression case Took : 17 noMod case took: 19 noPower case took: 17
> Iteration 3 regression case Took : 17 noMod case took: 3 noPower case took: 16
> Iteration 4 regression case Took : 17 noMod case took: 3 noPower case took: 17
> Iteration 5 regression case Took : 16 noMod case took: 3 noPower case took: 17
> Iteration 6 regression case Took : 16 noMod case took: 3 noPower case took: 17
> Iteration 7 regression case Took : 17 noMod case took: 3 noPower case took: 16
> Iteration 8 regression case Took : 17 noMod case took: 3 noPower case took: 16
> Iteration 9 regression case Took : 17 noMod case took: 3 noPower case took: 17
src/hotspot/cpu/x86/assembler_x86.cpp line 3559:
> 3557:
> 3558: void Assembler::vmovsd(XMMRegister dst, XMMRegister src, XMMRegister src2) {
> 3559: assert(VM_Version::supports_evex(), "");
This instruction is also supported on AVX platforms, so an assert on UseAVX > 0 would be enough, e.g.:
assert(UseAVX > 0, "requires some form of AVX");
src/hotspot/cpu/x86/assembler_x86.cpp line 3562:
> 3560: InstructionMark im(this);
> 3561: InstructionAttr attributes(AVX_128bit, /* rex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ false);
> 3562: attributes.set_is_evex_instruction();
set_is_evex_instruction() is not needed, as this instruction is also supported on AVX platforms.
src/hotspot/cpu/x86/assembler_x86.cpp line 8922:
> 8920: }
> 8921:
> 8922: void Assembler::evextractps(Register dst, XMMRegister src, uint8_t imm8) {
This instruction is supported across SSE4.1, AVX, and AVX512, so we could have a single extractps() flavor that works across all of them, along the same lines as pextrd.
src/hotspot/cpu/x86/assembler_x86.cpp line 8926:
> 8924: assert(imm8 <= 0x03, "imm8: %u", imm8);
> 8925: InstructionAttr attributes(AVX_128bit, /* vex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ true);
> 8926: attributes.set_is_evex_instruction();
The assert should check for SSE4.1 and above accordingly.
uses_vl should be false.
set_is_evex_instruction() is not needed.
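To make the operand constraints concrete, here is a small software model of what extractps does: it copies the raw 32-bit lane selected by imm8 into a general-purpose register. This is only a standalone C++ illustration under stated assumptions; `extractps_model` and the `std::array` stand-in for an XMM register are hypothetical names and not HotSpot code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical model of EXTRACTPS r32, xmm, imm8 (not HotSpot code).
// The XMM register is modeled as four 32-bit float lanes.
uint32_t extractps_model(const std::array<float, 4>& src, uint8_t imm8) {
  // Mirrors the assert in the patch: only lanes 0..3 exist.
  assert(imm8 <= 0x03);
  uint32_t bits;
  // Copy the raw bit pattern of the selected lane; no float-to-int
  // conversion takes place.
  std::memcpy(&bits, &src[imm8], sizeof(bits));
  return bits;
}
```

The operation is the same whichever encoding (SSE4.1, VEX, or EVEX) is emitted, which is why a single assembler entry point could plausibly cover all three.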
src/hotspot/cpu/x86/assembler_x86.cpp line 9613:
> 9611: assert(VM_Version::supports_evex(), "");
> 9612: InstructionAttr attributes(rmode, /* vex_w */ true, /* legacy_mode */ false, /* no_mask_reg */ true, /* uses_vl */ false);
> 9613: attributes.set_extended_context();
This also needs a call to set_is_evex_instruction().
src/hotspot/cpu/x86/sharedRuntime_x86.cpp line 90:
> 88: JRT_LEAF(jfloat, SharedRuntime::frem(jfloat x, jfloat y))
> 89: jfloat retval;
> 90: if (UseAVX < 1 || !UseFMA) {
SharedRuntime::frem is called on both 32-bit and 64-bit platforms.
The new stubs are 64-bit only, so the check should be something like:
if (!is_LP64 || UseAVX < 1 || !UseFMA)
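The suggested guard can be sketched in isolation as follows. This is a hedged, standalone C++ illustration and not the HotSpot implementation: `RuntimeFlags`, `use_fmod_stub`, and `frem_model` are hypothetical names, the three flags are plain fields rather than VM globals, and the stub path is a placeholder that also calls `std::fmod`.

```cpp
#include <cmath>

// Hypothetical stand-ins for the VM state consulted by the guard.
struct RuntimeFlags {
  bool is_lp64;  // true on a 64-bit VM
  int  use_avx;  // models the UseAVX level (0 = no AVX)
  bool use_fma;  // models UseFMA
};

// True only when the new 64-bit stub could be used. This is the
// negation of (!is_LP64 || UseAVX < 1 || !UseFMA).
bool use_fmod_stub(const RuntimeFlags& f) {
  return f.is_lp64 && f.use_avx >= 1 && f.use_fma;
}

// Model of the dispatch between the fallback and the stub.
float frem_model(const RuntimeFlags& f, float x, float y) {
  if (!use_fmod_stub(f)) {
    // Fallback path: truncating remainder, sign follows the dividend.
    return std::fmod(x, y);
  }
  // Placeholder (assumption): the real code would call the generated
  // stub here; std::fmod is used so the sketch stays runnable.
  return std::fmod(x, y);
}
```

With this shape, a 32-bit VM takes the fallback even when AVX and FMA are both available, which is the case the original condition missed.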
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/14224#discussion_r1210653972
PR Review Comment: https://git.openjdk.org/jdk/pull/14224#discussion_r1210654690
PR Review Comment: https://git.openjdk.org/jdk/pull/14224#discussion_r1210791275
PR Review Comment: https://git.openjdk.org/jdk/pull/14224#discussion_r1210893269
PR Review Comment: https://git.openjdk.org/jdk/pull/14224#discussion_r1210795670
PR Review Comment: https://git.openjdk.org/jdk/pull/14224#discussion_r1210891003