RFR: 8323116: [REDO] Computational test more than 2x slower when AVX instructions are used [v4]
Vladimir Kozlov
kvn at openjdk.org
Wed Apr 3 23:31:11 UTC 2024
On Thu, 28 Mar 2024 00:45:33 GMT, Srinivas Vamsi Parasa <duke at openjdk.org> wrote:
>> The goal of this small PR is to improve the performance of convert instructions and address the slowdown when AVX>0 is used.
>>
>> The performance data using the ComputePI.java benchmark (part of this PR) is as follows:
>>
>>
>> Benchmark (ns/op) | Stock JDK | This PR (AVX=3) | Speedup
>> -- | -- | -- | --
>> ComputePI.compute_pi_dbl_flt | 511.34 | 510.989 | 1.0
>> ComputePI.compute_pi_flt_dbl | 2024.06 | 518.695 | 3.9
>> ComputePI.compute_pi_int_dbl | 695.482 | 453.054 | 1.5
>> ComputePI.compute_pi_int_flt | 799.268 | 449.83 | 1.8
>> ComputePI.compute_pi_long_dbl | 802.992 | 454.891 | 1.8
>> ComputePI.compute_pi_long_flt | 628.62 | 463.617 | 1.4
>>
>>
>>
>> Benchmark (ns/op) | Stock JDK | This PR (AVX=0) | Speedup
>> -- | -- | -- | --
>> ComputePI.compute_pi_dbl_flt | 473.778 | 472.529 | 1.0
>> ComputePI.compute_pi_flt_dbl | 536.004 | 538.418 | 1.0
>> ComputePI.compute_pi_int_dbl | 458.08 | 460.245 | 1.0
>> ComputePI.compute_pi_int_flt | 477.305 | 476.975 | 1.0
>> ComputePI.compute_pi_long_dbl | 455.132 | 455.064 | 1.0
>> ComputePI.compute_pi_long_flt | 474.734 | 476.571 | 1.0
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
> fix L2F cvtsi2ssq
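
For context on what the quoted tables measure: each benchmark name pairs a source and a destination type (for example `int_dbl`), so the hot loop is dominated by a scalar type conversion. The sketch below is only an illustration of such a loop, assuming a Leibniz-series kernel. It is not the source of ComputePI.java, and the class and method names are hypothetical.

```java
public class ConvertLoopSketch {
    // Leibniz series: pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...).
    // Every iteration converts an int denominator to double, which is the
    // kind of int-to-double conversion the JIT has to emit in the loop body.
    public static double piIntToDouble(int terms) {
        double sum = 0.0;
        int sign = 1;
        for (int k = 0; k < terms; k++) {
            int denom = 2 * k + 1;            // stays in an integer register
            sum += sign / (double) denom;     // int -> double conversion here
            sign = -sign;
        }
        return 4.0 * sum;
    }

    public static void main(String[] args) {
        double pi = piIntToDouble(1_000_000);
        // The truncation error of the series is on the order of 1/terms.
        System.out.println(Math.abs(pi - Math.PI) < 1e-5);
    }
}
```

Because the conversion sits on the loop-carried path, any extra cost attached to the convert instruction shows up directly in ns/op.
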
The following tests failed when run with the `-XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting` flags:
compiler/intrinsics/zip/TestFpRegsABI.java
compiler/loopopts/superword/TestCmpInvar.java
# Internal Error (/workspace/open/src/hotspot/cpu/x86/assembler_x86.cpp:11719), pid=955891, tid=955918
# assert(((!attributes->uses_vl()) || (attributes->get_vector_len() == AVX_512bit) || (!_legacy_mode_vl) || (attributes->is_legacy_mode()))) failed: XMM register should be 0-15
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 23-internal-2024-04-03-2139260.vladimir.kozlov.jdkgit2, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x633784] Assembler::vex_prefix_and_encode(int, int, int, Assembler::VexSimdPrefix, Assembler::VexOpcode, InstructionAttr*) [clone .constprop.1]+0x284
#
Current CompileTask:
C2:237 45 % b compiler.intrinsics.zip.TestFpRegsABI$TestIntrinsic::calcValue @ 6 (661 bytes)
Stack: [0x00007f03e044b000,0x00007f03e054b000], sp=0x00007f03e0546830, free space=1006k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x633784] Assembler::vex_prefix_and_encode(int, int, int, Assembler::VexSimdPrefix, Assembler::VexOpcode, InstructionAttr*) [clone .constprop.1]+0x284 (assembler_x86.cpp:11719)
V [libjvm.so+0x65e21e] Assembler::pxor(XMMRegister, XMMRegister)+0x5e (assembler_x86.cpp:8258)
V [libjvm.so+0x3a5885] convI2D_reg_regNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x135 (x86_64.ad:10097)
V [libjvm.so+0x14d4386] PhaseOutput::scratch_emit_size(Node const*)+0x376 (output.cpp:3366)
V [libjvm.so+0x14ccaca] PhaseOutput::shorten_branches(unsigned int*)+0x34a (output.cpp:544)
V [libjvm.so+0x14de41a] PhaseOutput::Output()+0xa1a (output.cpp:345)
V [libjvm.so+0x9ec52c] Compile::Code_Gen()+0x4ac (compile.cpp:3031)
V [libjvm.so+0x9ef0a6] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1c36 (compile.cpp:894)
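
Read as a predicate, the failed assertion accepts an encoding if any one of four conditions holds. The sketch below restates that condition in Java so the failing combination is easy to see. It is an illustration only: the class and method names are made up, the constant values are placeholders, and the inputs used in `main` are an assumption about what the `pxor` call passed (a 128-bit operation that uses vector length, on a configuration where `_legacy_mode_vl` is set and the instruction is not marked legacy), not values read from the crash log.

```java
public class VexAssertSketch {
    // Placeholder encodings for the vector lengths; only equality matters here.
    public static final int AVX_128bit = 0;
    public static final int AVX_256bit = 1;
    public static final int AVX_512bit = 2;

    // Mirrors the boolean structure of the assert quoted in the log:
    // !uses_vl || vector_len == AVX_512bit || !_legacy_mode_vl || is_legacy_mode
    public static boolean xmmEncodingAllowed(boolean usesVl, int vectorLen,
                                             boolean legacyModeVl, boolean isLegacyMode) {
        return !usesVl
            || vectorLen == AVX_512bit
            || !legacyModeVl
            || isLegacyMode;
    }

    public static void main(String[] args) {
        // Assumed failing case: all four disjuncts are false.
        System.out.println(xmmEncodingAllowed(true, AVX_128bit, true, false));  // false
        // Marking the instruction as legacy mode satisfies the check.
        System.out.println(xmmEncodingAllowed(true, AVX_128bit, true, true));   // true
        // So does a configuration where _legacy_mode_vl is not set.
        System.out.println(xmmEncodingAllowed(true, AVX_128bit, false, false)); // true
    }
}
```

Under these assumptions, the check only fails for a sub-512-bit instruction that uses vector length, is not flagged as legacy, and is emitted while `_legacy_mode_vl` is set.
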
-------------
PR Comment: https://git.openjdk.org/jdk/pull/18503#issuecomment-2035803789