RFR: 8323116: [REDO] Computational test more than 2x slower when AVX instructions are used [v4]

Wed Apr 3 23:31:11 UTC 2024

On Thu, 28 Mar 2024 00:45:33 GMT, Srinivas Vamsi Parasa <duke at openjdk.org> wrote:

>> The goal of this small PR is to improve the performance of convert instructions and address the slowdown when AVX>0 is used.
>> 
>> The performance data using the ComputePI.java benchmark (part of this PR) is as follows:
>> 
>> 
>> Benchmark (ns/op) | Stock JDK | This PR (AVX=3) | Speedup
>> -- | -- | -- | --
>> ComputePI.compute_pi_dbl_flt | 511.34 | 510.989 | 1.0
>> ComputePI.compute_pi_flt_dbl | 2024.06 | 518.695 | 3.9
>> ComputePI.compute_pi_int_dbl | 695.482 | 453.054 | 1.5
>> ComputePI.compute_pi_int_flt | 799.268 | 449.83 | 1.8
>> ComputePI.compute_pi_long_dbl | 802.992 | 454.891 | 1.8
>> ComputePI.compute_pi_long_flt | 628.62 | 463.617 | 1.4
>> 
>> 
>> 
>> Benchmark (ns/op) | Stock JDK | This PR (AVX=0) | Speedup
>> -- | -- | -- | --
>> ComputePI.compute_pi_dbl_flt | 473.778 | 472.529 | 1.0
>> ComputePI.compute_pi_flt_dbl | 536.004 | 538.418 | 1.0
>> ComputePI.compute_pi_int_dbl | 458.08 | 460.245 | 1.0
>> ComputePI.compute_pi_int_flt | 477.305 | 476.975 | 1.0
>> ComputePI.compute_pi_long_dbl | 455.132 | 455.064 | 1.0
>> ComputePI.compute_pi_long_flt | 474.734 | 476.571 | 1.0
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
> 
>   fix L2F cvtsi2ssq
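
For readers without the PR at hand, here is a minimal sketch of the kind of conversion-heavy loop the benchmark names suggest (`int_dbl`, `flt_dbl`, and so on). It is an assumption, not the actual ComputePI.java from the PR: the class name, method names, and the choice of the Leibniz series are hypothetical. Each iteration performs one int-to-double or float-to-double conversion, which is the operation the quoted description says the PR targets.

```java
// Hypothetical sketch; NOT the ComputePI.java benchmark from the PR.
public class ConvertKernelSketch {

    // Leibniz series for pi with an int -> double conversion per iteration.
    static double piIntToDouble(int terms) {
        double pi = 0.0;
        double sign = 1.0;
        for (int i = 0; i < terms; i++) {
            int denom = 2 * i + 1;
            pi += sign * 4.0 / (double) denom; // int -> double conversion
            sign = -sign;
        }
        return pi;
    }

    // Same series with a float -> double conversion per iteration.
    // The float denominator stays exact because it never exceeds 2^24.
    static double piFloatToDouble(int terms) {
        double pi = 0.0;
        double sign = 1.0;
        float denom = 1.0f;
        for (int i = 0; i < terms; i++) {
            pi += sign * 4.0 / (double) denom; // float -> double conversion
            denom += 2.0f;
            sign = -sign;
        }
        return pi;
    }

    public static void main(String[] args) {
        System.out.println(piIntToDouble(1_000_000));
        System.out.println(piFloatToDouble(1_000_000));
    }
}
```

Timing loops like these under JMH with `-XX:UseAVX=0` and `-XX:UseAVX=3` is one way to look for the kind of gap reported in the tables above, though the numbers will depend on the hardware and the JDK build.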

The following tests failed when run with the `-XX:UseAVX=3 -XX:+UnlockDiagnosticVMOptions -XX:+UseKNLSetting` flags:
compiler/intrinsics/zip/TestFpRegsABI.java
compiler/loopopts/superword/TestCmpInvar.java

#  Internal Error (/workspace/open/src/hotspot/cpu/x86/assembler_x86.cpp:11719), pid=955891, tid=955918
#  assert(((!attributes->uses_vl()) || (attributes->get_vector_len() == AVX_512bit) || (!_legacy_mode_vl) || (attributes->is_legacy_mode()))) failed: XMM register should be 0-15
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 23-internal-2024-04-03-2139260.vladimir.kozlov.jdkgit2, mixed mode, sharing, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x633784]  Assembler::vex_prefix_and_encode(int, int, int, Assembler::VexSimdPrefix, Assembler::VexOpcode, InstructionAttr*) [clone .constprop.1]+0x284
#
Current CompileTask:
C2:237   45 %  b        compiler.intrinsics.zip.TestFpRegsABI$TestIntrinsic::calcValue @ 6 (661 bytes)

Stack: [0x00007f03e044b000,0x00007f03e054b000],  sp=0x00007f03e0546830,  free space=1006k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x633784]  Assembler::vex_prefix_and_encode(int, int, int, Assembler::VexSimdPrefix, Assembler::VexOpcode, InstructionAttr*) [clone .constprop.1]+0x284  (assembler_x86.cpp:11719)
V  [libjvm.so+0x65e21e]  Assembler::pxor(XMMRegister, XMMRegister)+0x5e  (assembler_x86.cpp:8258)
V  [libjvm.so+0x3a5885]  convI2D_reg_regNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x135  (x86_64.ad:10097)
V  [libjvm.so+0x14d4386]  PhaseOutput::scratch_emit_size(Node const*)+0x376  (output.cpp:3366)
V  [libjvm.so+0x14ccaca]  PhaseOutput::shorten_branches(unsigned int*)+0x34a  (output.cpp:544)
V  [libjvm.so+0x14de41a]  PhaseOutput::Output()+0xa1a  (output.cpp:345)
V  [libjvm.so+0x9ec52c]  Compile::Code_Gen()+0x4ac  (compile.cpp:3031)
V  [libjvm.so+0x9ef0a6]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1c36  (compile.cpp:894)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/18503#issuecomment-2035803789