RFR: 8318562: Computational test more than 2x slower when AVX instructions are used

Vladimir Kozlov kvn at openjdk.org
Fri Nov 17 01:22:33 UTC 2023


On Thu, 16 Nov 2023 23:46:53 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:

> This PR fixes the perf regression seen on AVX for floating point conversions.
> 
> In AVX, the cvt instructions take three operands: cvtxx dst, src1, src2, where src2 is the value being converted. The dst register gets the converted value in its lower bits and its upper bits (up to 128) from src1.
> 
> The C2 JIT uses the cvtxx dst, dst, src2 form, so the problem was caused by the uninitialized upper bits of the dst XMM register.
> Doing an xor dst, dst before the conversion instruction fixes the perf regression.
> 
> Perf before the patch on UseAVX=3 platform:
> Benchmark                      Mode  Cnt     Score    Error  Units
> ComputePI.compute_pi_dbl_flt   avgt    5   471.875 ±  0.400  ns/op
> ComputePI.compute_pi_flt_dbl   avgt    5  1877.174 ±  0.557  ns/op
> ComputePI.compute_pi_int_dbl   avgt    5   655.222 ± 28.082  ns/op
> ComputePI.compute_pi_int_flt   avgt    5   737.178 ±  0.077  ns/op
> ComputePI.compute_pi_long_dbl  avgt    5   767.364 ±  0.027  ns/op
> ComputePI.compute_pi_long_flt  avgt    5   587.854 ± 10.068  ns/op
> 
> Perf after the patch on UseAVX=3 platform:
> Benchmark                      Mode  Cnt    Score   Error  Units
> ComputePI.compute_pi_dbl_flt   avgt    5  468.328 ± 0.141  ns/op
> ComputePI.compute_pi_flt_dbl   avgt    5  435.430 ± 0.259  ns/op
> ComputePI.compute_pi_int_dbl   avgt    5  424.088 ± 0.050  ns/op
> ComputePI.compute_pi_int_flt   avgt    5  417.345 ± 0.207  ns/op
> ComputePI.compute_pi_long_dbl  avgt    5  425.751 ± 0.006  ns/op
> ComputePI.compute_pi_long_flt  avgt    5  430.199 ± 0.736  ns/op
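
The merge behaviour described in the quoted message can be illustrated with a small Java model. This is only a sketch of the data flow, not HotSpot code: the `Xmm` class and the `cvtIntToDouble` and `zeroed` helpers are hypothetical names made up for this illustration, and the 128-bit register is modelled as two 64-bit halves. It shows that the result of the three-operand form carries over whatever was in the upper bits of src1, and that clearing dst first makes the result independent of the old contents.

```java
public class CvtMergeModel {
    /** Hypothetical model of a 128-bit XMM register as two 64-bit halves. */
    static final class Xmm {
        final long low;   // bits 0..63
        final long high;  // bits 64..127
        Xmm(long low, long high) { this.low = low; this.high = high; }
    }

    /**
     * Models "cvtxx dst, src1, src2" for an int-to-double conversion:
     * the low half is the converted src2, the high half is copied from src1.
     */
    static Xmm cvtIntToDouble(Xmm src1, int src2) {
        long converted = Double.doubleToRawLongBits((double) src2);
        return new Xmm(converted, src1.high);
    }

    /** Models "xor dst, dst": every bit of the register becomes zero. */
    static Xmm zeroed() {
        return new Xmm(0L, 0L);
    }

    public static void main(String[] args) {
        // dst holds leftover bits from some earlier computation.
        Xmm stale = new Xmm(0x1111L, 0xDEADBEEFL);

        // "cvtxx dst, dst, src2": the upper half of the old dst survives.
        Xmm merged = cvtIntToDouble(stale, 42);

        // "xor dst, dst" followed by the conversion: upper half is zero.
        Xmm clean = cvtIntToDouble(zeroed(), 42);

        System.out.println(Double.longBitsToDouble(merged.low)); // 42.0
        System.out.println(Long.toHexString(merged.high));       // deadbeef
        System.out.println(Long.toHexString(clean.high));        // 0
    }
}
```

The model captures only which bits end up in the result; it does not model timing, so it cannot reproduce the benchmark numbers quoted above.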

@sviswa7 thank you for finding the cause! I will test it locally.

-------------

PR Review: https://git.openjdk.org/jdk/pull/16701#pullrequestreview-1735830320

