RFR: 8318562: Computational test more than 2x slower when AVX instructions are used

Fri Nov 17 20:07:30 UTC 2023

On Fri, 17 Nov 2023 02:11:29 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:

>> This PR fixes the perf regression seen on AVX for floating point conversions.
>> 
>> In AVX, the cvt instructions take three operands: cvtxx dst, src1, src2, where src2 is the operand being converted. dst receives the converted value in its lower bits and the upper bits (up to 128) from src1.
>> 
>> The C2 JIT uses the cvtxx dst, dst, src2 flavor. The problem was caused by the uninitialized upper bits of the dst XMM register.
>> Emitting an xor dst, dst before the conversion instruction fixes the perf regression.
>> 
>> Perf before the patch on UseAVX=3 platform:
>> Benchmark                      Mode  Cnt     Score    Error  Units
>> ComputePI.compute_pi_dbl_flt   avgt    5   471.875 ±  0.400  ns/op
>> ComputePI.compute_pi_flt_dbl   avgt    5  1877.174 ±  0.557  ns/op
>> ComputePI.compute_pi_int_dbl   avgt    5   655.222 ± 28.082  ns/op
>> ComputePI.compute_pi_int_flt   avgt    5   737.178 ±  0.077  ns/op
>> ComputePI.compute_pi_long_dbl  avgt    5   767.364 ±  0.027  ns/op
>> ComputePI.compute_pi_long_flt  avgt    5   587.854 ± 10.068  ns/op
>> 
>> Perf after the patch on UseAVX=3 platform:
>> Benchmark                      Mode  Cnt    Score   Error  Units
>> ComputePI.compute_pi_dbl_flt   avgt    5  468.328 ± 0.141  ns/op
>> ComputePI.compute_pi_flt_dbl   avgt    5  435.430 ± 0.259  ns/op
>> ComputePI.compute_pi_int_dbl   avgt    5  424.088 ± 0.050  ns/op
>> ComputePI.compute_pi_int_flt   avgt    5  417.345 ± 0.207  ns/op
>> ComputePI.compute_pi_long_dbl  avgt    5  425.751 ± 0.006  ns/op
>> ComputePI.compute_pi_long_flt  avgt    5  430.199 ± 0.736  ns/op
>
> I confirmed that this change solved the performance issue on the machines I tested (old Broadwell and Cascade Lake CPUs).
> I am submitting our regular testing for approval.

Thanks a lot for the reviews @vnkozlov @jatin-bhateja @merykitty.
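
For anyone reading this thread later, the operand behavior described in the quoted PR text can be sketched as a small data-flow model. This is only an illustration, not the C2 change itself: the class CvtMergeSketch, the Xmm type and the helper names are hypothetical, a 128-bit register is modeled as two 64-bit lanes, and the model says nothing about timing. It only shows that with the dst, dst, src2 form the result still carries whatever was in the upper bits of dst, while zeroing dst first removes that carry-over.

```java
public class CvtMergeSketch {
    /** Hypothetical model of a 128-bit XMM register as two 64-bit lanes. */
    static final class Xmm {
        final long lo;  // bits 0..63
        final long hi;  // bits 64..127
        Xmm(long lo, long hi) { this.lo = lo; this.hi = hi; }
    }

    /**
     * Models the three-operand form "cvtxx dst, src1, src2" for an
     * int-to-double conversion: the low lane is the converted src2,
     * the high lane is copied unchanged from src1.
     */
    static Xmm cvtIntToDouble(Xmm src1, int src2) {
        return new Xmm(Double.doubleToRawLongBits((double) src2), src1.hi);
    }

    /** Models "xor dst, dst": every bit of the register becomes zero. */
    static Xmm zero() {
        return new Xmm(0L, 0L);
    }

    public static void main(String[] args) {
        // dst holds leftover bits from some earlier, unrelated computation.
        Xmm dst = new Xmm(0x1111L, 0xDEADBEEFL);

        // "cvtxx dst, dst, src2": the old upper bits flow into the result.
        Xmm merged = cvtIntToDouble(dst, 7);

        // "xor dst, dst" followed by "cvtxx dst, dst, src2": no old bits remain.
        Xmm clean = cvtIntToDouble(zero(), 7);

        System.out.println(Double.longBitsToDouble(merged.lo)
                + " hi=" + Long.toHexString(merged.hi));
        System.out.println(Double.longBitsToDouble(clean.lo)
                + " hi=" + Long.toHexString(clean.hi));
    }
}
```

Both calls produce the same converted value in the low lane; they differ only in whether the previous contents of dst feed into the result.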

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16701#issuecomment-1817025958