RFR: 8318562: Computational test more than 2x slower when AVX instructions are used
Sandhya Viswanathan
sviswanathan at openjdk.org
Fri Nov 17 20:07:30 UTC 2023
On Fri, 17 Nov 2023 02:11:29 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:
>> This PR fixes the perf regression seen on AVX for floating point conversions.
>>
>> In AVX, the cvt instructions take three operands: cvtxx dst, src1, src2, where src2 is the value being converted. The dst receives the converted value in its lower bits and its upper bits (up to 128) from src1.
>>
>> The C2 JIT uses the cvtxx dst, dst, src2 form. The problem was caused by the uninitialized upper bits of the dst XMM register.
>> Zeroing dst with xor dst, dst before the conversion instruction fixes the perf regression.
>>
>> Perf before the patch on UseAVX=3 platform:
>> Benchmark                      Mode  Cnt     Score    Error  Units
>> ComputePI.compute_pi_dbl_flt   avgt    5   471.875 ±  0.400  ns/op
>> ComputePI.compute_pi_flt_dbl   avgt    5  1877.174 ±  0.557  ns/op
>> ComputePI.compute_pi_int_dbl   avgt    5   655.222 ± 28.082  ns/op
>> ComputePI.compute_pi_int_flt   avgt    5   737.178 ±  0.077  ns/op
>> ComputePI.compute_pi_long_dbl  avgt    5   767.364 ±  0.027  ns/op
>> ComputePI.compute_pi_long_flt  avgt    5   587.854 ± 10.068  ns/op
>>
>> Perf after the patch on UseAVX=3 platform:
>> Benchmark                      Mode  Cnt    Score   Error  Units
>> ComputePI.compute_pi_dbl_flt   avgt    5  468.328 ± 0.141  ns/op
>> ComputePI.compute_pi_flt_dbl   avgt    5  435.430 ± 0.259  ns/op
>> ComputePI.compute_pi_int_dbl   avgt    5  424.088 ± 0.050  ns/op
>> ComputePI.compute_pi_int_flt   avgt    5  417.345 ± 0.207  ns/op
>> ComputePI.compute_pi_long_dbl  avgt    5  425.751 ± 0.006  ns/op
>> ComputePI.compute_pi_long_flt  avgt    5  430.199 ± 0.736  ns/op
>
> I confirmed that this change solved the performance issue on the machines I tested (old Broadwell and Cascade Lake CPUs).
> I am submitting our regular testing for approval.
Thanks a lot for the reviews @vnkozlov @jatin-bhateja @merykitty.
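
For anyone reading the thread later, the merge behavior described in the PR summary can be sketched with a small model. This is only an illustration, not HotSpot code: the class `CvtMergeSketch` and its methods are hypothetical names, a 128-bit XMM register is modeled as two 64-bit lanes, and the model says nothing about timing. It only shows that with the dst, dst, src2 form the result carries bits from the old dst, while zeroing dst first removes them.

```java
// Illustrative model only. All names here are hypothetical, not HotSpot code.
// A 128-bit XMM register is modeled as long[2]: index 0 = low lane, 1 = high lane.
final class CvtMergeSketch {

    /**
     * Models the three-operand form "cvtsi2sd dst, src1, src2":
     * the low lane receives the converted value, the high lane is copied from src1.
     */
    static long[] cvtsi2sd(long[] src1, int src2) {
        long low = Double.doubleToRawLongBits((double) src2);
        return new long[] { low, src1[1] };
    }

    /** Models "xor dst, dst": every bit of the register becomes zero. */
    static long[] zero() {
        return new long[] { 0L, 0L };
    }

    public static void main(String[] args) {
        // dst still holds whatever an earlier computation left behind.
        long[] dst = { 0x1111L, 0xDEADBEEFL };

        // "cvtxx dst, dst, src2": the high lane of the result comes from the old dst.
        long[] merged = cvtsi2sd(dst, 42);

        // "xor dst, dst" first, then the same conversion: the high lane is zero.
        long[] clean = cvtsi2sd(zero(), 42);

        System.out.println(Double.longBitsToDouble(merged[0])); // converted value
        System.out.println(Long.toHexString(merged[1]));        // leftover bits from old dst
        System.out.println(Long.toHexString(clean[1]));         // no leftover bits
    }
}
```

In both cases the low lane holds the same converted value; only the high lane differs, which is the part of dst the PR summary refers to as uninitialized.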
-------------
PR Comment: https://git.openjdk.org/jdk/pull/16701#issuecomment-1817025958