RFR(S) JDK-8205528: Base64 Encode Algorithm using AVX512 Instructions
Kamath, Smita
smita.kamath at intel.com
Fri Jun 22 22:19:08 UTC 2018
Hi Florian,
Thanks a lot for your input. Yes, the AVX512 loop will run at a lower frequency. We still see a 1.5x performance gain because more operations complete per clock.
I cannot use AVX256 because the algorithm needs some instructions that are only supported in AVX512, for example vpsrlvw and vpsllvw.
Also, halving the vector length may not show any gain. We are currently seeing a 50% gain with AVX512.
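For readers unfamiliar with vpsrlvw and vpsllvw: they shift each 16-bit lane of a vector by its own count, taken from the matching lane of a second vector. The following is only a scalar sketch of those per-lane semantics, not the intrinsic from the patch; the class and method names are hypothetical, and the sample lane values are made up for illustration.

```java
import java.util.Arrays;

public class VariableShiftSketch {
    // Scalar model of a per-lane logical right shift on 16-bit lanes
    // (the semantics of vpsrlvw). Counts above 15 produce zero.
    static short[] shiftRightLogicalPerLane(short[] lanes, short[] counts) {
        short[] out = new short[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            int value = lanes[i] & 0xFFFF;   // treat the lane as unsigned
            int count = counts[i] & 0xFFFF;
            out[i] = (count > 15) ? 0 : (short) (value >>> count);
        }
        return out;
    }

    // Scalar model of a per-lane logical left shift on 16-bit lanes
    // (the semantics of vpsllvw). Counts above 15 produce zero.
    static short[] shiftLeftLogicalPerLane(short[] lanes, short[] counts) {
        short[] out = new short[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            int value = lanes[i] & 0xFFFF;
            int count = counts[i] & 0xFFFF;
            out[i] = (count > 15) ? 0 : (short) (value << count);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical lanes, each holding a 6-bit field at a different offset.
        short[] lanes = {(short) 0xFC00, (short) 0x03F0, (short) 0x003F, (short) 0x8001};
        short[] counts = {10, 4, 0, 16};
        // Each lane is shifted by its own count; the last count is out of range.
        System.out.println(Arrays.toString(shiftRightLogicalPerLane(lanes, counts)));
        // prints [63, 63, 63, 0]
    }
}
```

Base64 extracts 6-bit fields that sit at different bit offsets, which is presumably why a different shift count per lane is convenient here.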
Regards,
Smita
-----Original Message-----
From: Florian Weimer [mailto:fweimer at redhat.com]
Sent: Friday, June 22, 2018 2:23 PM
To: Kamath, Smita <smita.kamath at intel.com>
Cc: Vladimir Kozlov <vladimir.kozlov at oracle.com>; hotspot compiler <hotspot-compiler-dev at openjdk.java.net>
Subject: Re: RFR(S) JDK-8205528: Base64 Encode Algorithm using AVX512 Instructions
On 06/22/2018 10:15 PM, Florian Weimer wrote:
> * Smita Kamath:
>
>> I'd like to contribute an optimization for Base64 Encoding Algorithm
>> using AVX512 Instructions. This optimization shows 1.5x improvement
>> on x86_64 platform (SKL).
>
> Does this code require a turbo license (or whatever the thing is
> called that causes other cores to clock down)?
I found a machine and ran a silly benchmark that calls Encode::encode(byte[]) in a loop. Before the change, I get this:
1.102951702 409,517,502 core_power_lvl1_turbo_license
1.102951702 0 core_power_lvl2_turbo_license
1.102951702 0 core_power_throttle
1.102951702 5,789,506,258 cycles
2.154409863 0 core_power_lvl1_turbo_license
2.154409863 0 core_power_lvl2_turbo_license
2.154409863 0 core_power_throttle
2.154409863 5,578,099,821 cycles
3.205880145 0 core_power_lvl1_turbo_license
3.205880145 0 core_power_lvl2_turbo_license
3.205880145 0 core_power_throttle
3.205880145 4,704,036,297 cycles
4.257820031 0 core_power_lvl1_turbo_license
4.257820031 0 core_power_lvl2_turbo_license
4.257820031 0 core_power_throttle
4.257820031 4,297,183,302 cycles
5.308664009 0 core_power_lvl1_turbo_license
5.308664009 0 core_power_lvl2_turbo_license
5.308664009 0 core_power_throttle
5.308664009 4,272,656,488 cycles
6.360519693 0 core_power_lvl1_turbo_license
6.360519693 0 core_power_lvl2_turbo_license
6.360519693 0 core_power_throttle
6.360519693 4,271,119,933 cycles
7.411707353 0 core_power_lvl1_turbo_license
7.411707353 0 core_power_lvl2_turbo_license
7.411707353 0 core_power_throttle
7.411707353 4,258,814,898 cycles
8.462806875 0 core_power_lvl1_turbo_license
8.462806875 0 core_power_lvl2_turbo_license
8.462806875 0 core_power_throttle
8.462806875 4,273,534,600 cycles
9.513850481 0 core_power_lvl1_turbo_license
9.513850481 0 core_power_lvl2_turbo_license
9.513850481 0 core_power_throttle
9.513850481 4,300,081,431 cycles
10.565774495 0 core_power_lvl1_turbo_license
10.565774495 0 core_power_lvl2_turbo_license
10.565774495 0 core_power_throttle
10.565774495 4,392,364,553 cycles
and after:
1.101046948 2,304,232,482 core_power_lvl1_turbo_license
1.101046948 0 core_power_lvl2_turbo_license
1.101046948 147,688 core_power_throttle
1.101046948 4,577,482,611 cycles
2.151755765 7,278,927,100 core_power_lvl1_turbo_license
2.151755765 0 core_power_lvl2_turbo_license
2.151755765 42,228 core_power_throttle
2.151755765 4,120,536,502 cycles
3.201901416 7,208,954,425 core_power_lvl1_turbo_license
3.201901416 0 core_power_lvl2_turbo_license
3.201901416 67,576 core_power_throttle
3.201901416 5,418,392,188 cycles
4.252669983 7,285,847,565 core_power_lvl1_turbo_license
4.252669983 0 core_power_lvl2_turbo_license
4.252669983 41,600 core_power_throttle
4.252669983 5,199,576,369 cycles
5.304219300 7,277,640,225 core_power_lvl1_turbo_license
5.304219300 0 core_power_lvl2_turbo_license
5.304219300 45,834 core_power_throttle
5.304219300 4,145,273,167 cycles
6.352663275 7,292,924,536 core_power_lvl1_turbo_license
6.352663275 0 core_power_lvl2_turbo_license
6.352663275 44,310 core_power_throttle
6.352663275 10,615,605,184 cycles
7.403349636 7,243,993,590 core_power_lvl1_turbo_license
7.403349636 0 core_power_lvl2_turbo_license
7.403349636 84,554 core_power_throttle
7.403349636 4,135,245,407 cycles
8.453630335 7,275,471,168 core_power_lvl1_turbo_license
8.453630335 0 core_power_lvl2_turbo_license
8.453630335 43,434 core_power_throttle
8.453630335 5,548,353,295 cycles
So the AVX-512 instructions used appear to be low-current ones: only the level 1 license counter rises, while the level 2 counter stays at zero. Still, there is some impact, and in glibc we tend to avoid those instructions because of the overall system impact (we've been burnt by this before).
Smita, is it possible to use low-current AVX-256 instructions instead for your optimization?
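For anyone who wants to repeat the measurement, a benchmark of the kind described above might look roughly like the sketch below. It assumes the method in question is java.util.Base64.Encoder.encode(byte[]); the class name, input size, and iteration count are arbitrary choices for illustration, not the values used for the numbers in this mail.

```java
import java.util.Base64;
import java.util.Random;

public class Base64EncodeLoop {
    // Encodes the same input repeatedly and returns the total number of
    // encoded bytes, so the JIT cannot discard the work as dead code.
    static long encodeRepeatedly(byte[] input, int iterations) {
        Base64.Encoder encoder = Base64.getEncoder();
        long totalEncodedBytes = 0;
        for (int i = 0; i < iterations; i++) {
            totalEncodedBytes += encoder.encode(input).length;
        }
        return totalEncodedBytes;
    }

    public static void main(String[] args) {
        // Assumed input size; large enough that a vectorized loop dominates.
        byte[] input = new byte[64 * 1024];
        new Random(42).nextBytes(input);
        // Assumed default iteration count; override on the command line.
        int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 100_000;
        System.out.println(encodeRepeatedly(input, iterations));
    }
}
```

Running it once on a JDK without the patch and once with it, while sampling the core_power counters, should give a before/after comparison like the one above.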
Thanks,
Florian