RFR(S) JDK-8205528: Base64 Encode Algorithm using AVX512 Instructions

Kamath, Smita smita.kamath at intel.com
Fri Jun 22 22:19:08 UTC 2018


Hi Florian,

Thanks a lot for your input. Yes, the AVX512 loop will run at a lower frequency. We still see a 1.5x performance gain because more operations complete per clock.
I cannot use AVX256 because the algorithm needs some instructions that are only supported in AVX512, for example, vpsrlvw and vpsllvw.
Also, halving the vector length may not show any gain. We currently see a 50% gain with AVX512.
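
For readers unfamiliar with vpsrlvw and vpsllvw: they shift each 16-bit lane of a vector by its own count, taken from the matching lane of a second vector. AVX2 provides per-lane variable shifts only for 32-bit and 64-bit lanes. The following is a scalar Java model of that behaviour, not code from the patch; the class and method names are hypothetical and chosen only for illustration.

```java
public class VariableShiftSketch {
    // Scalar model of vpsrlvw: logical right shift of each 16-bit lane by its own count.
    static short[] srlvw(short[] lanes, short[] counts) {
        short[] result = new short[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            int count = counts[i] & 0xFFFF;
            // A count above 15 clears the lane.
            result[i] = (short) (count > 15 ? 0 : (lanes[i] & 0xFFFF) >>> count);
        }
        return result;
    }

    // Scalar model of vpsllvw: left shift of each 16-bit lane by its own count.
    static short[] sllvw(short[] lanes, short[] counts) {
        short[] result = new short[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            int count = counts[i] & 0xFFFF;
            // Bits shifted past bit 15 are dropped by the narrowing cast.
            result[i] = (short) (count > 15 ? 0 : (lanes[i] & 0xFFFF) << count);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two lanes holding a 6-bit field at different bit positions.
        short[] lanes = {(short) 0xFC00, (short) 0x03F0};
        short[] aligned = srlvw(lanes, new short[] {10, 4});
        System.out.println(aligned[0] + " " + aligned[1]);
    }
}
```

Base64 encoding extracts 6-bit fields that sit at different offsets within each group of input bytes, which is where a different shift count per lane is convenient.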

Regards,
Smita

-----Original Message-----
From: Florian Weimer [mailto:fweimer at redhat.com] 
Sent: Friday, June 22, 2018 2:23 PM
To: Kamath, Smita <smita.kamath at intel.com>
Cc: Vladimir Kozlov <vladimir.kozlov at oracle.com>; hotspot compiler <hotspot-compiler-dev at openjdk.java.net>
Subject: Re: RFR(S) JDK-8205528: Base64 Encode Algorithm using AVX512 Instructions

On 06/22/2018 10:15 PM, Florian Weimer wrote:
> * Smita Kamath:
> 
>> I'd like to contribute an optimization for Base64 Encoding Algorithm
>> using AVX512 Instructions. This optimization shows 1.5x improvement
>> on x86_64 platform (SKL).
> 
> Does this code require a turbo license (or whatever the thing is
> called that causes other cores to clock down)?

I found a machine and a silly benchmark calling Encoder::encode(byte[]) in a loop, and I get this before the change:

      1.102951702        409,517,502      core_power_lvl1_turbo_license
      1.102951702                  0      core_power_lvl2_turbo_license
      1.102951702                  0      core_power_throttle
      1.102951702      5,789,506,258      cycles
      2.154409863                  0      core_power_lvl1_turbo_license
      2.154409863                  0      core_power_lvl2_turbo_license
      2.154409863                  0      core_power_throttle
      2.154409863      5,578,099,821      cycles
      3.205880145                  0      core_power_lvl1_turbo_license
      3.205880145                  0      core_power_lvl2_turbo_license
      3.205880145                  0      core_power_throttle
      3.205880145      4,704,036,297      cycles
      4.257820031                  0      core_power_lvl1_turbo_license
      4.257820031                  0      core_power_lvl2_turbo_license
      4.257820031                  0      core_power_throttle
      4.257820031      4,297,183,302      cycles
      5.308664009                  0      core_power_lvl1_turbo_license
      5.308664009                  0      core_power_lvl2_turbo_license
      5.308664009                  0      core_power_throttle
      5.308664009      4,272,656,488      cycles
      6.360519693                  0      core_power_lvl1_turbo_license
      6.360519693                  0      core_power_lvl2_turbo_license
      6.360519693                  0      core_power_throttle
      6.360519693      4,271,119,933      cycles
      7.411707353                  0      core_power_lvl1_turbo_license
      7.411707353                  0      core_power_lvl2_turbo_license
      7.411707353                  0      core_power_throttle
      7.411707353      4,258,814,898      cycles
      8.462806875                  0      core_power_lvl1_turbo_license
      8.462806875                  0      core_power_lvl2_turbo_license
      8.462806875                  0      core_power_throttle
      8.462806875      4,273,534,600      cycles
      9.513850481                  0      core_power_lvl1_turbo_license
      9.513850481                  0      core_power_lvl2_turbo_license
      9.513850481                  0      core_power_throttle
      9.513850481      4,300,081,431      cycles
     10.565774495                  0      core_power_lvl1_turbo_license
     10.565774495                  0      core_power_lvl2_turbo_license
     10.565774495                  0      core_power_throttle
     10.565774495      4,392,364,553      cycles

and after:

      1.101046948      2,304,232,482      core_power_lvl1_turbo_license
      1.101046948                  0      core_power_lvl2_turbo_license
      1.101046948            147,688      core_power_throttle
      1.101046948      4,577,482,611      cycles
      2.151755765      7,278,927,100      core_power_lvl1_turbo_license
      2.151755765                  0      core_power_lvl2_turbo_license
      2.151755765             42,228      core_power_throttle
      2.151755765      4,120,536,502      cycles
      3.201901416      7,208,954,425      core_power_lvl1_turbo_license
      3.201901416                  0      core_power_lvl2_turbo_license
      3.201901416             67,576      core_power_throttle
      3.201901416      5,418,392,188      cycles
      4.252669983      7,285,847,565      core_power_lvl1_turbo_license
      4.252669983                  0      core_power_lvl2_turbo_license
      4.252669983             41,600      core_power_throttle
      4.252669983      5,199,576,369      cycles
      5.304219300      7,277,640,225      core_power_lvl1_turbo_license
      5.304219300                  0      core_power_lvl2_turbo_license
      5.304219300             45,834      core_power_throttle
      5.304219300      4,145,273,167      cycles
      6.352663275      7,292,924,536      core_power_lvl1_turbo_license
      6.352663275                  0      core_power_lvl2_turbo_license
      6.352663275             44,310      core_power_throttle
      6.352663275     10,615,605,184      cycles
      7.403349636      7,243,993,590      core_power_lvl1_turbo_license
      7.403349636                  0      core_power_lvl2_turbo_license
      7.403349636             84,554      core_power_throttle
      7.403349636      4,135,245,407      cycles
      8.453630335      7,275,471,168      core_power_lvl1_turbo_license
      8.453630335                  0      core_power_lvl2_turbo_license
      8.453630335             43,434      core_power_throttle
      8.453630335      5,548,353,295      cycles

So the AVX-512 instructions used appear to be low-current ones.  Still, there is some impact, and for glibc, we tend to avoid using those instructions due to the overall system impact (we've been burnt by this before).

Smita, is it possible to use low-current AVX-256 instructions instead for your optimization?
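
For anyone who wants to repeat the measurement, a benchmark of the kind mentioned above (calling the Base64 encoder on a byte[] in a loop) could look roughly like the sketch below. This is not the program used for the numbers in this message; the class name, buffer size, and iteration count are assumptions picked for illustration.

```java
import java.util.Base64;
import java.util.Random;

public class Base64EncodeLoop {
    // Encodes the same buffer repeatedly and folds one output byte per iteration
    // into a checksum so the JIT cannot discard the encoding work.
    static long run(byte[] input, int iterations) {
        Base64.Encoder encoder = Base64.getEncoder();
        long checksum = 0;
        for (int i = 0; i < iterations; i++) {
            byte[] out = encoder.encode(input);
            checksum += out[i % out.length];
        }
        return checksum;
    }

    public static void main(String[] args) {
        byte[] input = new byte[64 * 1024];   // assumed buffer size
        new Random(42).nextBytes(input);      // fixed seed for repeatable input
        long start = System.nanoTime();
        long checksum = run(input, 100_000);  // assumed iteration count
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("checksum=" + checksum + " elapsed_ms=" + elapsedMs);
    }
}
```

Running such a loop long enough to reach steady state while sampling the core_power_* counters once per second should give output comparable in shape to the tables above.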

Thanks,
Florian

