RFR(S) JDK-8205528: Base64 Encode Algorithm using AVX512 Instructions

Florian Weimer fweimer at redhat.com
Fri Jun 22 21:22:51 UTC 2018


On 06/22/2018 10:15 PM, Florian Weimer wrote:
> * Smita Kamath:
> 
>> I'd like to contribute an optimization for the Base64 encoding
>> algorithm using AVX512 instructions. This optimization shows a 1.5x
>> improvement on the x86_64 platform (SKL).
> 
> Does this code require a turbo license (or whatever the thing is
> called that causes other cores to clock down)?

I found a machine and wrote a silly benchmark calling 
Base64.Encoder::encode(byte[]) in a loop. Before the patch, I get this:

      1.102951702        409,517,502      core_power_lvl1_turbo_license
      1.102951702                  0      core_power_lvl2_turbo_license
      1.102951702                  0      core_power_throttle
      1.102951702      5,789,506,258      cycles
      2.154409863                  0      core_power_lvl1_turbo_license
      2.154409863                  0      core_power_lvl2_turbo_license
      2.154409863                  0      core_power_throttle
      2.154409863      5,578,099,821      cycles
      3.205880145                  0      core_power_lvl1_turbo_license
      3.205880145                  0      core_power_lvl2_turbo_license
      3.205880145                  0      core_power_throttle
      3.205880145      4,704,036,297      cycles
      4.257820031                  0      core_power_lvl1_turbo_license
      4.257820031                  0      core_power_lvl2_turbo_license
      4.257820031                  0      core_power_throttle
      4.257820031      4,297,183,302      cycles
      5.308664009                  0      core_power_lvl1_turbo_license
      5.308664009                  0      core_power_lvl2_turbo_license
      5.308664009                  0      core_power_throttle
      5.308664009      4,272,656,488      cycles
      6.360519693                  0      core_power_lvl1_turbo_license
      6.360519693                  0      core_power_lvl2_turbo_license
      6.360519693                  0      core_power_throttle
      6.360519693      4,271,119,933      cycles
      7.411707353                  0      core_power_lvl1_turbo_license
      7.411707353                  0      core_power_lvl2_turbo_license
      7.411707353                  0      core_power_throttle
      7.411707353      4,258,814,898      cycles
      8.462806875                  0      core_power_lvl1_turbo_license
      8.462806875                  0      core_power_lvl2_turbo_license
      8.462806875                  0      core_power_throttle
      8.462806875      4,273,534,600      cycles
      9.513850481                  0      core_power_lvl1_turbo_license
      9.513850481                  0      core_power_lvl2_turbo_license
      9.513850481                  0      core_power_throttle
      9.513850481      4,300,081,431      cycles
     10.565774495                  0      core_power_lvl1_turbo_license
     10.565774495                  0      core_power_lvl2_turbo_license
     10.565774495                  0      core_power_throttle
     10.565774495      4,392,364,553      cycles

and after the patch:

      1.101046948      2,304,232,482      core_power_lvl1_turbo_license
      1.101046948                  0      core_power_lvl2_turbo_license
      1.101046948            147,688      core_power_throttle
      1.101046948      4,577,482,611      cycles
      2.151755765      7,278,927,100      core_power_lvl1_turbo_license
      2.151755765                  0      core_power_lvl2_turbo_license
      2.151755765             42,228      core_power_throttle
      2.151755765      4,120,536,502      cycles
      3.201901416      7,208,954,425      core_power_lvl1_turbo_license
      3.201901416                  0      core_power_lvl2_turbo_license
      3.201901416             67,576      core_power_throttle
      3.201901416      5,418,392,188      cycles
      4.252669983      7,285,847,565      core_power_lvl1_turbo_license
      4.252669983                  0      core_power_lvl2_turbo_license
      4.252669983             41,600      core_power_throttle
      4.252669983      5,199,576,369      cycles
      5.304219300      7,277,640,225      core_power_lvl1_turbo_license
      5.304219300                  0      core_power_lvl2_turbo_license
      5.304219300             45,834      core_power_throttle
      5.304219300      4,145,273,167      cycles
      6.352663275      7,292,924,536      core_power_lvl1_turbo_license
      6.352663275                  0      core_power_lvl2_turbo_license
      6.352663275             44,310      core_power_throttle
      6.352663275     10,615,605,184      cycles
      7.403349636      7,243,993,590      core_power_lvl1_turbo_license
      7.403349636                  0      core_power_lvl2_turbo_license
      7.403349636             84,554      core_power_throttle
      7.403349636      4,135,245,407      cycles
      8.453630335      7,275,471,168      core_power_lvl1_turbo_license
      8.453630335                  0      core_power_lvl2_turbo_license
      8.453630335             43,434      core_power_throttle
      8.453630335      5,548,353,295      cycles

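So the AVX-512 instructions used appear to be low-current ones: only 
the level 1 turbo license is requested, and the level 2 counter stays 
at zero.  Still, there is some impact (the throttle counter is no 
longer zero), and for glibc we tend to avoid using those instructions 
because of the overall system impact (we've been burnt by this before).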
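
To compare the two runs as totals rather than interval by interval, the 
counter lines above can be summed per event.  This is only a sketch: the 
class and method names are made up for illustration, and it assumes each 
line has exactly the three whitespace-separated fields shown above 
(timestamp, count with thousands separators, event name).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PerfIntervalSum {
    // Sum the count column per event name over all interval lines.
    // Lines that do not have exactly three fields, or whose count is
    // not a number, are skipped.
    static Map<String, Long> sumByEvent(String output) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 3) {
                continue;
            }
            long count;
            try {
                // Strip the thousands separators before parsing.
                count = Long.parseLong(fields[1].replace(",", ""));
            } catch (NumberFormatException e) {
                continue;
            }
            totals.merge(fields[2], count, Long::sum);
        }
        return totals;
    }
}
```

Feeding the "before" and "after" listings through sumByEvent separately 
gives one total per counter for each run.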

Smita, is it possible to use low-current 256-bit AVX (AVX2) 
instructions instead for your optimization?
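
For anyone who wants to repeat the measurement: the benchmark itself was 
not posted, so the following is only a minimal sketch of such an 
encode(byte[]) loop.  The class name, buffer size and iteration count are 
assumptions, and it assumes the method in question is 
java.util.Base64.Encoder.encode(byte[]).

```java
import java.util.Base64;
import java.util.Random;

public class Base64EncodeLoop {
    // Encode the same buffer repeatedly.  The total encoded length is
    // returned so that the result of each call is actually used.
    static long run(byte[] input, int iterations) {
        Base64.Encoder encoder = Base64.getEncoder();
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += encoder.encode(input).length;
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed input: 64 KiB of pseudo-random bytes with a fixed seed.
        byte[] input = new byte[64 * 1024];
        new Random(42).nextBytes(input);
        // Assumed default iteration count; override on the command line.
        int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 100_000;
        System.out.println(run(input, iterations));
    }
}
```

Run it long enough to cover several sampling intervals while the 
counters are being collected.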

Thanks,
Florian


More information about the hotspot-compiler-dev mailing list