RFR: 8286823: Default to UseAVX=2 on all Skylake/Cascade Lake CPUs

Wed May 18 18:39:36 UTC 2022

On Wed, 18 May 2022 13:08:54 GMT, olivergillespie <duke at openjdk.java.net> wrote:

>> The current code already does this for 'older' Skylake processors,
>> namely those with _stepping < 5. My testing indicates this is a
>> problem for later processors in this family too, so I have removed the
>> max stepping condition.
>> 
>> The original exclusion was added in https://bugs.openjdk.java.net/browse/JDK-8221092.
>> 
>> A general description of the overall issue is given at
>> https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking.
>> 
>> According to https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake#CPUID,
>> stepping values 5..7 indicate Cascade Lake. I have tested on a CPU with stepping=7,
>> and I see CPU frequency reduction from 3.1GHz down to 2.7GHz (~13%) when using
>> -XX:UseAVX=3, along with a corresponding performance reduction.
>> 
>> I first saw this issue in a real production workload, where the main AVX3 instructions
>> being executed were those generated for various flavours of disjoint_arraycopy.
>> 
>> I can reproduce a similar effect using SPECjvm2008's xml.transform benchmark.
>> 
>> 
>> java --add-opens=java.xml/com.sun.org.apache.xerces.internal.parsers=ALL-UNNAMED \
>> --add-opens=java.xml/com.sun.org.apache.xerces.internal.util=ALL-UNNAMED \
>> -jar SPECjvm2008.jar -ikv -ict xml.transform
>> 
>> 
>> Before the change, or with -XX:UseAVX=3:
>> 
>> 
>> Valid run!
>> Score on xml.transform: 776.00 ops/m
>> 
>> 
>> After the change, or with -XX:UseAVX=2:
>> 
>> 
>> Valid run!
>> Score on xml.transform: 894.07 ops/m
>> 
>> 
>> So, a 15% improvement in this benchmark. It's possible some benchmarks will be negatively
>> affected by this change, but I contend that this is still the right move given the stark
>> difference in this benchmark combined with the fact that use of AVX3 instructions can
>> affect *all* processes/code on the host due to the downclocking, and the fact that this
>> effect is very hard to root-cause: for example, CPU profiles look very similar before and
>> after, since all code is equally slowed.
>
> Below are my complete SPECjvm2008 results, running on an AWS EC2 m5.4xl host (CPU details also shared below), with a warmup time of 120s and 1 iteration of 240s per benchmark. My results are somewhat noisy, in part due to running on virtualized hardware.
> 
> There is a range of both regressions and improvements after my change, roughly equal in count and magnitude. Do keep in mind that any regressions (where UseAVX=2 is slower) are local to that operation, but improvements (where UseAVX=2 is faster) can often be felt by the whole machine: avoiding 15% downclocking is worth a lot more than a 15% speedup in one code path. On this basis, I think UseAVX=2 is the right default for this hardware.
> 
> @vnkozlov - I don't see a significant regression in crypto.aes on my runs, could you please share more info about your test and hardware?
> 
> 
> cpu family      : 6
> model           : 85
> model name      : Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
> stepping        : 7
> microcode       : 0x5003103
> 
> 
> Note for the `Change` column: a positive % means the UseAVX=2 run is faster/better (higher ops/m) compared to the UseAVX=3 baseline/default behaviour. Negative means it was slower.
> 
> |                             | UseAVX=3 (ops/m) | UseAVX=2 (ops/m) | Change |
> |-----------------------------|------------------|------------------|--------|
> | startup.helloworld          | 217              | 190              | -12%   |
> | startup.compiler.compiler   | 218              | 205              | -6%    |
> | startup.compiler.sunflow    | 218              | 221              | +1%    |
> | startup.compress            | 42               | 42               | +1%    |
> | startup.crypto.aes          | 18               | 18               | -1%    |
> | startup.crypto.rsa          | 99               | 94               | -5%    |
> | startup.crypto.signverify   | 80               | 77               | -5%    |
> | startup.mpegaudio           | 29               | 28               | -4%    |
> | startup.scimark.fft         | 67               | 69               | +3%    |
> | startup.scimark.lu          | 90               | 75               | -17%   |
> | startup.scimark.monte_carlo | 17               | 17               | +1%    |
> | startup.scimark.sor         | 32               | 35               | +9%    |
> | startup.scimark.sparse      | 42               | 42               | +0%    |
> | startup.serial              | 31               | 32               | +2%    |
> | startup.sunflow             | 37               | 34               | -9%    |
> | startup.xml.transform       | 35               | 35               | -1%    |
> | startup.xml.validation      | 47               | 50               | +7%    |
> | compress                    | 576              | 568              | -1%    |
> | crypto.aes                  | 202              | 200              | -1%    |
> | crypto.rsa                  | 4065             | 4049             | -0%    |
> | crypto.signverify           | 2033             | 1964             | -3%    |
> | derby                       | 1391             | 1411             | +1%    |
> | mpegaudio                   | 324              | 363              | +12%   |
> | scimark.fft.large           | 274              | 271              | -1%    |
> | scimark.lu.large            | 73               | 73               | -0%    |
> | scimark.sor.large           | 160              | 154              | -4%    |
> | scimark.sparse.large        | 155              | 129              | -17%   |
> | scimark.fft.small           | 1340             | 1421             | +6%    |
> | scimark.lu.small            | 1967             | 2467             | +25%   |
> | scimark.sor.small           | 700              | 679              | -3%    |
> | scimark.sparse.small        | 544              | 492              | -10%   |
> | scimark.monte_carlo         | 729              | 700              | -4%    |
> | serial                      | 477              | 466              | -2%    |
> | sunflow                     | 208              | 219              | +5%    |
> | xml.transform               | 778              | 894              | +15%   |
> | xml.validation              | 1610             | 1926             | +20%   |
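
For readers who want to re-derive the `Change` column, here is a minimal sketch of the calculation as described in the note above the table (UseAVX=2 score relative to the UseAVX=3 baseline). The function name `percent_change` is only illustrative, and the three rows are copied from the table. Because the ops/m values shown are already rounded, recomputing other rows this way can differ from the posted figure by a point.

```python
def percent_change(baseline, candidate):
    """Relative change of candidate vs. baseline, in percent (positive = candidate is faster)."""
    return (candidate - baseline) / baseline * 100.0


# (UseAVX=3 ops/m, UseAVX=2 ops/m), taken from the table above.
rows = {
    "startup.helloworld": (217, 190),
    "scimark.lu.small": (1967, 2467),
    "xml.transform": (778, 894),
}

for name, (avx3, avx2) in rows.items():
    print(f"{name}: {percent_change(avx3, avx2):+.0f}%")
# startup.helloworld: -12%
# scimark.lu.small: +25%
# xml.transform: +15%
```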

@olivergillespie I think these benchmarks should be done on bare metal machines. As you have correctly observed, usage of AVX3 can impact the whole CPU. How can you make sure that other applications running in parallel to yours on the same virtualized hardware don't influence your results by using AVX3 themselves?

Also, I think you should run each sub-benchmark in isolation to make sure that the CPU has recovered from any down-clocking caused by a previous sub-benchmark which heavily used AVX3.
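
One possible way to do this is sketched below: launch a fresh JVM per sub-benchmark and pause between runs. The SPECjvm2008 flags are copied from the command quoted earlier in this thread; the helper names and the 60-second cool-down are assumptions for illustration only, not measured recovery times.

```python
import subprocess
import time

# JVM options copied from the command line quoted earlier in this thread.
BASE_COMMAND = [
    "java",
    "--add-opens=java.xml/com.sun.org.apache.xerces.internal.parsers=ALL-UNNAMED",
    "--add-opens=java.xml/com.sun.org.apache.xerces.internal.util=ALL-UNNAMED",
]


def build_command(benchmark, use_avx):
    """Build the command line for one sub-benchmark in its own JVM."""
    return BASE_COMMAND + [
        f"-XX:UseAVX={use_avx}",
        "-jar", "SPECjvm2008.jar", "-ikv", "-ict", benchmark,
    ]


def run_isolated(benchmarks, use_avx, cooldown_s=60,
                 runner=subprocess.run, sleep=time.sleep):
    """Run each sub-benchmark separately, idling between runs.

    cooldown_s is a hypothetical value; pick one long enough for the
    clock speed to return to normal on the machine under test.
    """
    for benchmark in benchmarks:
        runner(build_command(benchmark, use_avx), check=False)
        sleep(cooldown_s)
```

For example, `run_isolated(["xml.transform", "scimark.lu.small"], use_avx=3)` would start two separate JVMs with a pause after each.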

It might also be helpful to run a tool like [i7z](https://github.com/ajaiantilal/i7z) in parallel to your benchmarks which displays the clock speed of all CPUs (there might be better tools but this should at least give you an idea about what's going on). Ideally, you should have an additional column for every result which displays the average CPU speed during the benchmark run.
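
If no dedicated tool is at hand, a rough sketch of how such an average could be collected on Linux is to sample the `cpu MHz` lines of `/proc/cpuinfo` while the benchmark process runs. The function names and the one-second interval are assumptions, and the values the kernel reports there may not be meaningful on virtualized hardware.

```python
import re
import statistics
import subprocess
import time

# Matches lines such as "cpu MHz		: 2700.000" in /proc/cpuinfo on x86 Linux.
CPU_MHZ = re.compile(r"^cpu MHz\s*:\s*([0-9.]+)", re.MULTILINE)


def parse_cpu_mhz(cpuinfo_text):
    """Return the per-core 'cpu MHz' values found in /proc/cpuinfo text."""
    return [float(value) for value in CPU_MHZ.findall(cpuinfo_text)]


def mean_mhz_during(command, interval_s=1.0, cpuinfo_path="/proc/cpuinfo"):
    """Run command and return the mean of all sampled core frequencies in MHz.

    Returns None if the process finished before any sample was taken.
    """
    samples = []
    proc = subprocess.Popen(command)
    while proc.poll() is None:
        with open(cpuinfo_path) as f:
            samples.extend(parse_cpu_mhz(f.read()))
        time.sleep(interval_s)
    return statistics.mean(samples) if samples else None
```

The returned value could then be recorded next to each ops/m score as the extra column.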

-------------

PR: https://git.openjdk.java.net/jdk/pull/8731