RFR: 8286823: Default to UseAVX=2 on all Skylake/Cascade Lake CPUs

Thu Aug 14 11:28:27 UTC 2025

On Mon, 16 May 2022 15:52:22 GMT, Oli Gillespie <ogillespie at openjdk.org> wrote:

> The current code already does this for 'older' Skylake processors,
> namely those with _stepping < 5. My testing indicates this is a
> problem for later processors in this family too, so I have removed the
> max stepping condition.
> 
> The original exclusion was added in https://bugs.openjdk.java.net/browse/JDK-8221092.
> 
> A general description of the overall issue is given at
> https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking.
> 
> According to https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake#CPUID,
> stepping values 5..7 indicate Cascade Lake. I have tested on a CPU with stepping=7,
> and I see CPU frequency reduction from 3.1GHz down to 2.7GHz (~13%) when using
> -XX:UseAVX=3, along with a corresponding performance reduction.
> 
> I first saw this issue in a real production workload, where the main AVX3 instructions
> being executed were those generated for various flavours of disjoint_arraycopy.
> 
> I can reproduce a similar effect using SPECjvm2008's xml.transform benchmark.
> 
> 
> java --add-opens=java.xml/com.sun.org.apache.xerces.internal.parsers=ALL-UNNAMED \
> --add-opens=java.xml/com.sun.org.apache.xerces.internal.util=ALL-UNNAMED \
> -jar SPECjvm2008.jar -ikv -ict xml.transform
> 
> 
> Before the change, or with -XX:UseAVX=3:
> 
> 
> Valid run!
> Score on xml.transform: 776.00 ops/m
> 
> 
> After the change, or with -XX:UseAVX=2:
> 
> 
> Valid run!
> Score on xml.transform: 894.07 ops/m
> 
> 
> So, a 15% improvement in this benchmark. It's possible some benchmarks will be negatively
> affected by this change, but I contend that this is still the right move given the stark
> difference in this benchmark combined with the fact that use of AVX3 instructions can
> affect *all* processes/code on the host due to the downclocking, and the fact that this
> effect is very hard to root-cause, for example CPU profiles look very similar before and
> after since all code is equally slowed.

FWIW, I also gathered perf counters on a Skylake (stepping=4) host (my EC2 Cascade Lake instance does not expose hardware perf counters).

In the full `mpegaudio` benchmark:

java -jar SPECjvm2008.jar --ignoreCheckTest -ikv --benchmarkThreads 1 --iterationTime 10s --warmupTime 10s mpegaudio

# after warmup
sudo perf stat -a -C 1 -e core_power.lvl0_turbo_license -e core_power.lvl1_turbo_license -e core_power.lvl2_turbo_license -e core_power.throttle -- sleep 5

UseAVX=3
       352,262,777      core_power.lvl0_turbo_license
    13,606,422,784      core_power.lvl1_turbo_license
     1,304,765,823      core_power.lvl2_turbo_license
        43,879,483      core_power.throttle

UseAVX=3 MaxVectorSize=32
       308,363,591      core_power.lvl0_turbo_license
    15,169,908,798      core_power.lvl1_turbo_license
                 0      core_power.lvl2_turbo_license
           103,330      core_power.throttle

UseAVX=2
    16,231,858,845      core_power.lvl0_turbo_license
         9,584,504      core_power.lvl1_turbo_license
                 0      core_power.lvl2_turbo_license
            59,489      core_power.throttle

So MaxVectorSize=32 does eliminate the time spent in the lvl2 license, but it leaves the core in lvl1 for almost all cycles, which UseAVX=2 all but eliminates.
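
To make the comparison easier to read, here is a minimal sketch that turns the mpegaudio counters above into per-license shares. It assumes the three core_power.lvl*_turbo_license counters together account for all sampled core cycles (core_power.throttle is left out); the names RUNS and license_shares are illustrative only.

```python
# Counter values copied from the mpegaudio perf stat output above.
# Assumption: lvl0 + lvl1 + lvl2 license cycles cover all sampled core cycles.
RUNS = {
    "UseAVX=3": (352_262_777, 13_606_422_784, 1_304_765_823),
    "UseAVX=3 MaxVectorSize=32": (308_363_591, 15_169_908_798, 0),
    "UseAVX=2": (16_231_858_845, 9_584_504, 0),
}


def license_shares(counts):
    """Return the fraction of cycles spent in each license level (lvl0, lvl1, lvl2)."""
    total = sum(counts)
    return tuple(c / total for c in counts)


if __name__ == "__main__":
    for name, counts in RUNS.items():
        lvl0, lvl1, lvl2 = license_shares(counts)
        print(f"{name:<28} lvl0={lvl0:6.1%} lvl1={lvl1:6.1%} lvl2={lvl2:6.1%}")
```

Shares are easier to compare across runs than raw counts, since the total number of cycles sampled in each 5 second window differs slightly.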

For my subset benchmark, UseAVX=2 makes no difference, but disabling xmm_clear_mem (via -UseXMMForObjInit) clearly does.

UseAVX=3
        29,136,581      core_power.lvl0_turbo_license
    15,712,155,433      core_power.lvl1_turbo_license
                 0      core_power.lvl2_turbo_license
             8,726      core_power.throttle

UseAVX=2
        45,202,936      core_power.lvl0_turbo_license
    15,717,333,898      core_power.lvl1_turbo_license
                 0      core_power.lvl2_turbo_license
            10,711      core_power.throttle

UseAVX=3 -UseXMMForObjInit
    16,550,998,060      core_power.lvl0_turbo_license
         6,456,252      core_power.lvl1_turbo_license
                 0      core_power.lvl2_turbo_license
               482      core_power.throttle
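
For anyone repeating this on other hosts, the following is a rough sketch of parsing such counter lines programmatically. It handles only the two-column "count event" form shown here, not every layout perf stat can produce; parse_perf_counts and SAMPLE are hypothetical names, and the sample text is the last block above.

```python
import re

# Matches lines such as "    16,550,998,060      core_power.lvl0_turbo_license".
# Assumption: each counter line is just a comma-grouped count and an event name.
LINE = re.compile(r"^\s*([\d,]+)\s+(\S+)\s*$")


def parse_perf_counts(text):
    """Map event name -> integer count for each counter line in the text."""
    counts = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            counts[m.group(2)] = int(m.group(1).replace(",", ""))
    return counts


SAMPLE = """\
    16,550,998,060      core_power.lvl0_turbo_license
         6,456,252      core_power.lvl1_turbo_license
                 0      core_power.lvl2_turbo_license
               482      core_power.throttle
"""

if __name__ == "__main__":
    counts = parse_perf_counts(SAMPLE)
    lvl1 = counts["core_power.lvl1_turbo_license"]
    licensed = sum(v for k, v in counts.items() if k.endswith("_turbo_license"))
    print(f"lvl1 share: {lvl1 / licensed:.4%}")
```

Lines that do not start with a count, such as the flag labels above each block, are skipped.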

-------------

PR Comment: https://git.openjdk.org/jdk/pull/8731#issuecomment-3188104949