RFR: 8286823: Default to UseAVX=2 on all Skylake/Cascade Lake CPUs
Oli Gillespie
ogillespie at openjdk.org
Thu Aug 14 11:28:27 UTC 2025
On Mon, 16 May 2022 15:52:22 GMT, Oli Gillespie <ogillespie at openjdk.org> wrote:
> The current code already does this for 'older' Skylake processors,
> namely those with _stepping < 5. My testing indicates this is a
> problem for later processors in this family too, so I have removed the
> max stepping condition.
>
> The original exclusion was added in https://bugs.openjdk.java.net/browse/JDK-8221092.
>
> A general description of the overall issue is given at
> https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking.
>
> According to https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake#CPUID,
> stepping values 5..7 indicate Cascade Lake. I have tested on a CPU with stepping=7,
> and I see CPU frequency reduction from 3.1GHz down to 2.4GHz (~23%) when using
> -XX:UseAVX=3, along with a corresponding performance reduction.
>
> I first saw this issue in a real production workload, where the main AVX3 instructions
> being executed were those generated for various flavours of disjoint_arraycopy.
>
> I can reproduce a similar effect using SPECjvm2008's xml.transform benchmark.
>
>
> java --add-opens=java.xml/com.sun.org.apache.xerces.internal.parsers=ALL-UNNAMED \
> --add-opens=java.xml/com.sun.org.apache.xerces.internal.util=ALL-UNNAMED \
> -jar SPECjvm2008.jar -ikv -ict xml.transform
>
>
> Before the change, or with -XX:UseAVX=3:
>
>
> Valid run!
> Score on xml.transform: 776.00 ops/m
>
>
> After the change, or with -XX:UseAVX=2:
>
>
> Valid run!
> Score on xml.transform: 894.07 ops/m
>
>
> So, a 15% improvement in this benchmark. It's possible some benchmarks will be negatively
> affected by this change, but I contend that this is still the right move given the stark
> difference in this benchmark combined with the fact that use of AVX3 instructions can
> affect *all* processes/code on the host due to the downclocking, and the fact that this
> effect is very hard to root-cause, for example CPU profiles look very similar before and
> after since all code is equally slowed.
FWIW, I also gathered perf counters on a Skylake (stepping=4) host (my EC2 Cascade Lake instance does not expose hardware perf counters).
In the full `mpegaudio` benchmark:

    java -jar SPECjvm2008.jar --ignoreCheckTest -ikv --benchmarkThreads 1 --iterationTime 10s --warmupTime 10s mpegaudio
    # after warmup
    sudo perf stat -a -C 1 -e core_power.lvl0_turbo_license -e core_power.lvl1_turbo_license -e core_power.lvl2_turbo_license -e core_power.throttle -- sleep 5

UseAVX=3

       352,262,777  core_power.lvl0_turbo_license
    13,606,422,784  core_power.lvl1_turbo_license
     1,304,765,823  core_power.lvl2_turbo_license
        43,879,483  core_power.throttle

UseAVX=3 MaxVectorSize=32

       308,363,591  core_power.lvl0_turbo_license
    15,169,908,798  core_power.lvl1_turbo_license
                 0  core_power.lvl2_turbo_license
           103,330  core_power.throttle

UseAVX=2

    16,231,858,845  core_power.lvl0_turbo_license
         9,584,504  core_power.lvl1_turbo_license
                 0  core_power.lvl2_turbo_license
            59,489  core_power.throttle

So MaxVectorSize=32 does remove lvl2 throttling, but leaves a huge amount of lvl1, which UseAVX=2 all but eliminates.
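
To put those counters in proportion, here is a small Python sketch (not part of the original measurements) that converts the three license counters from the full mpegaudio run into a share per level. It assumes the three counters together cover the cycles of interest on that core; the `runs` layout and the `license_shares` name are purely illustrative.

```python
# Counts copied from the perf stat output for the full mpegaudio run:
# (lvl0_turbo_license, lvl1_turbo_license, lvl2_turbo_license)
runs = {
    "UseAVX=3": (352_262_777, 13_606_422_784, 1_304_765_823),
    "UseAVX=3 MaxVectorSize=32": (308_363_591, 15_169_908_798, 0),
    "UseAVX=2": (16_231_858_845, 9_584_504, 0),
}


def license_shares(lvl0, lvl1, lvl2):
    """Return the fraction of the summed license counts at each level."""
    total = lvl0 + lvl1 + lvl2
    return tuple(n / total for n in (lvl0, lvl1, lvl2))


for name, counts in runs.items():
    s0, s1, s2 = license_shares(*counts)
    print(f"{name:28s} lvl0={s0:7.2%} lvl1={s1:7.2%} lvl2={s2:7.2%}")
```

On these numbers, UseAVX=3 spends roughly 89% of the counted cycles at lvl1 and about 9% at lvl2, MaxVectorSize=32 moves that to about 98% lvl1, and UseAVX=2 is above 99.9% lvl0.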
And for my subset benchmark, UseAVX=2 does not make any difference, but removing xmm_clear_mem (the -UseXMMForObjInit run below) clearly does.

UseAVX=3

        29,136,581  core_power.lvl0_turbo_license
    15,712,155,433  core_power.lvl1_turbo_license
                 0  core_power.lvl2_turbo_license
             8,726  core_power.throttle

UseAVX=2

        45,202,936  core_power.lvl0_turbo_license
    15,717,333,898  core_power.lvl1_turbo_license
                 0  core_power.lvl2_turbo_license
            10,711  core_power.throttle

UseAVX=3 -UseXMMForObjInit

    16,550,998,060  core_power.lvl0_turbo_license
         6,456,252  core_power.lvl1_turbo_license
                 0  core_power.lvl2_turbo_license
               482  core_power.throttle

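
For anyone repeating this comparison on their own host, a minimal Python sketch for lining up two such outputs side by side. It assumes each counter line has the "count event-name" shape shown here; `parse_perf_counts` is a hypothetical helper, not part of perf, and real `perf stat` output carries extra header and timing lines that this simply skips.

```python
import re

# Counter lines copied from the subset-benchmark runs in this message.
AVX3 = """
29,136,581 core_power.lvl0_turbo_license
15,712,155,433 core_power.lvl1_turbo_license
0 core_power.lvl2_turbo_license
8,726 core_power.throttle
"""

NO_XMM_INIT = """
16,550,998,060 core_power.lvl0_turbo_license
6,456,252 core_power.lvl1_turbo_license
0 core_power.lvl2_turbo_license
482 core_power.throttle
"""

# A count with thousands separators, whitespace, then the event name.
LINE = re.compile(r"^\s*([\d,]+)\s+(\S+)")


def parse_perf_counts(text):
    """Map event name -> integer count for every 'count event' line."""
    counts = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            counts[m.group(2)] = int(m.group(1).replace(",", ""))
    return counts


before = parse_perf_counts(AVX3)
after = parse_perf_counts(NO_XMM_INIT)
for event in before:
    print(f"{event:32s} {before[event]:>16,d} -> {after[event]:>16,d}")
```

With the values above, the lvl1 count drops by more than three orders of magnitude once UseXMMForObjInit is disabled, while lvl0 rises to cover nearly all counted cycles.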
-------------
PR Comment: https://git.openjdk.org/jdk/pull/8731#issuecomment-3188104949