RFR: 8286823: Default to UseAVX=2 on all Skylake/Cascade Lake CPUs
Oli Gillespie
ogillespie at openjdk.org
Thu Aug 14 11:12:32 UTC 2025
On Mon, 16 May 2022 15:52:22 GMT, Oli Gillespie <ogillespie at openjdk.org> wrote:
> The current code already does this for 'older' Skylake processors,
> namely those with _stepping < 5. My testing indicates this is a
> problem for later processors in this family too, so I have removed the
> max stepping condition.
>
> The original exclusion was added in https://bugs.openjdk.java.net/browse/JDK-8221092.
>
> A general description of the overall issue is given at
> https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking.
>
> According to https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake#CPUID,
> stepping values 5..7 indicate Cascade Lake. I have tested on a CPU with stepping=7,
> and I see CPU frequency reduction from 3.1GHz down to 2.7GHz (~13%) when using
> -XX:UseAVX=3, along with a corresponding performance reduction.
>
> I first saw this issue in a real production workload, where the main AVX3 instructions
> being executed were those generated for various flavours of disjoint_arraycopy.
>
> I can reproduce a similar effect using SPECjvm2008's xml.transform benchmark.
>
>
> java --add-opens=java.xml/com.sun.org.apache.xerces.internal.parsers=ALL-UNNAMED \
> --add-opens=java.xml/com.sun.org.apache.xerces.internal.util=ALL-UNNAMED \
> -jar SPECjvm2008.jar -ikv -ict xml.transform
>
>
> Before the change, or with -XX:UseAVX=3:
>
>
> Valid run!
> Score on xml.transform: 776.00 ops/m
>
>
> After the change, or with -XX:UseAVX=2:
>
>
> Valid run!
> Score on xml.transform: 894.07 ops/m
>
>
> So, a 15% improvement in this benchmark. It's possible some benchmarks will be negatively
> affected by this change, but I contend that this is still the right move given the stark
> difference in this benchmark combined with the fact that use of AVX3 instructions can
> affect *all* processes/code on the host due to the downclocking, and the fact that this
> effect is very hard to root-cause, for example CPU profiles look very similar before and
> after since all code is equally slowed.
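
For readers unfamiliar with the heuristic the quoted description refers to, here is a minimal sketch of the stepping condition before and after the proposed change. It is not the HotSpot source: CpuInfo, default_use_avx_before and default_use_avx_after are hypothetical names made up for illustration, and the sketch assumes that Skylake and Cascade Lake server parts can be told apart only by stepping, as the wikichip page linked above suggests.

```cpp
// Hypothetical, simplified model of how the default UseAVX level is chosen.
// None of these names exist in HotSpot; they are assumptions for illustration.
struct CpuInfo {
  bool is_skylake_family;  // assumed: true for both Skylake and Cascade Lake
  int  stepping;           // CPUID stepping; 5..7 indicate Cascade Lake
  int  max_supported_avx;  // highest AVX level the CPU supports (3 = AVX-512)
};

// Before the change: only steppings below 5 are capped at UseAVX=2.
int default_use_avx_before(const CpuInfo& cpu) {
  if (cpu.max_supported_avx > 2 && cpu.is_skylake_family && cpu.stepping < 5) {
    return 2;
  }
  return cpu.max_supported_avx;
}

// After the change: the max stepping condition is removed, so every CPU in
// the family is capped at UseAVX=2 by default.
int default_use_avx_after(const CpuInfo& cpu) {
  if (cpu.max_supported_avx > 2 && cpu.is_skylake_family) {
    return 2;
  }
  return cpu.max_supported_avx;
}
```

Under this model a stepping=7 CPU goes from a default of 3 to a default of 2, while a stepping=4 CPU stays at 2. In both cases an explicit -XX:UseAVX=3 on the command line is the user's choice and is outside what the sketch models.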
Thanks! I did that, but I have not been able to reproduce it that way, so I'm confused. I've also used https://github.com/travisdowns/avx-turbo to do the same, and it doesn't show throttling from those instructions either. I must be missing something subtle about the encoding or the register usage.
diff --git a/src/hotspot/cpu/x86/macroAssembler_x86.cpp b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
index 6ebf3b50172..01a67856785 100644
--- a/src/hotspot/cpu/x86/macroAssembler_x86.cpp
+++ b/src/hotspot/cpu/x86/macroAssembler_x86.cpp
@@ -5978,7 +5978,7 @@ void MacroAssembler::xmm_clear_mem(Register base, Register cnt, Register rtmp, X
   bool use64byteVector = (MaxVectorSize == 64) && (VM_Version::avx3_threshold() == 0);
   if (use64byteVector) {
     vpxor(xtmp, xtmp, xtmp, AVX_512bit);
-  } else if (MaxVectorSize >= 32) {
+  } else if (MaxVectorSize >= 32 && !UseNewCode) {
     vpxor(xtmp, xtmp, xtmp, AVX_256bit);
   } else {
     pxor(xtmp, xtmp);
@@ -5986,7 +5986,7 @@ void MacroAssembler::xmm_clear_mem(Register base, Register cnt, Register rtmp, X
   jmp(L_zero_64_bytes);
   BIND(L_loop);
-  if (MaxVectorSize >= 32) {
+  if (MaxVectorSize >= 32 && !UseNewCode2) {
     fill64(base, 0, xtmp, use64byteVector);
   } else {
     movdqu(Address(base, 0), xtmp);
Against my benchmark (a subset of mpegaudio):
+UseNewCode -UseNewCode2
-> fill64, no vpxor, throttle (2.7GHz)
-UseNewCode +UseNewCode2
-> no fill64, vpxor, throttle (2.7GHz)
+UseNewCode +UseNewCode2
-> no fill64, no vpxor, no throttle (3.1GHz)
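
To make the mapping from flags to emitted instructions easier to follow, here is a small standalone model of the two patched branches. It is only a sketch: Zeroing, Store, Emitted, emitted and uses_wide_vectors are hypothetical names and not part of HotSpot, and the values max_vector_size=32 and avx3_threshold=0 used in the usage note are assumptions, since the message does not state them.

```cpp
// Hypothetical model of the patched branches in xmm_clear_mem; not HotSpot code.
enum class Zeroing { Vpxor512, Vpxor256, Pxor };  // how xtmp gets zeroed
enum class Store   { Fill64, Movdqu };            // how the loop body stores

struct Emitted {
  Zeroing zeroing;
  Store   store;
};

// Mirrors the conditions in the diff. use_new_code and use_new_code2 stand in
// for the UseNewCode and UseNewCode2 diagnostic flags.
Emitted emitted(int max_vector_size, int avx3_threshold,
                bool use_new_code, bool use_new_code2) {
  const bool use64byteVector = (max_vector_size == 64) && (avx3_threshold == 0);
  Zeroing z;
  if (use64byteVector) {
    z = Zeroing::Vpxor512;
  } else if (max_vector_size >= 32 && !use_new_code) {
    z = Zeroing::Vpxor256;
  } else {
    z = Zeroing::Pxor;
  }
  const Store s = (max_vector_size >= 32 && !use_new_code2) ? Store::Fill64
                                                            : Store::Movdqu;
  return Emitted{z, s};
}

// True when either branch selects the wider-vector path. In the three runs
// reported above this coincides with the throttled (2.7GHz) result.
bool uses_wide_vectors(const Emitted& e) {
  return e.zeroing != Zeroing::Pxor || e.store == Store::Fill64;
}
```

With the assumed values, emitted(32, 0, true, false) selects pxor plus fill64, emitted(32, 0, false, true) selects the 256-bit vpxor plus movdqu, and emitted(32, 0, true, true) selects pxor plus movdqu. Only the last avoids both wider paths, matching the one run that stayed at 3.1GHz.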
-------------
PR Comment: https://git.openjdk.org/jdk/pull/8731#issuecomment-3188062666