RFR: 8286823: Default to UseAVX=2 on all Skylake/Cascade Lake CPUs

Thu Aug 14 09:45:30 UTC 2025

On Mon, 16 May 2022 15:52:22 GMT, Oli Gillespie <ogillespie at openjdk.org> wrote:

> The current code already does this for 'older' Skylake processors,
> namely those with _stepping < 5. My testing indicates this is a
> problem for later processors in this family too, so I have removed the
> max stepping condition.
> 
> The original exclusion was added in https://bugs.openjdk.java.net/browse/JDK-8221092.
> 
> A general description of the overall issue is given at
> https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking.
> 
> According to https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake#CPUID,
> stepping values 5..7 indicate Cascade Lake. I have tested on a CPU with stepping=7,
> and I see CPU frequency reduction from 3.1GHz down to 2.7GHz (~13%) when using
> -XX:UseAVX=3, along with a corresponding performance reduction.
> 
> I first saw this issue in a real production workload, where the main AVX3 instructions
> being executed were those generated for various flavours of disjoint_arraycopy.
> 
> I can reproduce a similar effect using SPECjvm2008's xml.transform benchmark.
> 
> 
> java --add-opens=java.xml/com.sun.org.apache.xerces.internal.parsers=ALL-UNNAMED \
> --add-opens=java.xml/com.sun.org.apache.xerces.internal.util=ALL-UNNAMED \
> -jar SPECjvm2008.jar -ikv -ict xml.transform
> 
> 
> Before the change, or with -XX:UseAVX=3:
> 
> 
> Valid run!
> Score on xml.transform: 776.00 ops/m
> 
> 
> After the change, or with -XX:UseAVX=2:
> 
> 
> Valid run!
> Score on xml.transform: 894.07 ops/m
> 
> 
> So, a 15% improvement in this benchmark. It's possible some benchmarks will be negatively
> affected by this change, but I contend that this is still the right move given the stark
> difference in this benchmark combined with the fact that use of AVX3 instructions can
> affect *all* processes/code on the host due to the downclocking, and the fact that this
> effect is very hard to root-cause, for example CPU profiles look very similar before and
> after since all code is equally slowed.

I came back to this problem recently because I hit another case of performance issues on Cascade Lake caused by throttling. I'll open another PR with fresh data.

> @olivergillespie you can test your application with -XX:MaxVectorSize=32 product flag to see effects of https://github.com/openjdk/jdk/pull/8877 changes.

This idea makes a lot of sense, but I tested it and it doesn't avoid throttling in all (any?) cases. For example, with the SPECjvm2008 mpegaudio benchmark, my Cascade Lake processor averages around 2.7GHz with `MaxVectorSize=32`, and 3.09GHz with `UseAVX=2`. I spent some time chasing down which instructions were causing it, but I think there are quite a few.
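
For anyone wanting to compare average frequencies like the ones above, here is a minimal sketch of one way to sample them. It is not the tool used for the numbers in this thread. It assumes Linux on x86, where `/proc/cpuinfo` reports a `cpu MHz` line per logical CPU; the class name `CpuFreqSampler` and its methods are hypothetical. It averages over all logical CPUs, including idle ones, so it is only a rough indicator and is best run alongside a benchmark that keeps every core busy.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: samples the "cpu MHz" lines of /proc/cpuinfo (Linux/x86 assumed).
class CpuFreqSampler {

    // Returns the mean of all "cpu MHz" values in the given lines, or NaN if none are present.
    public static double averageMhz(List<String> lines) {
        double sum = 0;
        int count = 0;
        for (String line : lines) {
            if (line.startsWith("cpu MHz")) {
                int colon = line.indexOf(':');
                if (colon >= 0) {
                    sum += Double.parseDouble(line.substring(colon + 1).trim());
                    count++;
                }
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Number of one-second samples; defaults to 10.
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 10;
        double total = 0;
        for (int i = 0; i < samples; i++) {
            double mhz = averageMhz(Files.readAllLines(Path.of("/proc/cpuinfo")));
            System.out.printf("sample %d: %.0f MHz%n", i, mhz);
            total += mhz;
            Thread.sleep(1000);
        }
        System.out.printf("mean over %d samples: %.0f MHz%n", samples, total / samples);
    }
}
```

Running it in a second terminal while the benchmark executes, once per JVM flag setting, gives comparable averages for each configuration.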

Even with `UseAVX=2`, we do not avoid all throttling. For example, `vpxor` and `vmovdqu` in `MacroAssembler::xmm_clear_mem` are each enough to trigger throttling on their own (I'm not sure why...). I have a benchmark which runs at 2.7GHz with `-XX:UseAVX=2`, and at 3.1GHz with `-XX:-UseXMMForObjInit`, or if I manually skip the `vpxor` and `vmovdqu` in `xmm_clear_mem`.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/8731#issuecomment-3187811516