RFR: 8286823: Default to UseAVX=2 on all Skylake/Cascade Lake CPUs

Thu Aug 14 10:44:33 UTC 2025

On Mon, 16 May 2022 15:52:22 GMT, Oli Gillespie <ogillespie at openjdk.org> wrote:

> The current code already does this for 'older' Skylake processors,
> namely those with _stepping < 5. My testing indicates this is a
> problem for later processors in this family too, so I have removed the
> max stepping condition.
> 
> The original exclusion was added in https://bugs.openjdk.java.net/browse/JDK-8221092.
> 
> A general description of the overall issue is given at
> https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking.
> 
> According to https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake#CPUID,
> stepping values 5..7 indicate Cascade Lake. I have tested on a CPU with stepping=7,
> and I see CPU frequency reduction from 3.1GHz down to 2.7GHz (~13%) when using
> -XX:UseAVX=3, along with a corresponding performance reduction.
> 
> I first saw this issue in a real production workload, where the main AVX3 instructions
> being executed were those generated for various flavours of disjoint_arraycopy.
> 
> I can reproduce a similar effect using SPECjvm2008's xml.transform benchmark.
> 
> 
> java --add-opens=java.xml/com.sun.org.apache.xerces.internal.parsers=ALL-UNNAMED \
> --add-opens=java.xml/com.sun.org.apache.xerces.internal.util=ALL-UNNAMED \
> -jar SPECjvm2008.jar -ikv -ict xml.transform
> 
> 
> Before the change, or with -XX:UseAVX=3:
> 
> 
> Valid run!
> Score on xml.transform: 776.00 ops/m
> 
> 
> After the change, or with -XX:UseAVX=2:
> 
> 
> Valid run!
> Score on xml.transform: 894.07 ops/m
> 
> 
> So, a 15% improvement in this benchmark. It's possible some benchmarks will be negatively
> affected by this change, but I contend that this is still the right move given the stark
> difference in this benchmark combined with the fact that use of AVX3 instructions can
> affect *all* processes/code on the host due to the downclocking, and the fact that this
> effect is very hard to root-cause, for example CPU profiles look very similar before and
> after since all code is equally slowed.

> I came back to this problem recently as I had another case of performance issues in Cascade Lake due to throttling. I'll open another PR with fresh data.
> 
> > @olivergillespie you can test your application with -XX:MaxVectorSize=32 product flag to see effects of #8877 changes.
> 
> This idea makes a lot of sense, but I tested and it doesn't avoid throttling in all (any?) cases. For example with the SPECjvm mpegaudio benchmark, my Cascade Lake processor averages around 2.7GHz with `MaxVectorSize=32`, and 3.09GHz with `UseAVX=2`. I spent some time chasing down which instructions were causing it, but I think there are quite a few.
> 
> Even with `UseAVX=2`, we do not avoid all throttling. For example, `vpxor` and `vmovdqu` in `MacroAssembler::xmm_clear_mem` are each enough to trigger throttling on their own (I'm not sure why...). I have a benchmark which runs at 2.7GHz with `-XX:UseAVX=2`, and 3.1GHz if using `-XX:-UseXMMForObjInit`, or if I manually skip the `vpxor` and `vmovdqu` in `xmm_clear_mem`.

One good way to justify your claim is to write a C microbenchmark with an inline assembly sequence that clears a block of memory using vmovdqu and vpxor instructions operating on YMM and XMM registers. In general, both VMOVDQU and VPXOR are AVX2 light instructions and should fall in the same frequency/licensing level as SSE instructions.

<img width="1162" height="402" alt="image" src="https://github.com/user-attachments/assets/e9ee441a-135c-4d2d-8034-b2038f86a49e" />
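
A rough sketch of such a microbenchmark, assuming GCC or Clang on x86-64 Linux. The names `clear_xmm`, `clear_ymm`, `spin` and `have_avx2` are made up for illustration, and the loops are not the code HotSpot emits for `xmm_clear_mem`; they only exercise the same two instructions at 128-bit and 256-bit width.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*clear_fn)(void *, size_t);

/* Nonzero when the CPU reports AVX2 (GCC/Clang builtin on x86). */
static int have_avx2(void) {
    return __builtin_cpu_supports("avx2");
}

/* Zero len bytes with VEX-encoded 128-bit stores.
   len must be a nonzero multiple of 32. */
static void clear_xmm(void *dst, size_t len) {
    assert(len != 0 && len % 32 == 0);
    char *p = dst;
    char *end = p + len;
    __asm__ volatile(
        "vpxor   %%xmm0, %%xmm0, %%xmm0\n\t"
        "1:\n\t"
        "vmovdqu %%xmm0, (%0)\n\t"
        "vmovdqu %%xmm0, 16(%0)\n\t"
        "add     $32, %0\n\t"
        "cmp     %1, %0\n\t"
        "jb      1b\n\t"
        : "+r"(p)
        : "r"(end)
        : "xmm0", "cc", "memory");
}

/* Zero len bytes with 256-bit stores.
   len must be a nonzero multiple of 32. */
static void clear_ymm(void *dst, size_t len) {
    assert(len != 0 && len % 32 == 0);
    char *p = dst;
    char *end = p + len;
    __asm__ volatile(
        "vpxor   %%ymm0, %%ymm0, %%ymm0\n\t"
        "1:\n\t"
        "vmovdqu %%ymm0, (%0)\n\t"
        "add     $32, %0\n\t"
        "cmp     %1, %0\n\t"
        "jb      1b\n\t"
        "vzeroupper\n\t" /* clears the upper halves of all vector registers */
        : "+r"(p)
        : "r"(end)
        : "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7",
          "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14",
          "xmm15", "cc", "memory");
}

/* Call fn repeatedly for roughly secs seconds of CPU time and return
   the number of calls, so the core frequency can be sampled meanwhile. */
static unsigned long spin(clear_fn fn, void *buf, size_t len, double secs) {
    unsigned long calls = 0;
    clock_t stop = clock() + (clock_t)(secs * CLOCKS_PER_SEC);
    while (clock() < stop) {
        for (int i = 0; i < 1000; i++)
            fn(buf, len);
        calls += 1000;
    }
    return calls;
}
```

A driver could check `have_avx2()`, then call `spin(clear_xmm, buf, 4096, 10.0)` followed by `spin(clear_ymm, buf, 4096, 10.0)` on a pinned core while sampling the core frequency externally (for example with `turbostat`). If the two runs settle at the same frequency, that would support the light-instruction expectation; a drop only in the YMM run would support the throttling claim.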

-------------

PR Comment: https://git.openjdk.org/jdk/pull/8731#issuecomment-3187981493