RFR: 8286823: Default to UseAVX=2 on all Skylake/Cascade Lake CPUs

Sandhya Viswanathan sviswanathan at openjdk.java.net
Thu May 19 17:55:55 UTC 2022


On Mon, 16 May 2022 15:52:22 GMT, olivergillespie <duke at openjdk.java.net> wrote:

> The current code already does this for 'older' Skylake processors,
> namely those with _stepping < 5. My testing indicates this is a
> problem for later processors in this family too, so I have removed the
> max stepping condition.
> 
> The original exclusion was added in https://bugs.openjdk.java.net/browse/JDK-8221092.
> 
> A general description of the overall issue is given at
> https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking.
> 
> According to https://en.wikichip.org/wiki/intel/microarchitectures/cascade_lake#CPUID,
> stepping values 5..7 indicate Cascade Lake. I have tested on a CPU with stepping=7,
> and I see CPU frequency reduction from 3.1GHz down to 2.7GHz (~13%) when using
> -XX:UseAVX=3, along with a corresponding performance reduction.
> 
> I first saw this issue in a real production workload, where the main AVX3 instructions
> being executed were those generated for various flavours of disjoint_arraycopy.
> 
> I can reproduce a similar effect using SPECjvm2008's xml.transform benchmark.
> 
> 
> java --add-opens=java.xml/com.sun.org.apache.xerces.internal.parsers=ALL-UNNAMED \
> --add-opens=java.xml/com.sun.org.apache.xerces.internal.util=ALL-UNNAMED \
> -jar SPECjvm2008.jar -ikv -ict xml.transform
> 
> 
> Before the change, or with -XX:UseAVX=3:
> 
> 
> Valid run!
> Score on xml.transform: 776.00 ops/m
> 
> 
> After the change, or with -XX:UseAVX=2:
> 
> 
> Valid run!
> Score on xml.transform: 894.07 ops/m
> 
> 
> So, a 15% improvement in this benchmark. It's possible some benchmarks will be negatively
> affected by this change, but I contend that this is still the right move given the stark
> difference in this benchmark combined with the fact that use of AVX3 instructions can
> affect *all* processes/code on the host due to the downclocking, and the fact that this
> effect is very hard to root-cause, for example CPU profiles look very similar before and
> after since all code is equally slowed.

From what I understand, only the core that is executing 512-bit vector instructions runs at the lower frequency, not the entire processor. That core does double the work per clock during that time, so overall we should come out OK.
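
To make the "double the work per clock" trade-off concrete, here is a rough sketch of the arithmetic. It assumes an idealized model that is not from the original messages: 512-bit code does exactly twice the work per clock of 256-bit code, and all code on the core runs at the reduced frequency for the whole run. The class and method names are hypothetical. Under those assumptions, with the 3.1GHz and 2.7GHz figures quoted above, roughly 26% of the AVX2 run time would need to be in widenable vector code just to break even.

```java
// Idealized model only (an assumption, not measured behavior): 512-bit code
// does exactly twice the work per clock of 256-bit code, and every instruction
// on the core runs at the reduced frequency for the whole run.
public class AvxBreakEven {

    // Run time with AVX-512 relative to AVX2 (1.0 = equal, below 1.0 = faster).
    // vectorFraction: share of AVX2 run time spent in code that widens to 512 bits.
    // freqRatio: reduced clock divided by normal clock, e.g. 2.7 / 3.1.
    static double relativeTime(double vectorFraction, double freqRatio) {
        double scalarPart = (1.0 - vectorFraction) / freqRatio;  // same work, slower clock
        double vectorPart = (vectorFraction / 2.0) / freqRatio;  // half the clocks, slower clock
        return scalarPart + vectorPart;
    }

    // Vector fraction at which relativeTime(...) equals 1.0:
    // (1 - f/2) / r = 1  =>  f = 2 * (1 - r).
    static double breakEvenFraction(double freqRatio) {
        return 2.0 * (1.0 - freqRatio);
    }

    public static void main(String[] args) {
        double freqRatio = 2.7 / 3.1; // frequencies reported in the quoted message
        System.out.printf("frequency ratio: %.3f%n", freqRatio);
        System.out.printf("break-even vector fraction: %.3f%n", breakEvenFraction(freqRatio));
        System.out.printf("relative time at 10%% vector code: %.3f%n",
                relativeTime(0.10, freqRatio));
    }
}
```

Real hardware will differ from this model, for example because the frequency drops only while wide instructions are in flight, so treat the output as a sanity check rather than a prediction.
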
@olivergillespie As Volker mentioned, the SPECjvm2008 data that you shared was all over the place. You would agree that we cannot change a global default based on that. In addition, SPECjvm2008 has a lot of run-to-run variation in general, for various reasons (e.g. data locality if the threads are spread across multiple sockets). Please take that into account as well.
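
One way to take run-to-run variation into account is to repeat each configuration several times and compare the difference in means against the spread. The sketch below is only an illustration, not a tool used in this review: the class name, method names, and command-line format are hypothetical. It computes the mean, the sample standard deviation, and Welch's t statistic for two sets of scores.

```java
import java.util.Arrays;

// Hypothetical helper for comparing repeated benchmark scores (e.g. ops/m)
// from two configurations, such as -XX:UseAVX=3 and -XX:UseAVX=2.
public class RunVariation {

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElseThrow();
    }

    // Unbiased sample variance; needs at least two runs.
    static double sampleVariance(double[] xs) {
        if (xs.length < 2) {
            throw new IllegalArgumentException("need at least two runs");
        }
        double m = mean(xs);
        double sum = 0.0;
        for (double x : xs) {
            sum += (x - m) * (x - m);
        }
        return sum / (xs.length - 1);
    }

    // Welch's t statistic: difference in means divided by its standard error.
    // A magnitude near zero means the difference is within the noise.
    static double welchT(double[] a, double[] b) {
        double se = Math.sqrt(sampleVariance(a) / a.length + sampleVariance(b) / b.length);
        return (mean(a) - mean(b)) / se;
    }

    static double[] parse(String csv) {
        return Arrays.stream(csv.split(","))
                .mapToDouble(s -> Double.parseDouble(s.trim()))
                .toArray();
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: java RunVariation.java <scoresA> <scoresB>");
            return;
        }
        double[] a = parse(args[0]);
        double[] b = parse(args[1]);
        System.out.printf("A: mean=%.2f stddev=%.2f%n", mean(a), Math.sqrt(sampleVariance(a)));
        System.out.printf("B: mean=%.2f stddev=%.2f%n", mean(b), Math.sqrt(sampleVariance(b)));
        System.out.printf("Welch t = %.3f%n", welchT(a, b));
    }
}
```

Each argument is a comma-separated list of scores from repeated runs of one configuration. How large the statistic must be before a difference counts as real depends on the number of runs and the chosen significance level.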

-------------

PR: https://git.openjdk.java.net/jdk/pull/8731

