RFR: 8349138: Optimize Math.copySign API for Intel e-core targets
Jasmine Karthikeyan
jkarthikeyan at openjdk.org
Sat Feb 1 22:31:55 UTC 2025
On Fri, 31 Jan 2025 11:22:47 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
> Math.copySign is only intrinsified on x86 targets that support the AVX512 feature.
> Intel E-core Xeons support only the AVX2 feature set and therefore still compile the Java implementation, which is composed of logical operations.
>
> Since copying incoming float/double values to GPRs before applying the logical operations carries a 3-cycle penalty, there is an opportunity to optimize this using a more efficient instruction sequence.
>
> The patch uses the ANDPS and ANDPD logical instructions to generate efficient instruction sequences that absorb the cross-domain copy penalty. It also performs minor tuning of the existing AVX512 instruction sequence based on the VPTERNLOG instruction.
>
> Performance numbers for the following existing microbenchmark are given below:
> https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/vm/compiler/Signum.java
>
> The patch passes the following validation test:
> [test/jdk/java/lang/Math/IeeeRecommendedTests.java](https://github.com/openjdk/jdk/blob/master/test/jdk/java/lang/Math/IeeeRecommendedTests.java)
>
> Granite Rapids-AP (P-core Xeon)
> Baseline AVX512:
> Benchmark                     Mode   Cnt  Score     Error  Units
> Signum._5_copySignFloatTest   thrpt  2    1296.141         ops/ns
> Signum._7_copySignDoubleTest  thrpt  2    838.954          ops/ns
>
> Withopt:
> Benchmark                     Mode   Cnt  Score     Error  Units
> Signum._5_copySignFloatTest   thrpt  2    940.240          ops/ns
> Signum._7_copySignDoubleTest  thrpt  2    967.370          ops/ns
>
> Baseline AVX2:
> Benchmark                     Mode   Cnt  Score     Error  Units
> Signum._5_copySignFloatTest   thrpt  2    63.673           ops/ns
> Signum._7_copySignDoubleTest  thrpt  2    26.898           ops/ns
>
> Withopt:
> Benchmark                     Mode   Cnt  Score     Error  Units
> Signum._5_copySignFloatTest   thrpt  2    785.801          ops/ns
> Signum._7_copySignDoubleTest  thrpt  2    558.710          ops/ns
>
> Sierra Forest (E-core Xeon)
> Baseline:
> Benchmark                                       (seed)  Mode   Cnt  Score    Error  Units
> o.o.b.vm.compiler.Signum._5_copySignFloatTest   N/A     thrpt  2    40.528          ops/ns
> o.o.b.vm.compiler.Signum._7_copySignDoubleTest  N/A     thrpt  2    25.101          ops/ns
>
> Withopt:
> Benchmark                                       (seed)  Mode   Cnt  Score    Error  Units
> o.o.b.vm.compiler.Signum._5_copySignFloatTest   N/A     thrpt  2    676.101         ops/ns
> o.o.b.vm.compiler.Signum._7_copySignDoubleTest  N/A     thrpt  2    ...
I think this is a good improvement! Having more intrinsics available for AVX2 targets is nice. I've left some comments below.
src/hotspot/cpu/x86/x86.ad line 1613:
> 1611: case Op_CopySignD:
> 1612: case Op_CopySignF:
> 1613: if (UseAVX < 1 || !is_LP64) {
Should this be limited to just AVX2, or can the new rules work on AVX1 as well? They only use instructions that are available in AVX1.
src/hotspot/cpu/x86/x86.ad line 6769:
> 6767:
> 6768: instruct copySignF_reg_avx(regF dst, regF src, regF xtmp1, regF xtmp2) %{
> 6769: predicate(!VM_Version::supports_avx512vl());
Suggestion:
predicate(UseAVX > 0 && !VM_Version::supports_avx512vl());
Just to be a bit more explicit (and same for the one below).
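For context, the logical-operation fallback mentioned in the PR description can be sketched in plain Java roughly as below. This is a minimal illustration, not the JDK source: the class name `CopySignSketch` and the method name `copySignBits` are hypothetical, and the masks are written out as literals. The float-to-int conversions are where a value would move between floating-point and general-purpose registers, which is presumably the copy the description's penalty refers to.

```java
// Hypothetical sketch: copySign expressed with integer logical operations.
class CopySignSketch {
    // Keep the sign bit of `sign` and the exponent/significand bits of `magnitude`.
    static float copySignBits(float magnitude, float sign) {
        int signBit = Float.floatToRawIntBits(sign) & 0x80000000;
        int magBits = Float.floatToRawIntBits(magnitude) & 0x7FFFFFFF;
        return Float.intBitsToFloat(signBit | magBits);
    }

    // Same idea for double, using 64-bit masks.
    static double copySignBits(double magnitude, double sign) {
        long signBit = Double.doubleToRawLongBits(sign) & 0x8000000000000000L;
        long magBits = Double.doubleToRawLongBits(magnitude) & 0x7FFFFFFFFFFFFFFFL;
        return Double.longBitsToDouble(signBit | magBits);
    }

    public static void main(String[] args) {
        // Compare the sketch against the library method on a sample input.
        System.out.println(copySignBits(3.5f, -0.0f) == Math.copySign(3.5f, -0.0f));
        System.out.println(copySignBits(-2.0, 1.0) == Math.copySign(-2.0, 1.0));
    }
}
```

An intrinsic built on ANDPS/ANDPD can apply the same masks while the values stay in XMM registers, which is the saving the patch is after.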
-------------
PR Review: https://git.openjdk.org/jdk/pull/23386#pullrequestreview-2588420458
PR Review Comment: https://git.openjdk.org/jdk/pull/23386#discussion_r1938356114
PR Review Comment: https://git.openjdk.org/jdk/pull/23386#discussion_r1938356134