RFR: 8353686: Optimize Math.cbrt for x86 64 bit platforms [v4]

Jatin Bhateja jbhateja at openjdk.org
Thu May 29 08:38:53 UTC 2025


On Wed, 28 May 2025 18:39:13 GMT, Mohamed Issa <duke at openjdk.org> wrote:

>> The goal of this PR is to implement an x86_64 intrinsic for java.lang.Math.cbrt() using libm. A new set of micro-benchmarks is included to check the performance of specific input value ranges and to help prevent future regressions.
>> 
>> The command to run all range-specific micro-benchmarks is posted below.
>> 
>> `make test TEST="micro:CbrtPerf.CbrtPerfRanges"`
>> 
>> The results of all tests posted below were captured with an [Intel® Xeon 6761P](https://www.intel.com/content/www/us/en/products/sku/241842/intel-xeon-6761p-processor-336m-cache-2-50-ghz/specifications.html) using [OpenJDK v25-b21](https://github.com/openjdk/jdk/releases/tag/jdk-25%2B21) as the baseline version.
>> 
>> For performance data collected with the new built-in range micro-benchmark, see the table below. Each result is the mean of 8 individual runs, and the input ranges used match those from the original Java implementation. Overall, the intrinsic provides a major uplift of 169% when very small inputs are used and a more modest uplift of 45% for all other inputs.
>> 
>> | Input range(s)                                  | Baseline throughput (ops/ms) | Intrinsic throughput (ops/ms) | Speedup |
>> | :-------------------------------------: | :-------------------------------: | :-------------------------------: | :---------: |
>> | [-2^(-1022), 2^(-1022)]                   | 6568                                        | 17678                                      | 2.69x       |
>> | (-INF, -2^(-1022)], [2^(-1022), INF) | 138932                                    | 200897                                    | 1.45x       |
>> 
>> Finally, the `jtreg:test/jdk/java/lang/Math/CubeRootTests.java` test passed with the changes.
>
> Mohamed Issa has updated the pull request incrementally with four additional commits since the last revision:
> 
>  - Remove comment mentioning invalid exception when NaN input is provided
>  - Use rcx as base and r8 as index for address calculations in certain cbrt stub generator instructions
>  - Remove unnecessary unpckhpd and unpcklpd definitions in macro-assembler header file
>  - Remove unnecessary movapd definitions in macro-assembler header file

Patch looks good to me; some comments are included below.

src/hotspot/cpu/x86/stubGenerator_x86_64_cbrt.cpp line 185:

> 183: 
> 184: #define __ _masm->
> 185: 

The original Intel libm inline sequence uses hexadecimal constants. I would have preferred to keep them as they are, to maintain a 1:1 mapping between the two instruction sequences.
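
To illustrate why hexadecimal notation makes such a mapping easy to check, here is a minimal Java sketch (not taken from the patch). The class name, the helper names, and the constant chosen, the IEEE 754 bit pattern of 1.0, are assumptions made for illustration only; they are not constants from the cbrt stub.

```java
public class HexConstantSketch {
    // Hypothetical example constant: the IEEE 754 double bit pattern of 1.0.
    // It is NOT one of the constants used by the cbrt stub.
    static final long ONE_BITS = 0x3FF0000000000000L;

    // Reinterpret a raw 64-bit pattern as a double.
    static double fromBits(long bits) {
        return Double.longBitsToDouble(bits);
    }

    // Render the raw bit pattern of a double in hexadecimal form.
    static String toHex(double d) {
        return "0x" + Long.toHexString(Double.doubleToRawLongBits(d)).toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(fromBits(ONE_BITS)); // value encoded by the pattern
        System.out.println(toHex(1.0));         // pattern recovered from the value
    }
}
```

Written in hexadecimal, a table entry can be compared against the original listing digit by digit, and the sign, exponent, and mantissa fields remain visible in the literal.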

test/micro/org/openjdk/bench/java/lang/CbrtPerf.java line 56:

> 54:     public static class CbrtPerfRanges {
> 55:         public static int cbrtInputCount = 2048;
> 56: 

Please create a separate CbrtPerfSpecialValues benchmark for +/- 0.0, +/- Infinity, and NaN values.
I understand that handling special cases in the intrinsic may impact general-case performance, but it is OK to have at least a micro-benchmark for them.
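
As a rough sketch of the inputs such a benchmark would cover, the plain Java below lists the special values and checks the behaviour documented for Math.cbrt: a NaN input gives NaN, an infinite input gives an infinity of the same sign, and a zero input gives a zero of the same sign. The class and method names are hypothetical, and JMH is deliberately left out so the snippet runs on its own; in the real micro-benchmark each value would presumably be a field of a JMH state class passed to a @Benchmark method.

```java
public class CbrtSpecialValuesSketch {
    // Inputs a hypothetical CbrtPerfSpecialValues benchmark would exercise.
    static final double[] SPECIAL_INPUTS = {
        0.0, -0.0,
        Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY,
        Double.NaN
    };

    // Bitwise comparison so that +0.0 and -0.0 are told apart and NaN equals NaN
    // (doubleToLongBits collapses all NaNs to one canonical pattern).
    static boolean sameBits(double a, double b) {
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }

    // For every special input, Math.cbrt is specified to return the input itself.
    static boolean cbrtPreservesSpecialValues() {
        for (double x : SPECIAL_INPUTS) {
            if (!sameBits(Math.cbrt(x), x)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (double x : SPECIAL_INPUTS) {
            System.out.println("cbrt(" + x + ") = " + Math.cbrt(x));
        }
    }
}
```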

test/micro/org/openjdk/bench/java/lang/CbrtPerf.java line 114:

> 112:         public static final double constDouble512 = 512.0;
> 113: 
> 114:         @Benchmark

Baseline:-
Benchmark                                     (cbrtRangeIndex)   Mode  Cnt        Score   Error   Units
CbrtPerf.CbrtPerfConstant.cbrtConstDouble0                 N/A  thrpt    2  2673018.356          ops/ms
CbrtPerf.CbrtPerfConstant.cbrtConstDouble1                 N/A  thrpt    2  2684233.593          ops/ms
CbrtPerf.CbrtPerfConstant.cbrtConstDouble27                N/A  thrpt    2  2684250.835          ops/ms
CbrtPerf.CbrtPerfConstant.cbrtConstDouble512               N/A  thrpt    2  2683616.321          ops/ms
Withopt:-
Benchmark                                     (cbrtRangeIndex)   Mode  Cnt       Score   Error   Units
CbrtPerf.CbrtPerfConstant.cbrtConstDouble0                 N/A  thrpt    2   284575.292          ops/ms
CbrtPerf.CbrtPerfConstant.cbrtConstDouble1                 N/A  thrpt    2   162876.035          ops/ms
CbrtPerf.CbrtPerfConstant.cbrtConstDouble27                N/A  thrpt    2   163227.835          ops/ms
CbrtPerf.CbrtPerfConstant.cbrtConstDouble512               N/A  thrpt    2   162998.844          ops/ms


There is an approximately 10x performance improvement from disabling the intrinsic for compile-time constant inputs.
I have created a follow-up JBS issue to track it: https://bugs.openjdk.org/browse/JDK-8358039
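
For reference, the per-benchmark ratios behind that estimate can be recomputed from the scores listed above with a few lines of Java. The class and method names are hypothetical; the scores are copied from the two tables.

```java
public class ConstantInputRatioSketch {
    static final String[] NAMES = {
        "cbrtConstDouble0", "cbrtConstDouble1", "cbrtConstDouble27", "cbrtConstDouble512"
    };
    // Throughput in ops/ms, copied from the Baseline and Withopt tables.
    static final double[] BASELINE = {2673018.356, 2684233.593, 2684250.835, 2683616.321};
    static final double[] WITH_OPT = {284575.292, 162876.035, 163227.835, 162998.844};

    // How many times faster the baseline is than the intrinsic for benchmark i.
    static double ratio(int i) {
        return BASELINE[i] / WITH_OPT[i];
    }

    public static void main(String[] args) {
        for (int i = 0; i < NAMES.length; i++) {
            System.out.printf("%-20s %.2fx%n", NAMES[i], ratio(i));
        }
    }
}
```

The baseline comes out roughly 9x faster for cbrtConstDouble0 and roughly 16x faster for the other three constants.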

-------------

PR Review: https://git.openjdk.org/jdk/pull/24470#pullrequestreview-2877492755
PR Review Comment: https://git.openjdk.org/jdk/pull/24470#discussion_r2113462482
PR Review Comment: https://git.openjdk.org/jdk/pull/24470#discussion_r2113484695
PR Review Comment: https://git.openjdk.org/jdk/pull/24470#discussion_r2113472992


More information about the graal-dev mailing list