RFR: 8348638: Performance regression in Math.tanh [v5]
Jatin Bhateja
jbhateja at openjdk.org
Thu Apr 17 15:51:45 UTC 2025
On Thu, 10 Apr 2025 00:12:07 GMT, Mohamed Issa <duke at openjdk.org> wrote:
>> The changes described below are meant to resolve the performance regression introduced by the **x86_64 tanh** double precision floating point scalar intrinsic in #20657. Additionally, a new micro-benchmark is included to check the performance of specific input value ranges to help prevent regressions in the future.
>>
>> 1. Check and handle high magnitude input values before those in other ranges. If found, **+/- 1** is returned almost immediately without having to go through too many computations or branches.
>> 2. Reduce the lower bound of the input range that triggers a quick **+/- 1** return from **|x| >= 32** to **|x| >= 22**. This new endpoint is the exact value required for correctness that's used by the original OpenJDK implementation.
>>
>> The results of all tests posted below were captured with an [Intel® Xeon 6761P](https://www.intel.com/content/www/us/en/products/sku/241842/intel-xeon-6761p-processor-336m-cache-2-50-ghz/specifications.html) using [OpenJDK v25-b15](https://github.com/openjdk/jdk/releases/tag/jdk-25%2B15) as the baseline version.
>>
>> For the first set of performance data collected with the new built-in **tanhRange** micro-benchmark, see the table below. Each result is the mean of 8 individual runs, and the input ranges used match those in the bug report with two additional ones included. In the high value scenarios (100, 1000, 10000, 100000), the changes significantly increase throughput values over the baseline. Also, there is almost no impact to the low value (1, 2, 10, 20) scenarios.
>>
>> | Input range(s) | Baseline (ops/s) | Change (ops/s) | Change vs baseline (%) |
>> | :-------------------: | :----------------: | :----------------: | :------------------------: |
>> | [-1, 1] | 26.043 | 25.929 | -0.44 |
>> | [-2, 2] | 25.330 | 25.260 | -0.28 |
>> | [-10, 10] | 24.930 | 24.936 | +0.02 |
>> | [-20, 20] | 24.908 | 24.844 | -0.26 |
>> | [-100, 100] | 53.813 | 76.650 | +42.44 |
>> | [-1000, 1000] | 84.459 | 115.106 | +36.29 |
>> | [-10000, 10000] | 93.980 | 123.320 | +31.22 ...
>
> Mohamed Issa has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:
>
> Add new tanh micro-benchmark that covers different ranges of input values
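For context, the high-magnitude check described in the quoted summary boils down to testing the |x| >= 22 range first. Conceptually it behaves like the plain-Java sketch below (an illustration of the described range ordering only, not the actual x86_64 stub code):

    static double tanhConceptualSketch(double x) {
        double ax = Math.abs(x);
        if (ax >= 22.0) {
            // tanh saturates here: the result rounds to +/-1.0, so return early.
            return Math.copySign(1.0, x);
        }
        // Smaller |x| falls through to the usual exp/polynomial based paths.
        return StrictMath.tanh(x);
    }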
Do you think we should tighten the ulp threshold in test/jdk/java/lang/Math/HyperbolicTests.java from the existing 3.0 to 2.5 to match the spec?
test/micro/org/openjdk/bench/java/lang/MathBench.java line 70:
> 68:
> 69: @Param("0")
> 70: public double tanhBound1;
Suggestion:
@Param({"0", "1", "2", "3"})
public int tanhRangeIndex;
test/micro/org/openjdk/bench/java/lang/MathBench.java line 73:
> 71:
> 72: @Param("2.7755575615628914E-17")
> 73: public double tanhBound2;
We can declare tanhRangeIndex as a parameter and then select from the [hard-coded value ranges](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/FdLibm.java#L3258), which would let us exercise all the special ranges as well as the NaN case.
double[][] tanhRangeArray = {{0.0, 0x1.0P-56}, {0x1.0P-56, 1.0}, {1.0, 22.0}, {22.0, Double.POSITIVE_INFINITY}};
double tanhRangeLowerBound = tanhRangeArray[tanhRangeIndex][0];
double tanhRangeUpperBound = tanhRangeArray[tanhRangeIndex][1];
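For completeness, a rough sketch of a trial-level setup that consumes these bounds (assuming the existing tanhPosVector/tanhNegVector/tanhValueCount fields and the tanhRangeArray above; the seed and capping choice are illustrative, not the exact PR code):

    @Setup(Level.Trial)
    public void setupTanhRange() {
        // Draw inputs uniformly from the selected [lower, upper) range; the
        // open-ended last range is capped at a large finite value.
        java.util.Random random = new java.util.Random(42);
        double lower = tanhRangeArray[tanhRangeIndex][0];
        double upper = tanhRangeArray[tanhRangeIndex][1];
        if (Double.isInfinite(upper)) {
            upper = 0x1.0P60;
        }
        tanhPosVector = new double[tanhValueCount];
        tanhNegVector = new double[tanhValueCount];
        for (int i = 0; i < tanhValueCount; i++) {
            double value = lower + (upper - lower) * random.nextDouble();
            tanhPosVector[i] = value;   // positive inputs
            tanhNegVector[i] = -value;  // mirrored negative inputs
        }
    }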
test/micro/org/openjdk/bench/java/lang/MathBench.java line 549:
> 547: for (int i = 0; i < tanhValueCount; i++) {
> 548: sum += Math.tanh(tanhPosVector[i]) + Math.tanh(tanhNegVector[i]);
> 549: }
You can remove noise from the benchmark by assigning the array element to a double field in an Invocation-level setup and then passing that field directly as the argument to tanh.
Refer to https://github.com/openjdk/jdk/blob/master/test/micro/org/openjdk/bench/java/util/ArraysSort.java#L109
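A minimal sketch of that pattern (field and method names here are illustrative, not from the PR):

    // Holds the input selected for the next invocation.
    double tanhNextInput;
    int tanhInputIndex;

    @Setup(Level.Invocation)
    public void pickNextTanhInput() {
        tanhNextInput = tanhPosVector[tanhInputIndex];
        tanhInputIndex = (tanhInputIndex + 1) % tanhValueCount;
    }

    @Benchmark
    public double tanhSingleInput() {
        return Math.tanh(tanhNextInput);
    }

Note that JMH warns about the overhead of Invocation-level setup for very short benchmarks, so it is worth comparing this variant against the loop-based one.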
test/micro/org/openjdk/bench/java/lang/MathBench.java line 551:
> 549: }
> 550: return sum;
> 551: }
Please also add benchmark kernels that receive constant inputs, e.g. Math.tanh(1.0).
The current handling for transcendental intrinsics creates a stub call node during parsing, which leaves no room to perform constant-folding Value transforms. Creating a macro IR node that runs through GVN optimization and lazily expands to a CallNode should fix this. We already have a similar JBS issue, https://bugs.openjdk.org/browse/JDK-8350831, but it's good to add a benchmark for now.
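A hedged sketch of what such constant-input kernels might look like (method names are illustrative); once something along the lines of JDK-8350831 lands, these calls should constant-fold away:

    @Benchmark
    public double tanhConstantSmall() {
        return Math.tanh(1.0);
    }

    @Benchmark
    public double tanhConstantLarge() {
        return Math.tanh(100.0);
    }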
-------------
PR Review: https://git.openjdk.org/jdk/pull/23889#pullrequestreview-2775553709
PR Review Comment: https://git.openjdk.org/jdk/pull/23889#discussion_r2048933110
PR Review Comment: https://git.openjdk.org/jdk/pull/23889#discussion_r2048942023
PR Review Comment: https://git.openjdk.org/jdk/pull/23889#discussion_r2048835497
PR Review Comment: https://git.openjdk.org/jdk/pull/23889#discussion_r2048864706
More information about the hotspot-compiler-dev mailing list