RFR: 8348638: Performance regression in Math.tanh

Sat Mar 22 00:50:33 UTC 2025

The changes described below are meant to resolve the performance regression introduced by the **x86_64 tanh** double-precision floating-point scalar intrinsic in #20657.

1. Check and handle high-magnitude input values before those in other ranges. If such a value is found, **+/- 1** is returned almost immediately, with few computations or branches.
2. Reduce the lower bound of the input range that triggers a quick **+/- 1** return from **|x| >= 32** to **|x| >= 20**. This new endpoint is the closest value above the minimum required for correctness (**55 * ln(2) / 2**) that can be tested when only the topmost word of the input register is retrieved.
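
As an illustration of the second point, here is a minimal Java sketch of why **20** falls out of a top-word comparison. It assumes that "topmost word" means the upper 16 bits of the IEEE 754 double (sign, 11 exponent bits, top 4 mantissa bits). The class and method names are hypothetical, and `Math.tanh` only stands in for the remaining input ranges; the actual change lives in the x86_64 stub, not in Java code.

```java
public class TanhQuickReturnSketch {
    // Assumption: "topmost word" means the upper 16 bits of the IEEE 754
    // double: 1 sign bit, 11 exponent bits, and the top 4 mantissa bits.
    static int topWord(double x) {
        return (int) (Double.doubleToRawLongBits(x) >>> 48);
    }

    // Minimum bound quoted in the description: 55 * ln(2) / 2 (about 19.06).
    static double minimumBound() {
        return 55 * Math.log(2) / 2;
    }

    // Smallest positive double >= bound whose low 48 bits are all zero, i.e.
    // the smallest threshold that a top-word comparison can express exactly.
    static double smallestTopWordThreshold(double bound) {
        long bits = Double.doubleToRawLongBits(bound);
        long top = bits >>> 48;
        if ((bits & 0x0000FFFFFFFFFFFFL) != 0) {
            top++; // round up to the next representable top word
        }
        return Double.longBitsToDouble(top << 48);
    }

    // Hypothetical Java rendering of the quick return; the real change is in
    // the x86_64 assembly stub, not in Java code.
    static double tanhWithQuickReturn(double x) {
        if (Double.isNaN(x)) {
            return x; // NaN must stay NaN, so keep it off the quick path
        }
        int magnitude = topWord(x) & 0x7FFF; // drop the sign bit
        if (magnitude >= 0x4034) {           // 0x4034 is the top word of 20.0
            return Math.copySign(1.0, x);
        }
        return Math.tanh(x); // stand-in for the remaining ranges
    }

    public static void main(String[] args) {
        System.out.println(minimumBound());
        System.out.println(smallestTopWordThreshold(minimumBound()));
        System.out.printf("0x%04X%n", topWord(20.0));
    }
}
```

Under that assumption, values in [16, 32) share one exponent and the top word resolves them in steps of 1, so the candidates around the minimum are 19 (too small) and 20. Running `main` should report 20.0 as the threshold and 0x4034 as its top word.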

The results of all tests posted below were captured with an [Intel® Xeon 6761P](https://www.intel.com/content/www/us/en/products/sku/241842/intel-xeon-6761p-processor-336m-cache-2-50-ghz/specifications.html) using [OpenJDK v24-b33](https://github.com/openjdk/jdk/releases/tag/jdk-24%2B33) as the baseline version.

For performance data collected with the regression micro-benchmark referenced in the bug report, see the table below. Each result is the mean of 3 individual runs. In the high value scenarios (100, 1000, 10000, 100000), the changes significantly improve execution times, to the point where they are almost at parity with the baseline: within about 4% for 1000 and above, and about 11% faster than the baseline for 100. Also, there is almost no impact on the low value (1, 2) scenarios, where the results with and without the fix differ by less than 1%.

| Input range (+/-) | Baseline (ms) | No fix (ms) | With fix (ms) | No fix vs baseline (%) | Fix vs baseline (%) |
| :------------------: | :-------------: | :-----------: | :-------------: | :----------------------: | :-------------------: |
| 1                          | 1842              | 1961           | 1969              | +6.46                          | +6.89                     |
| 2                          | 2102              | 2010           | 1998              | -4.38                           | -4.95                      |
| 100                      | 801                | 1018           | 716                | +27.09                        | -10.61                    |
| 1000                    | 498                | 803             | 519                | +61.24                        | +4.22                     |
| 10000                  | 474                | 755             | 491                | +59.28                        | +3.59                     |
| 100000                | 473                | 758             | 491                | +60.25                        | +3.81                     |
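
The percentage columns appear to be plain relative differences against the baseline. The following minimal sketch recomputes the row for the 100 range under that assumption; the class and method names are hypothetical and are not part of the benchmark.

```java
public class PercentDeltaSketch {
    // Relative difference of a measured time against the baseline, in percent.
    // Positive values mean slower than the baseline, negative values faster.
    static double percentDelta(double measuredMs, double baselineMs) {
        return (measuredMs - baselineMs) / baselineMs * 100.0;
    }

    public static void main(String[] args) {
        double baseline = 801; // input range 100, baseline (ms)
        double noFix = 1018;   // input range 100, no fix (ms)
        double withFix = 716;  // input range 100, with fix (ms)
        System.out.printf("No fix vs baseline: %+.2f%%%n", percentDelta(noFix, baseline));
        System.out.printf("Fix vs baseline:    %+.2f%%%n", percentDelta(withFix, baseline));
    }
}
```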

For performance data collected with the built-in **tanh** micro-benchmark, see the table below. Each result is the mean of 8 individual runs. Overall, the changes have no significant impact (under 1%), so the uplift provided by the original implementation of the intrinsic remains.

| Benchmark                     | Throughput without fix (op/s) | Throughput with fix (op/s) | Fix vs No Fix (%) |
| :-------------------------: | :-------------------------------: | :----------------------------: | :-----------------: |
| MathBench.tanhDouble | 103581                                    | 102610                                | -0.94                   |

Finally, the `jtreg:test/jdk/java/lang/Math/HyperbolicTests.java` test passed with the changes.

-------------

Commit messages:
 - Lightly restructure x86_64 tanh intrinsic implementation to resolve performance regressions found for special inputs

Changes: https://git.openjdk.org/jdk/pull/23889/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=23889&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8348638
  Stats: 23 lines in 1 file changed: 6 ins; 7 del; 10 mod
  Patch: https://git.openjdk.org/jdk/pull/23889.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/23889/head:pull/23889

PR: https://git.openjdk.org/jdk/pull/23889