RFR: 8348638: Performance regression in Math.tanh [v8]

Fri Apr 25 12:09:50 UTC 2025

On Fri, 25 Apr 2025 00:31:08 GMT, Mohamed Issa <duke at openjdk.org> wrote:

>> The changes described below are meant to resolve the performance regression introduced by the **x86_64 tanh** double precision floating point scalar intrinsic in #20657. Additionally, new constant value micro-benchmarks are included alongside a new micro-benchmark to check the performance of specific input value ranges to help prevent regressions in the future.
>> 
>> 1. Check and handle high magnitude input values before those in other ranges. If found, **+/- 1** is returned almost immediately without having to go through too many computations or branches.
>> 2. Reduce the lower bound of the input range that triggers a quick **+/- 1** return from **|x| >= 32** to **|x| >= 22**. This new endpoint is the exact value required for correctness that's used by the original OpenJDK implementation.
>> 
>> The results of all tests posted below were captured with an [Intel® Xeon 6761P](https://www.intel.com/content/www/us/en/products/sku/241842/intel-xeon-6761p-processor-336m-cache-2-50-ghz/specifications.html) using [OpenJDK v25-b15](https://github.com/openjdk/jdk/releases/tag/jdk-25%2B15) as the baseline version. The term _baseline1_ refers to runs with the intrinsic enabled and _baseline2_ refers to runs with the intrinsic disabled.
>> 
>> For the first set of performance data collected with the new built-in **tanhRange** micro-benchmark, see the tables below.  Each result is the mean of 8 individual runs, and the input ranges used match those in the bug report with two additional ones included. In the high value scenarios (100, 1000, 10000, 100000), the changes increase throughput values over _baseline1_. Also, there is a small negative impact to the low value (1, 2, 10, 20) scenarios compared to _baseline1_. When comparing against _baseline2_, the changes have significant uplift with the lower value inputs (1, 2, 10, 20, 100). However, they slightly lag behind _baseline2_ when the high value inputs (1000, 10000, 100000) are used.
>> 
>> | Input range(s)        | Baseline1 (ops/s) | Change (ops/s) | Change vs baseline1 (%) |
>> | :-------------------: | :-----------------: | :----------------: | :-------------------------: |
>> | [-1, 1]                     | 22671                  | 22190                | -2.12                               |
>> | [-2, 2]                     | 22680                  | 22191                | -2.16                               |
>> | [-10, 10]                 | 22683                  | 22149                | -2.35                          ...
>
> Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Switch to constant double fields with separate micro-benchmarks

Overall, the patch looks good to me now, apart from concerns around the benchmark. The existing Java implementation handles special cases upfront, thereby compromising the performance of the most common cases. The Java implementation scores above the intrinsic in two outlier ranges, < 2^-55 and > 22, while the intrinsic implementation is performant for the meaty generic range, i.e. > 2^-55 and < 22.0.
We get around a 30% performance uplift from the intrinsic implementation over the Java implementation for the bulky generic input range.
For ranges above 22.0, we now see better performance in comparison to the earlier intrinsic implementation.
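
To illustrate the ordering being discussed, here is a minimal Java sketch of checking the saturation range first. This is not the intrinsic itself (which is x86_64 assembly); the class name `TanhFastPath`, the method `tanhSketch`, and the use of `Math.tanh` as a stand-in for the general path are all assumptions made for illustration. Only the |x| >= 22 bound comes from the PR description.

```java
public final class TanhFastPath {
    // Bound from the PR description: |x| >= 22 returns +/- 1 directly.
    static final double SATURATION_BOUND = 22.0;

    // Hypothetical helper, not part of the JDK.
    static double tanhSketch(double x) {
        // High-magnitude inputs are tested before any other range,
        // so they return after a single comparison.
        // Infinities also take this branch; NaN fails the comparison
        // and falls through to the general path.
        if (Math.abs(x) >= SATURATION_BOUND) {
            return Math.copySign(1.0, x);
        }
        // Stand-in for the general computation (assumption).
        return Math.tanh(x);
    }

    public static void main(String[] args) {
        System.out.println(tanhSketch(100000.0));
        System.out.println(tanhSketch(-22.0));
        System.out.println(tanhSketch(0.5));
    }
}
```

The point of the sketch is only the branch order: inputs in the saturated range never reach the more expensive general path.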

The new benchmark shows a clear gain for the value ranges this patch optimizes (marked A, B, and C below).

Baseline:
=========
Benchmark                                       (tanhRangeIndex)   Mode  Cnt       Score   Error   Units
TanhPerf.TanhPerfConstant.tanhConstDouble1                   N/A  thrpt    2  117588.175          ops/ms
TanhPerf.TanhPerfConstant.tanhConstDouble21                  N/A  thrpt    2  117550.954          ops/ms
TanhPerf.TanhPerfConstant.tanhConstDoubleLarge               N/A  thrpt    2  117580.385          ops/ms  => A
TanhPerf.TanhPerfConstant.tanhConstDoubleSmall               N/A  thrpt    2  403652.485          ops/ms
TanhPerf.TanhPerfConstant.tanhConstDoubleTiny                N/A  thrpt    2  408909.294          ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     0  thrpt    2  397200.032          ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     1  thrpt    2  116082.297          ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     2  thrpt    2  112213.540          ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     3  thrpt    2  433899.459          ops/ms  => B
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     0  thrpt    2  396818.181          ops/ms   
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     1  thrpt    2  115886.117          ops/ms
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     2  thrpt    2  112048.023          ops/ms
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     3  thrpt    2  440250.930          ops/ms  => C

WithOpt:
========
Benchmark                                       (tanhRangeIndex)   Mode  Cnt       Score   Error   Units
TanhPerf.TanhPerfConstant.tanhConstDouble1                   N/A  thrpt    2  116459.753          ops/ms
TanhPerf.TanhPerfConstant.tanhConstDouble21                  N/A  thrpt    2  116454.242          ops/ms
TanhPerf.TanhPerfConstant.tanhConstDoubleLarge               N/A  thrpt    2  521156.905          ops/ms  => A
TanhPerf.TanhPerfConstant.tanhConstDoubleSmall               N/A  thrpt    2  400262.455          ops/ms
TanhPerf.TanhPerfConstant.tanhConstDoubleTiny                N/A  thrpt    2  400339.293          ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     0  thrpt    2  389451.159          ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     1  thrpt    2  115750.146          ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     2  thrpt    2  112043.952          ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     3  thrpt    2  481931.138          ops/ms  => B
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     0  thrpt    2  390072.384          ops/ms
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     1  thrpt    2  115738.869          ops/ms
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     2  thrpt    2  111868.620          ops/ms
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     3  thrpt    2  561509.564          ops/ms  => C
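
For reference, the relative gains on the rows marked A, B, and C can be worked out from the two tables above. The snippet below is a small sketch of that arithmetic; the class name `TanhUplift` and the helper `percentChange` are hypothetical, and the scores are copied from the Baseline and WithOpt rows.

```java
public final class TanhUplift {
    // Relative change of withOpt over baseline, in percent.
    static double percentChange(double baseline, double withOpt) {
        return (withOpt - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Scores in ops/ms, taken from the tables above.
        double a = percentChange(117580.385, 521156.905); // tanhConstDoubleLarge
        double b = percentChange(433899.459, 481931.138); // tanhNegRangeDouble, index 3
        double c = percentChange(440250.930, 561509.564); // tanhPosDoubleRange, index 3
        System.out.printf("A: %+.2f%%%n", a);
        System.out.printf("B: %+.2f%%%n", b);
        System.out.printf("C: %+.2f%%%n", c);
    }
}
```

This works out to roughly a 4.4x throughput on A, and gains of about 11% on B and 27.5% on C. With Cnt = 2 and no error column reported, these figures are best read as indicative.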

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23889#issuecomment-2830244448