RFR: 8348638: Performance regression in Math.tanh [v8]
Jatin Bhateja
jbhateja at openjdk.org
Fri Apr 25 12:09:50 UTC 2025
On Fri, 25 Apr 2025 00:31:08 GMT, Mohamed Issa <duke at openjdk.org> wrote:
>> The changes described below are meant to resolve the performance regression introduced by the **x86_64 tanh** double precision floating point scalar intrinsic in #20657. Additionally, new constant value micro-benchmarks are included alongside a new micro-benchmark to check the performance of specific input value ranges to help prevent regressions in the future.
>>
>> 1. Check and handle high magnitude input values before those in other ranges. If found, **+/- 1** is returned almost immediately without having to go through too many computations or branches.
>> 2. Reduce the lower bound of the input range that triggers a quick **+/- 1** return from **|x| >= 32** to **|x| >= 22**. This new endpoint is the exact value required for correctness that's used by the original OpenJDK implementation.
>>
>> The results of all tests posted below were captured with an [Intel® Xeon 6761P](https://www.intel.com/content/www/us/en/products/sku/241842/intel-xeon-6761p-processor-336m-cache-2-50-ghz/specifications.html) using [OpenJDK v25-b15](https://github.com/openjdk/jdk/releases/tag/jdk-25%2B15) as the baseline version. The term _baseline1_ refers to runs with the intrinsic enabled and _baseline2_ refers to runs with the intrinsic disabled.
>>
>> For the first set of performance data collected with the new built-in **tanhRange** micro-benchmark, see the tables below. Each result is the mean of 8 individual runs, and the input ranges used match those in the bug report with two additional ones included. In the high value scenarios (100, 1000, 10000, 100000), the changes increase throughput values over _baseline1_. Also, there is a small negative impact to the low value (1, 2, 10, 20) scenarios compared to _baseline1_. When comparing against _baseline2_, the changes have significant uplift with the lower value inputs (1, 2, 10, 20, 100). However, they slightly lag behind _baseline2_ when the high value inputs (1000, 10000, 100000) are used.
>>
>> | Input range(s) | Baseline1 (ops/s) | Change (ops/s) | Change vs baseline1 (%) |
>> | :-------------------: | :-----------------: | :----------------: | :-------------------------: |
>> | [-1, 1] | 22671 | 22190 | -2.12 |
>> | [-2, 2] | 22680 | 22191 | -2.16 |
>> | [-10, 10] | 22683 | 22149 | -2.35 ...
>
> Mohamed Issa has updated the pull request incrementally with one additional commit since the last revision:
>
> Switch to constant double fields with separate micro-benchmarks
Overall, the patch looks good to me now, apart from concerns about the benchmark. The existing Java implementation handles special cases up front, thereby compromising the performance of the most common cases. The Java implementation scores above the intrinsic in two outlier ranges, < 2^-55 and > 22, while the intrinsic implementation performs well across the large generic range, i.e. > 2^-55 and < 22.0.
We get around a 30% performance uplift from the intrinsic implementation over the Java implementation for the large generic input range.
For ranges above 22.0, we now see better performance in comparison to the earlier intrinsic implementation.
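To make the three input ranges discussed here concrete, the following is a minimal Java sketch of the dispatch order the patch describes: saturate for |x| >= 22 first, then handle tiny inputs, then fall through to the generic range. The class and method names (`TanhRangeSketch`, `tanhSketch`) are hypothetical, and the generic-range formula based on `Math.expm1` is a stand-in chosen for illustration; the actual intrinsic is a hand-written x86_64 stub and is not reproduced here.

```java
public final class TanhRangeSketch {
    // Bounds taken from the discussion: |x| >= 22 saturates to +/-1,
    // and for |x| < 2^-55 the result is taken to be x itself.
    private static final double SATURATION_BOUND = 22.0;
    private static final double TINY_BOUND = 0x1.0p-55;

    // Hypothetical helper, not the HotSpot intrinsic.
    static double tanhSketch(double x) {
        if (Double.isNaN(x)) {
            return x; // NaN propagates
        }
        double ax = Math.abs(x);
        // 1. High-magnitude inputs are checked first and return almost immediately.
        if (ax >= SATURATION_BOUND) {
            return Math.copySign(1.0, x);
        }
        // 2. Tiny inputs: returning x also preserves signed zero.
        if (ax < TINY_BOUND) {
            return x;
        }
        // 3. Generic range: tanh(|x|) = expm1(2|x|) / (expm1(2|x|) + 2).
        double t = Math.expm1(2.0 * ax);
        return Math.copySign(t / (t + 2.0), x);
    }

    public static void main(String[] args) {
        double[] inputs = {-100.0, -1.0, 0x1.0p-60, 1.0, 21.0, 22.0, 1000.0};
        for (double x : inputs) {
            System.out.println(x + " -> " + tanhSketch(x) + " (Math.tanh: " + Math.tanh(x) + ")");
        }
    }
}
```

The ordering is the point of the sketch: with the saturation check first, inputs above the bound skip every other branch, which is the case the patch targets.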
The new benchmark shows a clear gain for the value ranges this patch optimizes (marked A, B, and C below).
Baseline:
=========
Benchmark                                       (tanhRangeIndex)   Mode  Cnt       Score  Error  Units
TanhPerf.TanhPerfConstant.tanhConstDouble1                   N/A  thrpt    2  117588.175         ops/ms
TanhPerf.TanhPerfConstant.tanhConstDouble21                  N/A  thrpt    2  117550.954         ops/ms
TanhPerf.TanhPerfConstant.tanhConstDoubleLarge               N/A  thrpt    2  117580.385         ops/ms  => A
TanhPerf.TanhPerfConstant.tanhConstDoubleSmall               N/A  thrpt    2  403652.485         ops/ms
TanhPerf.TanhPerfConstant.tanhConstDoubleTiny                N/A  thrpt    2  408909.294         ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     0  thrpt    2  397200.032         ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     1  thrpt    2  116082.297         ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     2  thrpt    2  112213.540         ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     3  thrpt    2  433899.459         ops/ms  => B
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     0  thrpt    2  396818.181         ops/ms
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     1  thrpt    2  115886.117         ops/ms
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     2  thrpt    2  112048.023         ops/ms
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     3  thrpt    2  440250.930         ops/ms  => C
WithOpt:
========
Benchmark                                       (tanhRangeIndex)   Mode  Cnt       Score  Error  Units
TanhPerf.TanhPerfConstant.tanhConstDouble1                   N/A  thrpt    2  116459.753         ops/ms
TanhPerf.TanhPerfConstant.tanhConstDouble21                  N/A  thrpt    2  116454.242         ops/ms
TanhPerf.TanhPerfConstant.tanhConstDoubleLarge               N/A  thrpt    2  521156.905         ops/ms  => A
TanhPerf.TanhPerfConstant.tanhConstDoubleSmall               N/A  thrpt    2  400262.455         ops/ms
TanhPerf.TanhPerfConstant.tanhConstDoubleTiny                N/A  thrpt    2  400339.293         ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     0  thrpt    2  389451.159         ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     1  thrpt    2  115750.146         ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     2  thrpt    2  112043.952         ops/ms
TanhPerf.TanhPerfRanges.tanhNegRangeDouble                     3  thrpt    2  481931.138         ops/ms  => B
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     0  thrpt    2  390072.384         ops/ms
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     1  thrpt    2  115738.869         ops/ms
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     2  thrpt    2  111868.620         ops/ms
TanhPerf.TanhPerfRanges.tanhPosDoubleRange                     3  thrpt    2  561509.564         ops/ms  => C
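For reference, the gains on the three marked rows can be worked out directly from the scores above. The small Java sketch below divides the WithOpt score by the Baseline score for A, B, and C; the class and method names are hypothetical, and only the scores come from the runs above. The ratios come out to roughly 4.4x for A, 1.11x for B, and 1.28x for C. With Cnt = 2 and no error column reported, these should be read as indicative rather than precise.

```java
public final class TanhSpeedup {
    // Ratio of patched throughput to baseline throughput (both in ops/ms).
    static double speedup(double baselineOpsPerMs, double patchedOpsPerMs) {
        return patchedOpsPerMs / baselineOpsPerMs;
    }

    public static void main(String[] args) {
        // Scores copied from the Baseline and WithOpt runs above.
        System.out.printf("A (tanhConstDoubleLarge): %.2fx%n", speedup(117580.385, 521156.905));
        System.out.printf("B (tanhNegRangeDouble 3): %.2fx%n", speedup(433899.459, 481931.138));
        System.out.printf("C (tanhPosDoubleRange 3): %.2fx%n", speedup(440250.930, 561509.564));
    }
}
```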
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23889#issuecomment-2830244448