RFR: 8348638: Performance regression in Math.tanh [v2]

Wed Apr 2 23:13:49 UTC 2025

On Wed, 2 Apr 2025 18:29:34 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> src/hotspot/cpu/x86/stubGenerator_x86_64_tanh.cpp line 331:
>> 
>>> 329:   __ andl(rdx, rcx);
>>> 330:   __ andl(rcx, 32767);
>>> 331:   __ cmpl(rcx, 16438);
>> 
>> Did you try using "UCOMISD" to directly compare with constant 22.0
>
> [perf_tanh_delimit.txt](https://github.com/user-attachments/files/19573617/perf_tanh_delimit.txt)
> 
> Proposed sequence in micro2 shows better path length,  please give this a try.

So, I didn't try using "UCOMISD" to directly compare because I thought it wouldn't provide a benefit over the existing approach. Thanks for providing these micros though as I think they prove my suspicions. To explain, I'll start with results from the code you provided on the [Intel® Xeon 6761P](https://www.intel.com/content/www/us/en/products/sku/241842/intel-xeon-6761p-processor-336m-cache-2-50-ghz/specifications.html) machine.

**./perf_tanh_delimit 21.0** -> _micro1=184 ms, micro2=186 ms_
**./perf_tanh_delimit 22.0** -> _micro1=188 ms, micro2=186 ms_
**./perf_tanh_delimit 23.0** -> _micro1=187 ms, micro2=183 ms_

Please note that results with inputs strictly less than 22.0 aren't really meaningful because they would go into the heavy compute path of the actual implementations. Still, I included one for reference. Here we can see "UCOMISD" shows some improvement over the existing approach. Of course, this uplift will vary on different platforms. Unfortunately, the sequences provided only cover the positive inputs. To get a better picture, we need one that covers both positive and negative inputs. I created one with corresponding results linked below.

[perf_tanh_delimit2.txt](https://github.com/user-attachments/files/19576620/perf_tanh_delimit2.txt)

**./perf_tanh_delimit2 -23.0** -> _micro1=179 ms, micro2=184 ms_
**./perf_tanh_delimit2 -22.0** -> _micro1=176 ms, micro2=178 ms_
**./perf_tanh_delimit2 -21.0** -> _micro1=185 ms, micro2=181 ms_
**./perf_tanh_delimit2 21.0** -> _micro1=190 ms, micro2=179 ms_
**./perf_tanh_delimit2 22.0** -> _micro1=189 ms, micro2=185 ms_
**./perf_tanh_delimit2 23.0** -> _micro1=187 ms, micro2=185 ms_

Again, the _|x| < 22.0_ inputs aren't relevant because they don't trigger any significant computations. With that in mind, the positive inputs show improvements while the negative inputs don't. The situation would be reversed if I checked for negative input values first. We need two uses of "UCOMISD" to cover positive and negative inputs. Whereas "PEXTRW" is only required once to cover the sign and magnitude. Also, other parts of the intrinsic implementation rely on it, so I don't think we should make those blocks worse by using "UCOMISD" without getting a clear boost from it.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/23889#discussion_r2025719541