RFR: JDK-8302736: Major performance regression in Math.log on aarch64
Tobias Hartmann
thartmann at openjdk.org
Wed May 10 12:52:37 UTC 2023
On Mon, 24 Apr 2023 08:10:02 GMT, Tobias Holenstein <tholenstein at openjdk.org> wrote:
> ### Performance java.lang.Math exp, log, log10, pow and tan
> The class `java.lang.Math` contains methods for performing basic numeric operations such as the elementary exponential, logarithm, square root, and trigonometric functions. The numeric methods of class `java.lang.StrictMath` are defined to return the bit-for-bit same results on all platforms. The implementations of the equivalent functions in class `java.lang.Math` do not have this requirement. This relaxation permits better-performing implementations where strict reproducibility is not required. By default most of the `java.lang.Math` methods simply call the equivalent method in `java.lang.StrictMath` for their implementation. Code generators (like C2) are encouraged to use platform-specific native libraries or microprocessor instructions, where available, to provide higher-performance implementations of `java.lang.Math` methods. Such higher-performance implementations still must conform to the specification for `java.lang.Math`.
>
> Running JMH benchmarks `org.openjdk.bench.java.lang.StrictMathBench` and `org.openjdk.bench.java.lang.MathBench` on `aarch64` shows that for `exp`, `log`, `log10`, `pow` and `tan` `java.lang.Math` is around 10x slower than `java.lang.StrictMath` - which is NOT expected.
>
> ### Reason for major performance regression
> If an intrinsic is implemented, as for `Math.sin` and `Math.cos`, C2 generates a call to a `StubRoutines` stub.
> Unfortunately, on `macOS aarch64` there are no intrinsics for `Math.tan`, `Math.exp`, `Math.log`, `Math.pow` and `Math.log10` yet.
>
> _Tracked here:_
> [JDK-8189106 AARCH64: create intrinsic for tan - Java Bug System](https://bugs.openjdk.org/browse/JDK-8189106)
> [JDK-8189107 AARCH64: create intrinsic for pow - Java Bug System](https://bugs.openjdk.org/browse/JDK-8189107)
> [JDK-8307332 AARCH64: create intrinsic for exp - Java Bug System](https://bugs.openjdk.org/browse/JDK-8307332)
> [JDK-8210858 AArch64: Math.log intrinsic gives incorrect results - Java Bug System](https://bugs.openjdk.org/browse/JDK-8210858)
>
> Instead, for `Math.tan`, `Math.exp`, `Math.log`, `Math.pow` and `Math.log10` a call to a C++ function is generated in `LibraryCallKit::inline_math_native`, for example with `CAST_FROM_FN_PTR(address, SharedRuntime::dlog)`.
>
> The shared runtime functions are implemented in `sharedRuntimeTrans.cpp` as follows:
> ```c++
> JRT_LEAF(jdouble, SharedRuntime::dlog(jdouble x))
>   return __ieee754_log(x);
> JRT_END
> ```
>
> `JRT_LEAF` uses `VM_LEAF_BASE`, which puts a write lock on the code cache:
> ```c++
> MACOS_AARCH64_ONLY(ThreadWXEnable __wx(WXWrite, JavaThread::current()));
> ```
>
> This lock causes the 10x slowdown. Since the shared runtime functions do not access the code cache, the lock is not needed.
>
> ### Side note about WXWrite
> On Apple Silicon the Write/Execute (W^X) lock is a new Hardened Runtime capability, see:
> https://developer.apple.com/documentation/apple-silicon/porting-just-in-time-compilers-to-apple-silicon
>
> It prevents memory regions from being writable and executable at the same time. Therefore, we need to acquire `WXWrite` when we want to write to the code cache.
>
> ### Solution: moving WXWrite from JRT_LEAF
> At the moment the `WXWrite` lock is too coarse-grained. This fix removes the `WXWrite` lock from `VM_LEAF_BASE` and moves it further down in the call hierarchy. This resolves the performance issue because the shared runtime functions in `sharedRuntimeTrans.cpp` can now be called without the `WXWrite` lock. Overall this change gives performance improvements of 10x for `Math.tan`, `Math.exp`, `Math.log`, `Math.pow` and `Math.log10` on specific JMH benchmarks. Further, it also gives performance improvements of up to 8%, for example on `SPECjvm2008-XML.transform` on `macOS aarch64`.
Nice analysis, Toby. This point fix looks good to me.
As @theRealAph mentioned in the bug comments, and since there are other coarse-grained usages of `ThreadWXEnable` in the code (for example, in the `VM/JRT_ENTRY` macros), please file a follow-up RFE to improve this situation. The `ThreadWXEnable` should be as close as possible to the code that does the actual write access to the code cache.
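To make the "as close as possible" point concrete, here is a minimal standalone sketch. It is not HotSpot code: `FakeThread`, `ScopedWX`, `dlog_coarse`, `dlog_fine` and `patch_code` are hypothetical names, and the W^X state is modelled as a plain enum with a counter instead of a real per-thread permission switch. It only illustrates how a scoped guard in a leaf wrapper pays two mode switches per call, while a guard placed in the routine that actually writes code costs nothing on the math path.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the per-thread W^X state (assumption, not HotSpot).
enum class WXMode { Exec, Write };

struct FakeThread {
    WXMode mode = WXMode::Exec;
    long switches = 0;  // counts mode flips, the costly part on real hardware
    void set_mode(WXMode m) {
        if (m != mode) { mode = m; ++switches; }
    }
};

// RAII guard loosely modelled on ThreadWXEnable: switch on entry, restore on exit.
class ScopedWX {
    FakeThread& thread_;
    WXMode old_mode_;
public:
    ScopedWX(FakeThread& t, WXMode m) : thread_(t), old_mode_(t.mode) {
        thread_.set_mode(m);
    }
    ~ScopedWX() { thread_.set_mode(old_mode_); }
    ScopedWX(const ScopedWX&) = delete;
    ScopedWX& operator=(const ScopedWX&) = delete;
};

// Coarse-grained: the guard sits in the leaf wrapper, so every call pays
// two switches even though computing a logarithm never writes code.
double dlog_coarse(FakeThread& t, double x) {
    ScopedWX wx(t, WXMode::Write);
    return std::log(x);
}

// Fine-grained: the pure math leaf takes no guard at all.
double dlog_fine(FakeThread&, double x) {
    return std::log(x);
}

// Only the routine that really writes to the (fake) code buffer takes the guard.
void patch_code(FakeThread& t, unsigned char* code, unsigned char byte) {
    assert(code != nullptr);
    ScopedWX wx(t, WXMode::Write);
    code[0] = byte;
}

// Helpers that report how many mode switches a run of leaf calls triggers.
long count_switches_coarse(int calls) {
    FakeThread t;
    double sum = 0.0;
    for (int i = 0; i < calls; ++i) sum += dlog_coarse(t, i + 1.0);
    (void)sum;
    return t.switches;
}

long count_switches_fine(int calls) {
    FakeThread t;
    double sum = 0.0;
    for (int i = 0; i < calls; ++i) sum += dlog_fine(t, i + 1.0);
    (void)sum;
    return t.switches;
}
```

In this model, `count_switches_coarse(n)` grows linearly with the number of calls, whereas `count_switches_fine(n)` stays at zero and only `patch_code` triggers a switch pair. The real cost per switch on Apple Silicon is not modelled here.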
-------------
Marked as reviewed by thartmann (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/13606#pullrequestreview-1420558921