RFR: 8265768 [aarch64] Use glibc libm impl for dlog, dlog10, dexp iff 2.29 or greater on AArch64.
gregcawthorne
github.com+73799211+gregcawthorne at openjdk.java.net
Mon May 24 18:50:29 UTC 2021
On Thu, 15 Apr 2021 08:33:47 GMT, gregcawthorne <github.com+73799211+gregcawthorne at openjdk.org> wrote:
> Glibc 2.29 onwards provides optimised versions of log,log10,exp.
> These functions have an accuracy of 0.9ulp or better in glibc
> 2.29.
>
> Therefore this patch adds code to parse, store and check
> the runtime glibcs version in os_linux.cpp/hpp.
> This is then used to select the glibcs implementation of
> log, log10, exp at runtime for c1 and c2, iff we have
> glibc 2.29 or greater.
>
> This will ensure OpenJDK can benefit from future improvements
> to glibc.
>
> Glibc adheres to the ieee754 standard, unless stated otherwise
> in its spec.
>
> As there are no stated exceptions in the current glibc spec
> for dlog, dlog10 and dexp, we can assume they currently follow
> ieee754 (which testing confirms). As such, future versions of
> glibc are unlikely to lose this compliance with ieee754.
>
> W.r.t. performance, this patch sees ~15-30% performance improvements
> for log and log10, and ~50-80% performance improvements for exp, over
> the common input ranges (those producing finite real outputs). However
> for the NaN and inf output ranges we see a slowdown of up to a factor
> of 2 for some functions and architectures.
>
> Due to this being the uncommon case we assert that this is a
> worthwhile tradeoff.
I have been reading up on the monotonicity paper suggested by Andrew Haley:
http://www-leland.stanford.edu/class/ee486/doc/ferguson1991.pdf
to see whether I can prove the current glibc implementations of log and exp monotonic.
However, I have come to the conclusion that the paper calculates the relative error threshold under which an approximation is monotonic, and then relies on extra bits of floating-point hardware precision to guarantee it. These extra bits of precision exceed the target representation's mantissa bits, so that the final rounding (rounding being semi-monotonic) yields a monotonic implementation. AArch64 carries no extra bits of floating-point precision between floating-point operations, so this paper won't help us in this case.
I am currently unsure how the current implementation (from fdlibm I believe) is proven to be monotonic. Perhaps there is a proof I am unaware of. The implementation has obviously stood the test of time at least.
If anyone has an idea of how the existing implementation was proven to be monotonic (if it has been), it could help us to apply it on the Arm optimised routine version (AOR).
What we can do for now is compare the Remez approximation used in the current OpenJDK implementation (take log for example):
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/sharedRuntimeTrans.cpp#L49
It states:
" The maximum error of this polynomial approximation is bounded by 2**-58.45"
Which must be the theoretical accuracy, as it would be bounded by the 52 bits of mantissa if run on AArch64. And is insufficient for a proof of monotonicity on its own.
Now if you look at the AOR for log (where the implementation comes from):
https://github.com/ARM-software/optimized-routines/blob/master/math/tools/log.sollya
https://github.com/ARM-software/optimized-routines/blob/master/math/tools/log_abs.sollya
and run them, you will see absolute and relative accuracies of ~2**-63 for log.sollya.
For log_abs.sollya, which is used for inputs around 1.0, there is an absolute accuracy of ~2**-65 and a relative error of ~2**-56.
So for log the theoretical accuracy is actually higher than the current implementation's, except near 1.0, where the relative error is slightly worse. However, I have confirmed with Szabolcs Nagy at Arm, who worked on the implementation, that it is the absolute error here which dictates the effective accuracy, as there is arithmetic afterwards which changes the magnitude.
As for the existing accuracy of exp, the OpenJDK implementation's code comments state that the maximum theoretical error of its Remez approximation is 2**-59:
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/sharedRuntimeTrans.cpp#L238
Meanwhile, running the exp sollya script in AOR shows its Remez approximation has a theoretical accuracy of 2**-66:
https://github.com/ARM-software/optimized-routines/blob/master/math/tools/exp.sollya
Another thing to consider is the reconstruction process of the current fdlibm implementation versus glibc's. glibc's exp and log use a table-lookup algorithm, which allows their polynomials to have a smaller principal domain around 0, with the result then transformed back to the larger domain, whereas fdlibm's does not use this method.
A description of the table-lookup scheme can be found here:
https://dl.acm.org/doi/pdf/10.1145/63522.214389
And an analysis of the error bounds of table-lookup-type approximations of functions can be found here:
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=709374
If the Remez polynomial of fdlibm isn't definitively proven to be monotonic, and glibc's Remez polynomials have comparable accuracy, then it might suffice to investigate further the effects of the table-lookup implementation with regard to semi-monotonicity, even if that wouldn't offer a total proof of the overall implementation.
My initial look at glibc's exp implementation suggests that the relationship between the scale and tail variables, along with the table values themselves possibly being monotonic, might indicate that it won't break monotonicity. However, this would need further investigation, and similarly for log, if this subsection's proof of monotonicity suffices for the overall implementation.
-------------
PR: https://git.openjdk.java.net/jdk/pull/3510