Math trig intrinsics and compiler options

Thu Aug 6 08:58:59 PDT 2009

gustav trede wrote:
> 2009/7/16 Christian Thalinger <Christian.Thalinger at sun.com 
> <mailto:Christian.Thalinger at sun.com>>
>
>     Azeem Jiva wrote:
>     > Joe,
>     >   Gustav sent me an email asking for help with the
>     intrinsification of
>     > the trig functions and a suggestion I gave him was to not call
>     > fsin/fcos/ftan since those instructions are microcoded on Intel/AMD
>     > hardware and very slow.  Slower than the call to
>     > sharedRuntimeTrig.cpp, and in all cases it's best to stay away from
>     > the hardware instructions.
>
>     I just did some micro-benchmarking on an Intel Core2 Duo and in the
>     range of [0,2pi) inlining the hardware instructions is slightly faster
>     (about 2.5%).  Limiting the range to [0,pi/4) (means no runtime calls)
>     hardware instructions are 1.5x faster.
>
>     I think we should keep the current approach.
>
>     -- Christian
>
>
>
> Neither the Linux nor the Windows platform has compiler optimizations 
> enabled; only Solaris does. It seems that when this was evaluated many 
> years ago, no other platform had working compilers.
> That fact alone is likely to make the fsin/fcos path faster than the C 
> version for the +-PI/4 range on those platforms.
>
> It's some work to check the current status of the different 
> platforms/compilers, i.e. whether they still produce bad code with 
> optimizations enabled.
> It is, however, reasonable to expect the compilers to have improved over 
> the years.

The code from the non-Sun C compilers is not "bad" per se; it is just 
bad in not implementing the desired semantics of the FDLIBM code, which 
is very sensitive to optimizations that are legal in C but defeat the 
purpose of the code.  The Sun C compiler can be sufficiently attuned to 
such floating-point needs under optimization; the other C compilers were 
not and I suspect still are not.
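
As an illustration of the kind of idiom at stake (a minimal sketch, not 
necessarily the specific failure referred to above): FDLIBM's rint adds 
and then subtracts 2^52 so that the intermediate rounding step does the 
work. Treated as real-number algebra the expression collapses to x, and 
the purpose of the code is lost. The class and method names below are 
hypothetical, and the sketch is written in Java, where the semantics are 
fixed, rather than in C.

```java
// Sketch of an optimization-sensitive floating-point idiom.
// Names are hypothetical; only the add/subtract-2^52 trick is from FDLIBM's rint.
final class RoundingIdiom {
    private static final double TWO52 = 0x1.0p52; // 2^52: doubles here are spaced 1 apart

    // Rounds x to an integral value (ties to even), valid for 0 <= x < 2^52.
    // The addition must be rounded to double before the subtraction happens.
    static double roundViaBigConstant(double x) {
        double w = TWO52 + x; // rounding to an integer happens in this step
        return w - TWO52;     // exact subtraction recovers the rounded value
    }

    // What the same expression becomes if it is simplified as real-number algebra.
    static double algebraicallySimplified(double x) {
        return x;
    }

    public static void main(String[] args) {
        double[] inputs = {0.3, 2.5, 2.7};
        for (double x : inputs) {
            System.out.println(x + " -> intended " + roundViaBigConstant(x)
                    + ", simplified " + algebraicallySimplified(x));
        }
    }
}
```

The two methods are equal as mathematics but not as floating-point 
code, which is why the compiler's treatment of such expressions matters.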

My preferred long-term approach is to port the FDLIBM C code to Java, 
which I've wanted to do for a while, but which has never bubbled to the 
top of my to-do list.

>
> Regarding the proposed patch, sharedRuntimeTrig.cpp usage for the 
> entire input range without external rounding:
> I compared against 3 input/output pairs that have leaked from the JCK, 
> and against the current Math implementation for many input/output 
> pairs, and I was unable to detect any differences.

What is many?  There are on the order of 2^64 inputs to check!
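
To put a number on it: at one billion evaluations per second, 2^64 
inputs would take roughly 585 years, so any practical comparison is a 
sample. Below is a minimal sketch of such a sampling check, assuming 
StrictMath.sin as the reference and Math.sin as the implementation under 
test. The class name, method name, sample count and seed are all 
hypothetical, and a clean run shows nothing about the inputs that were 
not sampled.

```java
import java.util.SplittableRandom;

// Sketch: compare Math.sin against StrictMath.sin on randomly sampled doubles.
// All names are hypothetical; this is not the test used for the patch.
final class TrigSampleCheck {
    static final class Result {
        final long samples;      // finite inputs actually compared
        final long mismatches;   // inputs where the two results differ bitwise
        final double maxUlpDiff; // largest difference seen, in ulps of the reference
        Result(long samples, long mismatches, double maxUlpDiff) {
            this.samples = samples;
            this.mismatches = mismatches;
            this.maxUlpDiff = maxUlpDiff;
        }
    }

    static Result compareSin(long samples, long seed) {
        SplittableRandom rnd = new SplittableRandom(seed);
        long checked = 0;
        long mismatches = 0;
        double maxUlp = 0.0;
        while (checked < samples) {
            // Uniform over bit patterns, so exponents are spread evenly;
            // this is not uniform over the real line.
            double x = Double.longBitsToDouble(rnd.nextLong());
            if (Double.isNaN(x) || Double.isInfinite(x)) {
                continue; // skip non-finite inputs
            }
            double actual = Math.sin(x);
            double reference = StrictMath.sin(x);
            checked++;
            if (Double.doubleToLongBits(actual) != Double.doubleToLongBits(reference)) {
                mismatches++;
                maxUlp = Math.max(maxUlp, Math.abs(actual - reference) / Math.ulp(reference));
            }
        }
        return new Result(checked, mismatches, maxUlp);
    }

    public static void main(String[] args) {
        Result r = compareSin(1_000_000L, 42L);
        System.out.println("samples=" + r.samples + " mismatches=" + r.mismatches
                + " maxUlpDiff=" + r.maxUlpDiff);
        // Time to enumerate every 64-bit pattern at 1e9 evaluations per second.
        double years = Math.pow(2, 64) / 1e9 / (365.25 * 24 * 3600);
        System.out.println("exhaustive check at 1e9/s: about " + Math.round(years) + " years");
    }
}
```

A mismatch here is not by itself a spec violation, since Math.sin only 
has to stay within 1 ulp of the exact result; the count just shows where 
the two implementations diverge.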

-Joe

>
> There is a consistent performance improvement for all input ranges; I 
> get up to 40% improvement on an Intel Core 2 on Solaris.
>
> It's hard for me to know whether there are corner cases that do require 
> the external rounding in order to stay within the spec; that's the 
> reason I asked for help here.
>
>
> -- 
> regards
>  gustav trede
>
>