PPC64: Poor StrictMath performance due to non-optimized compilation

Thu Nov 17 22:48:07 UTC 2016

Hi Joe,

Although I am neither a floating-point expert (as I think I've proven to you over the years) nor a gcc expert, I checked with our in-house gcc expert and got the following answer:

	"Yes, using -fno-strict-aliasing fixes the issues. Also, there are many forks of fdlibm which have this fixed, including the code inside glibc."

FWIW,
 - Derek

-----Original Message-----
From: hotspot-dev [mailto:hotspot-dev-bounces at openjdk.java.net] On Behalf Of Chris Plummer
Sent: Thursday, November 17, 2016 4:49 PM
To: joe darcy <joe.darcy at oracle.com>; Gustavo Romero <gromero at linux.vnet.ibm.com>; ppc-aix-port-dev at openjdk.java.net; hotspot-dev at openjdk.java.net; core-libs-dev at openjdk.java.net
Cc: build-dev <build-dev at openjdk.java.net>
Subject: Re: PPC64: Poor StrictMath performance due to non-optimized compilation

On 11/17/16 1:33 PM, joe darcy wrote:
> Hi Gustavo,
>
>
> On 11/17/2016 10:31 AM, Gustavo Romero wrote:
>> Hi Joe,
>>
>> Thanks a lot for your valuable comments.
>>
>> On 17-11-2016 15:35, joe darcy wrote:
>>>> Currently, optimization for building fdlibm is disabled, except for 
>>>> the "solaris" OS target [1].
>>> The reason for that is because historically the Solaris compilers 
>>> have had sufficient discipline and control regarding floating-point 
>>> semantics and compiler optimizations to still implement the 
>>> Java-mandated results when optimization was enabled. The gcc family 
>>> of compilers, for example, has lacked such discipline.
>> Oh, I see. Thanks for clarifying that. I was wondering exactly why
>> fdlibm optimization is off even for x86_64 since, AFAICS with gcc 5
>> only, it does not affect the precision, even if setting -O3 does
>> not improve the performance as much as on PPC64.
>
> The fdlibm code relies on aliasing a two-element array of int with a 
> double to do bit-level reads and writes of floating-point values. As I 
> understand it, the C spec allows compilers to assume values of 
> different types don't overlap in memory. The compilation environment 
> has to be configured in such a way that the C compiler disables code 
> generation and optimization techniques that would run afoul of these 
> fdlibm coding practices.
This is the strict aliasing issue, right? It's a long-standing problem with fdlibm that kept getting worse as gcc got smarter. IIRC, compiling with -fno-strict-aliasing fixes it, but it's been more than 12 years since I last dealt with fdlibm and compiler aliasing issues.

Chris
>
>>>> As a consequence, on PPC64 (Linux) StrictMath methods such as, but not
>>>> limited to, sin(), cos(), and tan() perform very poorly in
>>>> comparison to the same methods in the Math class [2]:
>>> If you are doing your work against JDK 9, note that the pow, hypot, 
>>> and cbrt fdlibm methods required by StrictMath have been ported to 
>>> Java (JDK-8134780: Port fdlibm to Java). I have intentions to port 
>>> the remaining methods to Java, but it is unclear whether or not this 
>>> will occur for JDK 9.
>> Yes, I'm doing my work against 9. So is there any problem if I 
>> proceed with my change? I understand that there is no conflict as 
>> JDK-8134780 progresses and replaces the StrictMath methods by their 
>> counterparts in Java.
>> Please advise.
>
> If I manage to finish the fdlibm C -> Java port in JDK 9, the changes 
> you are proposing would eventually be removed as unneeded since the C 
> code wouldn't be there to get compiled anymore.
>
>>
>> Is it intended to downport JDK-8134780 to 8?
>
> Such a backport would be technically possible, but we at Oracle don't 
> currently plan to do so.
>
>>
>>
>>> Methods in the Math class, such as pow, are often intrinsified and 
>>> use a different algorithm so a straight performance comparison may 
>>> not be as fair or meaningful in those cases.
>> I agree. It's just that the issue on StrictMath methods was first 
>> noted due to that huge gap (Math vs StrictMath) on PPC64, which is 
>> not prominent on x64.
>
> Depending on how Math.{sin, cos} is implemented on PPC64, compiling 
> the fdlibm sin/cos with more aggressive optimizations should not be 
> expected to close the performance gap. In particular, if Math.{sin, 
> cos} is an intrinsic on PPC64 (I haven't checked the sources) that
> uses platform-specific features (say, fused multiply-add instructions),
> then just compiling fdlibm more aggressively wouldn't necessarily make
> up that gap.
>
> To allow cross-platform and cross-release reproducibility, StrictMath 
> is specified to use the particular fdlibm algorithms, which precludes 
> using better algorithms developed more recently. If we were to start 
> with a clean slate today, to get such reproducibility we would specify 
> correctly-rounded behavior of all those methods, but such an approach
> was much less technically tractable 20+ years ago without the benefit of
> the research that has been done in the interim, such as the work of Prof.
> Muller and associates: https://lipforge.ens-lyon.fr/projects/crlibm/.
>
>>
>>
>>> Accumulating the results of the functions and comparing the
>>> sums is not a sufficiently robust way of checking to see if the
>>> optimized versions are indeed equivalent to the non-optimized ones.
>>> The specification of StrictMath requires a particular result for
>>> each set of floating-point arguments, and sums can round away
>>> low-order bits that differ.
>> That's a really good point, thanks for letting me know about that. I'll
>> re-test my change from that perspective.
>>
>>
>>> Running the JDK math library regression tests and corresponding JCK 
>>> tests is recommended for work in this area.
>> Got it. By "the JDK math library regression tests" you mean exactly
>> which test suite? The jtreg tests?
>
> Specifically, the regression tests under test/java/lang/Math and 
> test/java/lang/StrictMath in the jdk repository. There are some other 
> math library tests in the hotspot repo, but I don't know where they 
> are offhand.
>
> A note on methodology: when writing tests for my port, I've
> tried to include test cases that exercise all the branch points in
> the code. Due to the large input space (~2^64 for a single-argument
> method), random sampling alone is an inefficient way to try to find 
> differences in behavior.
>> For testing against JCK/TCK I'll need some help on that.
>>
>
> I believe the JCK/TCK does have additional testcases relevant here.
>
> HTH; thanks,
>
> -Joe