[aarch64-port-dev ] RFR(s) PPC64/s390x/aarch64: Poor StrictMath performance due to non-optimized compilation

Tue Nov 29 18:06:05 UTC 2016

On Tue, Nov 29, 2016 at 5:31 PM, Gustavo Romero
<gromero at linux.vnet.ibm.com> wrote:
> Hi Andrew,
>
> On 29-11-2016 13:35, Andrew Haley wrote:
>> On 29/11/16 09:41, Volker Simonis wrote:
>>> Thanks Gustavo,
>>>
>>> the change looks good.
>>>
>>> So now we're just waiting for another review from one of the aarch64 folks.
>>> Once we have that and the fc-request is approved I'll push the changes.
>>
>> One thing I don't understand:
>>
>> cos 0.17098435541865692 1m7.433s 0.1709843554185943 0m56.678s
>> sin 1.7136493465700289 1m10.654s 1.7136493465700542 0m57.114s
>>
>> Do you know what causes the lower digits to be different?  Is
>> it that Math and StrictMath use different algorithms, not just
>> different optimization levels?
>
> I don't know exactly what the root cause of that difference (in the result) is.
> The difference is not present on x64; however, on PPC64 it exists even with -O0
> (as it is now).
>
> Math methods are intrinsified, but StrictMath methods are not. But I understand
> that Math and StrictMath share the fdlibm code, since I already changed some code
> in fdlibm that affected both Math and StrictMath, so it's not clear to me where
> the Math relaxation occurs on PPC64 (given that such a relaxation is allowed [1]).
>

I think the difference is because Math functions can be intrinsified
(and optimized) while StrictMath functions cannot.

HotSpot has different ways of intrinsifying the Math functions. If the
CPU supports the corresponding operation directly, the VM generates
special nodes for it. Otherwise, if there are special optimized
assembler stubs for a function (e.g. see "StubRoutines::_dsin =
generate_libmSin()" in stubGenerator_x86_64.cpp) the VM makes use of
them. Otherwise it still uses leaf calls into HotSpot's internal
C++ implementation of the functions (e.g. SharedRuntime::dsin() in
sharedRuntimeTrig.cpp), which are faster than doing a native call into
the fdlibm version.
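
Whichever of these paths the VM picks, the visible effect can be
checked from Java by comparing Math.sin against StrictMath.sin over a
range of inputs. The following is only a rough sketch: the class name,
the method name and the sampling step of 0.001 are my own choices for
illustration, not part of the change under review. It reports the
largest gap in units of ulp of the StrictMath result. A gap of 0 means
the VM happens to produce the fdlibm result on that platform.

```java
public class MathVsStrictMath {

    // Largest gap between Math.sin and StrictMath.sin over the sampled
    // inputs, measured in ulps of the StrictMath result.
    // The sampling grid (multiples of 0.001) is an arbitrary choice.
    static double maxUlpGap(int samples) {
        double worst = 0.0;
        for (int i = 1; i <= samples; i++) {
            double x = i * 0.001;
            double strict = StrictMath.sin(x);   // always the fdlibm result
            double relaxed = Math.sin(x);        // may be intrinsified
            double gap = Math.abs(relaxed - strict) / Math.ulp(strict);
            worst = Math.max(worst, gap);
        }
        return worst;
    }

    public static void main(String[] args) {
        System.out.println("max gap in ulps: " + maxUlpGap(100_000));
    }
}
```

Since both Math.sin and StrictMath.sin are specified to stay within
1 ulp of the exact result, the reported gap should remain a small
number of ulps on any conforming VM.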

The implementation in SharedRuntime doesn't have to be "strict", so it
probably uses fused multiply-add and it is also built with full
optimization without '-ffp-contract=off' (which is OK in this case).
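
To illustrate why fused multiply-add alone is enough to change the
low-order digits, here is a small sketch in Java (it assumes Java 9 or
later for Math.fma; the class name and the input values are made up
for the illustration and have nothing to do with the fdlibm sources).
A plain a * b + c rounds twice, once after the multiplication and once
after the addition, while the fused form rounds only once.

```java
public class FmaDemo {

    // Two roundings: the product is rounded to double before the add.
    static double unfused(double a, double b, double c) {
        return a * b + c;
    }

    // One rounding: the exact product a*b is added to c, then rounded.
    static double fused(double a, double b, double c) {
        return Math.fma(a, b, c);
    }

    public static void main(String[] args) {
        // a = 1 + 2^-27, so the exact square is 1 + 2^-26 + 2^-54.
        // The 2^-54 term is lost when the product is rounded to double.
        double a = 1.0 + 0x1.0p-27;
        double c = -1.0;

        System.out.println("unfused: " + Double.toHexString(unfused(a, a, c)));
        System.out.println("fused:   " + Double.toHexString(fused(a, a, c)));
    }
}
```

With these inputs the unfused result is exactly 2^-26, while the fused
result keeps the extra 2^-54 term. A compiler that is allowed to
contract a * b + c into a single instruction can therefore produce
slightly different results from the same source code.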

@Andrew: are you fine with Gustavo's latest version of the change?

> Others much more experienced than I am can surely comment on the difference.
>
>
> Regards,
> Gustavo
>
> [1] https://docs.oracle.com/javase/8/docs/api/java/lang/Math.html
>