[aarch64-port-dev ] RFR: 8189107 - AARCH64: create intrinsic for pow
Dmitrij Pochepko
dmitrij.pochepko at bell-sw.com
Fri Aug 24 10:38:21 UTC 2018
On 24/08/18 11:31, Andrew Haley wrote:
> On 08/23/2018 01:31 PM, Dmitrij Pochepko wrote:
>>
>> On 22/08/18 16:43, Andrew Haley wrote:
>>> On 08/22/2018 11:04 AM, Andrew Dinn wrote:
>>>> Thank you for the revised webrev and new test results. I am now working
>>>> through them.
>>> I wonder about the validity of
>>>
>>> L1X + x*(L2X + x*(L3X + x*(L4X + x*(L5X + x*L6X)))) is calculated as:
>>>
>>> L1X + x*(L2X + x*L3X) + x^3 * (L4X + x*(L5X + x*L6X)),
>>>
>>> where L1X + x*(L2X + x*L3X) and
>>> L4X + x*(L5X + x*L6X) are calculated simultaneously in vector (fmlavs)
>>>
>>> (On the range [0, 0.1716])
>>>
>>>
>>> This transformation looks like a variant of Estrin's scheme, but it's
>>> not quite the same. I can see no convincing reason why it should be
>>> invalid, but its rounding and underflow behaviour will be different
>>> from Horner's scheme. Having said that, the use of fmla should mean
>>> that the error is less than the original code, which didn't use fused
>>> multiply-add at all.
>>>
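For reference, the two evaluation orders can be compared directly in Java. This is only a sketch: the coefficients below are placeholders, not the intrinsic's actual constants, and `Math.fma` stands in for the AArch64 fused multiply-add the stub emits.

```java
// Sketch contrasting plain Horner evaluation with the split form
// described above. Coefficients L1X..L6X are PLACEHOLDERS, not the
// constants used by the intrinsic.
public class PolyEval {
    static final double L1X = 1.0, L2X = 0.5, L3X = 1.0 / 6,
                        L4X = 1.0 / 24, L5X = 1.0 / 120, L6X = 1.0 / 720;

    // Horner's scheme: one serial chain of five dependent fmas.
    static double horner(double x) {
        return Math.fma(x, Math.fma(x, Math.fma(x,
               Math.fma(x, Math.fma(x, L6X, L5X), L4X), L3X), L2X), L1X);
    }

    // Split scheme: two independent degree-2 Horner limbs, which the
    // intrinsic evaluates together in one vector fmla, recombined
    // through x^3 computed in parallel with the limbs.
    static double split(double x) {
        double lo = Math.fma(x, Math.fma(x, L3X, L2X), L1X); // L1X + x*(L2X + x*L3X)
        double hi = Math.fma(x, Math.fma(x, L6X, L5X), L4X); // L4X + x*(L5X + x*L6X)
        return Math.fma(x * x * x, hi, lo);                  // lo + x^3 * hi
    }

    public static void main(String[] args) {
        double x = 0.05; // inside the [0, 0.1716] range mentioned above
        System.out.println(horner(x));
        System.out.println(split(x));
    }
}
```

The two orders agree to within a few ulps on this range; they differ only in rounding and in the shape of the dependency graph.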
>> Well, I suppose the most questionable range is where X is near 0
>> (that is, when the input argument is near 1.0).
>> I created a separate brute-force test (run with -Xcomp), which compares
>> Math.pow with StrictMath.pow over all representable double values
>> within a given range, and it found no differences.
>> I used the input argument range 0.9999...1.0001 (so that the X values
>> in this polynomial are in [0, 0.000049998]). That input argument range
>> contains 1.351079888×10¹² double values, and the results were correct
>> for all of them.
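A brute-force check of this kind can be sketched as below. This is only an illustration, not the actual test: the range and exponent are placeholders chosen so a quick run finishes in milliseconds, whereas the real test walked the full 0.9999...1.0001 range.

```java
// Minimal sketch of a brute-force comparison: step through every
// representable double in [lo, hi] with Math.nextUp and compare
// Math.pow against StrictMath.pow bit-for-bit.
public class PowBruteForce {
    static long mismatches(double lo, double hi, double y) {
        long bad = 0;
        for (double x = lo; x <= hi; x = Math.nextUp(x)) {
            long a = Double.doubleToRawLongBits(Math.pow(x, y));
            long b = Double.doubleToRawLongBits(StrictMath.pow(x, y));
            if (a != b) bad++;
        }
        return bad;
    }

    public static void main(String[] args) {
        // Narrow slice around 1.0 (placeholder range and exponent).
        System.out.println("mismatches: "
                + mismatches(0.999999999999, 1.000000000001, 0.5));
    }
}
```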
> Sure, it's probably fine, but that's not really an error analysis.
>
> I'm curious, though: why did you not use a second-order variant of
> Horner's scheme, with one limb calculating the odd powers and the
> other the even powers, combining them with a final fused multiply-add?
> It would be more conventional, and you'd be using multiply-add at
> every stage, minimizing rounding errors.
>
That approach would require computing x^2 before the first fma can
issue, adding one dependent instruction to the code path. The current
approach performs the extra calculation (x^3) in parallel with the main
fma chain, which is a bit faster (by about one FPU instruction, roughly
1% of the overall code's runtime).
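To make the dependency concrete, here is a sketch (again with placeholder coefficients) of the odd/even second-order Horner variant suggested above: x2 = x*x must complete before either limb's first fma can issue, so it sits on the critical path.

```java
// Odd/even split of L1X + L2X*x + ... + L6X*x^5, with an fma at every
// stage. Coefficients are PLACEHOLDERS, not the intrinsic's constants.
public class OddEvenHorner {
    static final double L1X = 1.0, L2X = 0.5, L3X = 1.0 / 6,
                        L4X = 1.0 / 24, L5X = 1.0 / 120, L6X = 1.0 / 720;

    static double oddEven(double x) {
        double x2 = x * x; // extra dependent instruction before either limb starts
        double even = Math.fma(x2, Math.fma(x2, L5X, L3X), L1X); // L1X + L3X*x^2 + L5X*x^4
        double odd  = Math.fma(x2, Math.fma(x2, L6X, L4X), L2X); // L2X + L4X*x^2 + L6X*x^4
        return Math.fma(x, odd, even);                           // even + x*odd
    }

    public static void main(String[] args) {
        System.out.println(oddEven(0.05));
    }
}
```

Both limbs here depend on x2, whereas in the committed scheme the two limbs depend only on x itself, and x^3 is formed off the critical path.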
Thanks,
Dmitrij
More information about the hotspot-compiler-dev mailing list