RFR(S) 8029302: Performance regression in Math.pow intrinsic

Niclas Adlertz niclas.adlertz at oracle.com
Thu Apr 24 21:18:35 UTC 2014


Thank you Vladimir.

Kind Regards,
Niclas Adlertz

On 04/24/2014 05:20 PM, Vladimir Kozlov wrote:
> Good.
>
> Vladimir
>
> On 4/24/14 2:06 AM, Niclas Adlertz wrote:
>> Yes, http://cr.openjdk.java.net/~adlertz/JDK-8029302/webrev01/
>>
>> I only removed:
>> region_node->init_req(2, if_false);
>>
>> Kind Regards,
>> Niclas Adlertz
>>
>> On 04/23/2014 07:21 PM, Vladimir Kozlov wrote:
>>> On 4/23/14 4:07 AM, Niclas Adlertz wrote:
>>>> Hi Vladimir,
>>>>
>>>>  > Next line is not needed, this edge will be initialized later:
>>>>  >
>>>>  >   region_node->init_req(2, if_false);
>>>> Thanks.
>>>>
>>>>  > And I am not sure that you should skip result check:
>>>>  >
>>>>  >      if (result != result)?  {
>>>>  >        result = uncommon_trap() or runtime_call();
>>>>  >      }
>>>> As I understand it, the reason why we have this check is to see if the
>>>> fast_pow() intrinsic computed a NaN result where we expected a non-NaN
>>>> result.
>>>>
>>>> As I see it, this can happen in two cases:
>>>> 1. When x < 0.0
>>>> 2. When x = NaN and y == 0.
>>>>
>>>> The first case will never happen, since we never call fast_pow with x <
>>>> 0.0.
>>>> For the second case we could add a special case, as you mentioned in
>>>> your previous mail (x**0 = 1).
>>>
>>> Based on your explanations current check placement is good. We will not
>>> need it for (x**0 = 1) too.
>>>
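The two NaN cases discussed above can be reproduced against the public Math.pow contract. This is a minimal sketch, not HotSpot code: the class name NanCases and the helper isNaNBySelfCompare are hypothetical, and the calls show only what Math.pow is specified to return, not what fast_pow() computes internally.

```java
// Illustrative only: NanCases and isNaNBySelfCompare are hypothetical names.
final class NanCases {
    // The self-comparison idiom from the pseudo code: true only for NaN.
    static boolean isNaNBySelfCompare(double v) {
        return v != v;
    }

    public static void main(String[] args) {
        // Negative finite base with a non-integer exponent: specified as NaN.
        System.out.println(Math.pow(-8.0, 0.5));
        // Zero exponent: specified as 1.0, even when the base is NaN.
        System.out.println(Math.pow(Double.NaN, 0.0));
        // NaN base with a non-zero exponent: the result stays NaN.
        System.out.println(Math.pow(Double.NaN, 2.0));
    }
}
```

The second call is the reason x**0 needs either the trailing check or its own special case: a NaN coming out of the fast path would otherwise be returned where 1.0 is expected.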
>>>>
>>>> There might be more cases where fast_pow() can return a NaN result
>>>> (where we expect a non-NaN result) that I haven't spotted. If not, we
>>>> could add a special case for x**0 and move the NaN check from the end
>>>> into the else body of:
>>>> if (x <= 0.0) {
>>>>    long longy = (long)y;
>>>>    if ((double)longy == y) { // if y is long
>>>>      if (y + 1 == y) longy = 0; // huge number: even
>>>>      result = ((1&longy) == 0)?-DPow(abs(x), y):DPow(abs(x), y);
>>>>    } else {
>>>>      // move result != result check here
>>>>    }
>>>> }
>>>>
>>>
>>> Let's consider this when we add additional optimization.
>>>
>>>> I believe we currently do excessive checking of NaN.
>>>> NaN**y where y != 0 should result in NaN, and fast_pow() will return
>>>> NaN here. Despite this, we will still do the result != result check;
>>>> it will be true and we will call the runtime.
>>>
>>> Since NaN is an edge case it may not matter for now. But I agree
>>> that we can add a check x == NaN and call runtime immediately before
>>> calling fast_pow(). If it does not affect performance much (it is an
>>> additional branch) we should go for this change.
>>>
>>>>
>>>> In the case of x**2, I don't see how we can create an unexpected NaN
>>>> result, since the only way we can get a NaN result is NaN**2, which
>>>> should result in NaN anyway.
>>>
>>> Agree.
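The x**2 argument can be spot-checked by comparing x * x with Math.pow(x, 2.0) on special values. This is a minimal sketch under stated assumptions: the class SquareVsPow and its sample list are hypothetical, and the samples are limited to NaN, zeros, infinities and small integers, where the Math.pow specification pins down the result exactly. It makes no claim about arbitrary x.

```java
// Illustrative only: SquareVsPow and SAMPLES are hypothetical names.
final class SquareVsPow {
    static final double[] SAMPLES = {
        Double.NaN, 0.0, -0.0,
        Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY,
        1.0, -1.0, 3.0, -7.0, 1024.0
    };

    // Double.compare treats NaN as equal to NaN and distinguishes -0.0 from 0.0,
    // so it is a stricter comparison than == for this purpose.
    static boolean sameResult(double x) {
        return Double.compare(x * x, Math.pow(x, 2.0)) == 0;
    }

    public static void main(String[] args) {
        for (double x : SAMPLES) {
            System.out.println(x + ": x*x=" + (x * x)
                + " pow=" + Math.pow(x, 2.0)
                + " same=" + sameResult(x));
        }
    }
}
```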
>>>
>>> Do you have latest webrev?
>>>
>>> Vladimir
>>>
>>>>
>>>> Kind Regards,
>>>> Niclas Adlertz
>>>>
>>>> On 04/17/2014 04:14 PM, Vladimir Kozlov wrote:
>>>>> About your changes.
>>>>>
>>>>> Next line is not needed, this edge will be initialized later:
>>>>>
>>>>>   region_node->init_req(2, if_false);
>>>>>
>>>>> And I am not sure that you should skip result check:
>>>>>
>>>>>      if (result != result)?  {
>>>>>        result = uncommon_trap() or runtime_call();
>>>>>      }
>>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>> On 4/17/14 8:45 AM, Vladimir Kozlov wrote:
>>>>>> Niclas,
>>>>>>
>>>>>> Looking at __ieee754_pow() in sharedRuntimeTrans.cpp, I see it has
>>>>>> other simple cases:
>>>>>>
>>>>>> x**0 = 1
>>>>>> x**1 = x
>>>>>> x**-1  = 1/x
>>>>>> x**0.5 = sqrt(x)
>>>>>>
>>>>>> It would be nice to know which are frequently used and implement them
>>>>>> too.
>>>>>>
>>>>>> Also, there is a check for NaN before all these cases except x**0 = 1:
>>>>>>
>>>>>> /* +-NaN return x+y */
>>>>>>
>>>>>> You need to test that the new C2 code produces the same results for
>>>>>> NaN values.
>>>>>>
>>>>>> Thanks,
>>>>>> Vladimir
>>>>>>
>>>>>> On 4/17/14 3:10 AM, Niclas Adlertz wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> webrev: http://cr.openjdk.java.net/~adlertz/JDK-8029302/webrev00/
>>>>>>> bug:    https://bugs.openjdk.java.net/browse/JDK-8029302
>>>>>>>
>>>>>>> We have a performance regression in Math.pow(x,2) on x64, starting
>>>>>>> from 7u40.
>>>>>>> In 7u40 we replaced a call to SharedRuntime::dpow with an intrinsic
>>>>>>> for Math.pow. This is faster in almost all cases,
>>>>>>> except for Math.pow(x,2). (See comments in bug report for more
>>>>>>> info.)
>>>>>>>
>>>>>>> I have added a C2 IR check for Math.pow(x,y) when y == 2, and
>>>>>>> instead
>>>>>>> of calling SharedRuntime::dpow when y == 2, I
>>>>>>> directly do x * x.
>>>>>>>
>>>>>>> I've changed the generated C2 IR,
>>>>>>>
>>>>>>> From (pseudo code):
>>>>>>>
>>>>>>> if (x <= 0.0) {
>>>>>>>    long longy = (long)y;
>>>>>>>    if ((double)longy == y) { // if y is long
>>>>>>>      if (y + 1 == y) longy = 0; // huge number: even
>>>>>>>      result = ((1&longy) == 0)?-DPow(abs(x), y):DPow(abs(x), y);
>>>>>>>    } else {
>>>>>>>      result = NaN;
>>>>>>>    }
>>>>>>> } else {
>>>>>>>    result = DPow(x,y);
>>>>>>> }
>>>>>>> if (result != result)?  {
>>>>>>>    result = uncommon_trap() or runtime_call();
>>>>>>> }
>>>>>>> return result;
>>>>>>>
>>>>>>> To (pseudo code):
>>>>>>>
>>>>>>> if (y == 2) {
>>>>>>>    return x * x;
>>>>>>> } else {
>>>>>>>    if (x <= 0.0) {
>>>>>>>      long longy = (long)y;
>>>>>>>      if ((double)longy == y) { // if y is long
>>>>>>>        if (y + 1 == y) longy = 0; // huge number: even
>>>>>>>        result = ((1&longy) == 0)?-DPow(abs(x), y):DPow(abs(x), y);
>>>>>>>      } else {
>>>>>>>        result = NaN;
>>>>>>>      }
>>>>>>>    } else {
>>>>>>>      result = DPow(x,y);
>>>>>>>    }
>>>>>>>    if (result != result)?  {
>>>>>>>      result = uncommon_trap() or runtime_call();
>>>>>>>    }
>>>>>>>    return result;
>>>>>>> }
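The shape of the new control flow can be sketched in plain Java. This is an illustration under stated assumptions, not the C2 IR: the class PowShape is hypothetical, dpow() stands in for the DPow node using exp(y * log(x)), and runtimeCall() stands in for the runtime path using StrictMath.pow. The sign selection here follows the mathematical convention (an even exponent gives a positive result); the quoted pseudo code lists the two branches the other way round.

```java
// Illustrative only: PowShape, dpow and runtimeCall are hypothetical stand-ins.
final class PowShape {
    // Stand-in for the DPow node; only meaningful for a non-negative base.
    static double dpow(double x, double y) {
        return Math.exp(y * Math.log(x));
    }

    // Stand-in for the slow path (uncommon trap or runtime call).
    static double runtimeCall(double x, double y) {
        return StrictMath.pow(x, y);
    }

    static double pow(double x, double y) {
        if (y == 2.0) {
            return x * x;                       // new fast path: no pow call at all
        }
        double result;
        if (x <= 0.0) {
            long longy = (long) y;
            if ((double) longy == y) {          // y holds an integer value
                if (y + 1 == y) longy = 0;      // huge number: treat as even
                double magnitude = dpow(Math.abs(x), y);
                result = ((1 & longy) == 0) ? magnitude : -magnitude;
            } else {
                result = Double.NaN;
            }
        } else {
            result = dpow(x, y);
        }
        if (result != result) {                 // NaN: let the slow path decide
            result = runtimeCall(x, y);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pow(3.0, 2.0));
        System.out.println(pow(-2.0, 3.0));
        System.out.println(pow(Double.NaN, 0.0));
    }
}
```

Because the y == 2 branch returns before the trailing NaN check, NaN**2 yields NaN directly without a detour through the slow path.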
>>>>>>>
>>>>>>> I have run jtreg tests in jdk/tests/java/lang (with -server, -Xcomp
>>>>>>> and -XX:-TieredCompilation) and run JPRT. No
>>>>>>> problems encountered.
>>>>>>> In particular, java/lang/Math/PowTests passes.
>>>>>>>
>>>>>>> I rewrote the performance test included in the bug report
>>>>>>> (https://bugs.openjdk.java.net/secure/attachment/17807/Main.java)
>>>>>>> as a JMH test:
>>>>>>> http://cr.openjdk.java.net/~adlertz/JDK-8029302/webrev00/MyBenchmark.java
>>>>>>>
>>>>>>> Below are the performance results. The x^2 case is now much faster
>>>>>>> even compared to 7u25. (Since we now skip the call to
>>>>>>> SharedRuntime::dpow)
>>>>>>>
>>>>>>> Numbers from 7u25 b34:
>>>>>>> Iteration   1: 46764.923 ops/ms
>>>>>>> Iteration   2: 46695.196 ops/ms
>>>>>>> Iteration   3: 46647.386 ops/ms
>>>>>>> Iteration   4: 46806.854 ops/ms
>>>>>>> Iteration   5: 46787.259 ops/ms
>>>>>>> Iteration   6: 46788.196 ops/ms
>>>>>>> Iteration   7: 46797.500 ops/ms
>>>>>>> Iteration   8: 46784.237 ops/ms
>>>>>>> Iteration   9: 46782.717 ops/ms
>>>>>>> Iteration  10: 46790.678 ops/ms
>>>>>>> Iteration  11: 46785.139 ops/ms
>>>>>>> Iteration  12: 46798.346 ops/ms
>>>>>>> Iteration  13: 46784.595 ops/ms
>>>>>>> Iteration  14: 46770.963 ops/ms
>>>>>>> Iteration  15: 46789.574 ops/ms
>>>>>>> Iteration  16: 46822.452 ops/ms
>>>>>>> Iteration  17: 46813.571 ops/ms
>>>>>>> Iteration  18: 46747.076 ops/ms
>>>>>>> Iteration  19: 46774.254 ops/ms
>>>>>>> Iteration  20: 46779.329 ops/ms
>>>>>>>
>>>>>>> Result : 46775.512 ±(99.9%) 34.788 ops/ms
>>>>>>>    Statistics: (min, avg, max) = (46647.386, 46775.512, 46822.452),
>>>>>>> stdev = 40.061
>>>>>>>    Confidence interval (99.9%): [46740.725, 46810.300]
>>>>>>>
>>>>>>>
>>>>>>> Numbers from 7u40 b34:
>>>>>>> Iteration   1: 9966.052 ops/ms
>>>>>>> Iteration   2: 9967.683 ops/ms
>>>>>>> Iteration   3: 9967.229 ops/ms
>>>>>>> Iteration   4: 9967.266 ops/ms
>>>>>>> Iteration   5: 9937.091 ops/ms
>>>>>>> Iteration   6: 9966.272 ops/ms
>>>>>>> Iteration   7: 9964.679 ops/ms
>>>>>>> Iteration   8: 9966.326 ops/ms
>>>>>>> Iteration   9: 9964.899 ops/ms
>>>>>>> Iteration  10: 9966.920 ops/ms
>>>>>>> Iteration  11: 9963.278 ops/ms
>>>>>>> Iteration  12: 9967.334 ops/ms
>>>>>>> Iteration  13: 9963.351 ops/ms
>>>>>>> Iteration  14: 9968.032 ops/ms
>>>>>>> Iteration  15: 9964.312 ops/ms
>>>>>>> Iteration  16: 9967.080 ops/ms
>>>>>>> Iteration  17: 9965.114 ops/ms
>>>>>>> Iteration  18: 9966.860 ops/ms
>>>>>>> Iteration  19: 9965.375 ops/ms
>>>>>>> Iteration  20: 9966.215 ops/ms
>>>>>>>
>>>>>>> Result : 9964.568 ±(99.9%) 5.743 ops/ms
>>>>>>>    Statistics: (min, avg, max) = (9937.091, 9964.568, 9968.032),
>>>>>>> stdev = 6.613
>>>>>>>    Confidence interval (99.9%): [9958.826, 9970.311]
>>>>>>>
>>>>>>>
>>>>>>> Numbers from http://hg.openjdk.java.net/jdk9/hs-comp/hotspot without
>>>>>>> the y == 2 check:
>>>>>>> Iteration   1: 9966.775 ops/ms
>>>>>>> Iteration   2: 9964.514 ops/ms
>>>>>>> Iteration   3: 9959.708 ops/ms
>>>>>>> Iteration   4: 9965.501 ops/ms
>>>>>>> Iteration   5: 9958.087 ops/ms
>>>>>>> Iteration   6: 9964.471 ops/ms
>>>>>>> Iteration   7: 9964.966 ops/ms
>>>>>>> Iteration   8: 9965.132 ops/ms
>>>>>>> Iteration   9: 9959.055 ops/ms
>>>>>>> Iteration  10: 9964.666 ops/ms
>>>>>>> Iteration  11: 9965.649 ops/ms
>>>>>>> Iteration  12: 9964.309 ops/ms
>>>>>>> Iteration  13: 9966.963 ops/ms
>>>>>>> Iteration  14: 9956.511 ops/ms
>>>>>>> Iteration  15: 9964.881 ops/ms
>>>>>>> Iteration  16: 9966.927 ops/ms
>>>>>>> Iteration  17: 9951.054 ops/ms
>>>>>>> Iteration  18: 9966.512 ops/ms
>>>>>>> Iteration  19: 9967.041 ops/ms
>>>>>>> Iteration  20: 9967.198 ops/ms
>>>>>>>
>>>>>>> Result : 9963.496 ±(99.9%) 3.760 ops/ms
>>>>>>>    Statistics: (min, avg, max) = (9951.054, 9963.496, 9967.198),
>>>>>>> stdev = 4.330
>>>>>>>    Confidence interval (99.9%): [9959.736, 9967.256]
>>>>>>>
>>>>>>>
>>>>>>> Numbers from http://hg.openjdk.java.net/jdk9/hs-comp/hotspot with
>>>>>>> the
>>>>>>> y == 2 check:
>>>>>>> Iteration   1: 276969.757 ops/ms
>>>>>>> Iteration   2: 276809.529 ops/ms
>>>>>>> Iteration   3: 276621.258 ops/ms
>>>>>>> Iteration   4: 276352.094 ops/ms
>>>>>>> Iteration   5: 276922.865 ops/ms
>>>>>>> Iteration   6: 276617.189 ops/ms
>>>>>>> Iteration   7: 276941.087 ops/ms
>>>>>>> Iteration   8: 276215.547 ops/ms
>>>>>>> Iteration   9: 276118.685 ops/ms
>>>>>>> Iteration  10: 276550.807 ops/ms
>>>>>>> Iteration  11: 276773.424 ops/ms
>>>>>>> Iteration  12: 276871.125 ops/ms
>>>>>>> Iteration  13: 276059.947 ops/ms
>>>>>>> Iteration  14: 277109.329 ops/ms
>>>>>>> Iteration  15: 276910.165 ops/ms
>>>>>>> Iteration  16: 276138.922 ops/ms
>>>>>>> Iteration  17: 276083.749 ops/ms
>>>>>>> Iteration  18: 276367.479 ops/ms
>>>>>>> Iteration  19: 276563.471 ops/ms
>>>>>>> Iteration  20: 276022.425 ops/ms
>>>>>>>
>>>>>>> Result : 276550.943 ±(99.9%) 309.657 ops/ms
>>>>>>>    Statistics: (min, avg, max) = (276022.425, 276550.943,
>>>>>>> 277109.329), stdev = 356.601
>>>>>>>    Confidence interval (99.9%): [276241.286, 276860.600]
>>>>>>>
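The reported averages can be turned into ratios to make the size of the regression and of the fix explicit. This is a minimal sketch: the class Speedup and its constant names are hypothetical, and the four numbers are the "Result" averages quoted above. With those inputs the y == 2 check comes out roughly 27.8 times faster than the same tree without it and roughly 5.9 times faster than 7u25, while 7u40 is roughly 4.7 times slower than 7u25.

```java
// Illustrative only: Speedup and its constant names are hypothetical.
final class Speedup {
    // Average throughput in ops/ms, copied from the results quoted above.
    static final double JDK_7U25 = 46775.512;
    static final double JDK_7U40 = 9964.568;
    static final double JDK9_WITHOUT_CHECK = 9963.496;
    static final double JDK9_WITH_CHECK = 276550.943;

    static double ratio(double faster, double slower) {
        return faster / slower;
    }

    public static void main(String[] args) {
        System.out.printf("with check vs without: %.2fx%n",
            ratio(JDK9_WITH_CHECK, JDK9_WITHOUT_CHECK));
        System.out.printf("with check vs 7u25:    %.2fx%n",
            ratio(JDK9_WITH_CHECK, JDK_7U25));
        System.out.printf("7u25 vs 7u40:          %.2fx%n",
            ratio(JDK_7U25, JDK_7U40));
    }
}
```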


More information about the hotspot-compiler-dev mailing list