RFR: 8189107 - AARCH64: create intrinsic for pow

Andrew Dinn adinn at redhat.com
Fri Aug 17 09:21:33 UTC 2018


On 16/08/18 10:23, Andrew Dinn wrote:
> On 14/08/18 17:24, Dmitrij Pochepko wrote:
>> Benchmarks:
>>
>> I created benchmark to measure Math.pow:
>> http://cr.openjdk.java.net/~dpochepk/8189107/MathBench.java

> Just a preliminary comment before I get back to the code.
> . . .
> I am not clear that it will significantly affect the outcome that the
> two values are equal but the fact that they are all positive is perhaps
> important. It might be worth re-running this benchmark using independent
> ranges for X and Y, including negative values for both:
Ok, let's revise that, now that I have engaged my brain with the maths and,
to a lesser degree, the code. Clearly, negative X values are only
interesting for integral Y values (still worth testing) and negative Y
values are not the most important issue, certainly not as important as
/small/ Y values. However, rather than considering those cases directly
I have provided a more general critique below and indicated the range of
tests that are really needed here.

I suggest you read that critique carefully, especially the Why?
arguments which motivate it. However, to summarize, the upshot is:

you need to provide more and better testing, both to validate the
implementation and to better measure performance differences

you need to provide clearer, more complete and more precise
documentation of the generated asm code, including a new C or pseudocode
algorithm that accurately matches the asm, making it clear how the asm
is structured and how it relates to the original C code

Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

----- 8< -------- 8< -------- 8< -------- 8< -------- 8< -------- 8< ---

There are 4 major things that I think are missing in this patch that
really need covering before we can accept it into the code base and
/commit/ to maintaining it. Without these things being addressed I am
not happy about including what is a much more complex implementation
when the performance gain is a few tens of percent. n.b. an order of
magnitude -- even if it was only for certain important, common cases --
would be much more compelling so one of the tasks is to investigate that
option.

The missing elements are as follows:

1) Validation tests which verify that correct results are obtained in
each of the corner cases laid out in the special case rules in the C
code header.

In cases where there is a specific pair of values which triggers the
relevant case processing, each such combination needs to be checked
and the result compared to the one returned by the current algorithm.

In cases where the rule applies to a range of values (e.g. 0 **
+/-anything or -anything ** integer), a 'representative' range of
input pairs should be used.

So, for example, in the case of 0 ** +/-anything it should be tested
using a few values (e.g. 1.5, 1.7, 255.0, 1000.0, 2.0e-23, 2.7e-120 or
whatever) not covered by the previous case handling.

n.b. those numbers are chosen because, unlike the numbers you chose to
test with, they include exact and inexact even and odd integral and
non-integral representations. This may not matter much here but making a
suitable selection will matter when it comes to driving code through the
various different paths in your main routine and the (mostly)
corresponding paths in the original C code.

Similarly, for -anything ** integer a range of positive and negative
integer Y values should be used in combination with a range of anything
X values to ensure the test exercises as many different paths through
the main body as possible (e.g. large, medium, small and subnormal Y
values). This definitely ought to include cases where overflow and
underflow occur.

Why? Well, we need tests so that we know the new asm is doing the right
thing now and that it stays that way if we need to change the generated
code later.
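
To make this concrete, here is a minimal sketch of the sort of
corner-case check I have in mind (the class name, the chosen values and
the use of StrictMath.pow as a stand-in for the current algorithm are
my illustrative assumptions, not something taken from the patch; the
harness details needed to ensure the intrinsic is actually compiled in
are omitted):

    public class TestPowSpecialCases {
        // Y values that trigger special-case handling, crossed with a
        // spread of X values: exact and inexact, even and odd integral,
        // non-integral, tiny, huge, signed zero, infinities and NaN
        static final double[] SPECIAL_Y = {
            0.0, -0.0, 1.0, -1.0, 0.5, 2.0,
            Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NaN
        };
        static final double[] SAMPLE_X = {
            0.0, -0.0, 1.0, -1.0, 1.5, -1.7, 255.0, -1000.0,
            2.0e-23, -2.7e-120, Double.MIN_VALUE, Double.MAX_VALUE,
            Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NaN
        };

        public static void main(String[] args) {
            for (double x : SAMPLE_X) {
                for (double y : SPECIAL_Y) {
                    double got  = Math.pow(x, y);       // intrinsic, once compiled
                    double want = StrictMath.pow(x, y); // reference: fdlibm algorithm
                    if (!agree(got, want)) {
                        throw new AssertionError("pow(" + x + ", " + y + ") = "
                                                 + got + ", expected " + want);
                    }
                }
            }
        }

        // For pairs covered by the documented special-case rules the result
        // is pinned exactly, so the bit comparison should succeed; for the
        // remaining pairs allow for the 1 ulp latitude both implementations
        // are permitted.
        static boolean agree(double got, double want) {
            if (Double.doubleToLongBits(got) == Double.doubleToLongBits(want))
                return true;
            if (Double.isNaN(got) || Double.isNaN(want)
                || Double.isInfinite(got) || Double.isInfinite(want))
                return false;
            return Math.abs(got - want) <= 2 * Math.ulp(want);
        }
    }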

2) Validation tests which verify that correct results are obtained for a
representative range of non-special case input pairs that exercise the
main routine.

The same requirement applies here as for the -anything ** integer case
above: the tests should attempt to exercise as many paths through the
code as possible by selecting a range of input pairs that will guarantee
full coverage.

Why? Well, clearly this is not going to give us any confidence about
correctness for the whole range of potential inputs, but exercising all
paths through the code in at least one case will find any obvious
blunders in the current code and ensure we don't introduce any if we
need to change the generated code later.
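
Again purely as a sketch (the pair generation and the 2 ulp tolerance
are my choices, not the patch's), something along these lines would at
least push inputs through the full spread of exponents and signs,
including integral Y values combined with negative X and cases that
overflow or underflow:

    import java.util.SplittableRandom;

    public class TestPowPathCoverage {
        public static void main(String[] args) {
            SplittableRandom r = new SplittableRandom(42);
            for (int i = 0; i < 1_000_000; i++) {
                // random bit patterns cover subnormal, huge, tiny and
                // negative X values ...
                double x = Double.longBitsToDouble(r.nextLong());
                // ... and every third Y is a small integer so that the
                // -anything ** integer paths are exercised as well
                double y = (i % 3 == 0) ? r.nextInt(-60, 60)
                                        : Double.longBitsToDouble(r.nextLong());
                if (Double.isNaN(x) || Double.isNaN(y)) continue;
                double got  = Math.pow(x, y);
                double want = StrictMath.pow(x, y);
                boolean ok =
                    Double.doubleToLongBits(got) == Double.doubleToLongBits(want)
                    || (!Double.isNaN(got) && !Double.isNaN(want)
                        && !Double.isInfinite(got) && !Double.isInfinite(want)
                        && Math.abs(got - want) <= 2 * Math.ulp(want));
                if (!ok) {
                    throw new AssertionError(x + " ** " + y + " = " + got
                                             + ", expected " + want);
                }
            }
        }
    }

Selecting the pairs explicitly, path by path, would of course be better
than relying on random bit patterns, but even this would be a step up
from the current situation.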

3) Performance tests which compare performance of this implementation
against the performance obtained using the C code for a selection of
different ranges of possible input pairs, each such range benchmarked
independently.

The problem with the existing 'benchmark' is that it doesn't cover a lot
of the potential range of inputs and it lumps all computations together
even when they might be exercising different paths through the main
routine. This might be worth retaining to compute some sort of average
case (although the selectivity in the input values means it is unlikely
to do that very well). What it does not do is identify specific cases
that exercise individual paths through the two codes, where the new
one may improve on (or, perhaps, perform worse than) the old one by
some amount that varies path by path, i.e. we don't really know the
best or worst case improvement of this algorithm.

So, for example: it would be useful to know what the performance
difference is when exercising the algorithm for specific X values with a
range of low integral Y values (say 0 < Y < 30, or -5 < Y < 5). It would
probably be useful to test what happens when using subnormal X or very
large X, paired with very small Y. It might also be interesting to know
what happens when using values of X close to 1 and very large Y.
Clearly, there is a limited number of paths through each
implementation's main routine so the test cases should probably be
chosen to ensure specific paths are taken, highlighting what happens
when we exercise that path.
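
For illustration, a per-range benchmark could look something like the
JMH sketch below (the parameter names and value ranges are purely my
own guesses at interesting classes of input, whether this ends up as a
new benchmark or as an extension of the existing MathBench), so that
each class of input is reported as a separate score rather than being
folded into one average:

    import java.util.concurrent.ThreadLocalRandom;
    import org.openjdk.jmh.annotations.*;

    @State(Scope.Thread)
    public class PowRangesBench {
        @Param({"SMALL_INT_Y", "SUBNORMAL_X",
                "LARGE_X_TINY_Y", "X_NEAR_ONE_HUGE_Y"})
        public String range;

        public double[] xs = new double[1024];
        public double[] ys = new double[1024];
        public int next;

        @Setup
        public void setup() {
            ThreadLocalRandom r = ThreadLocalRandom.current();
            for (int k = 0; k < xs.length; k++) {
                switch (range) {
                case "SMALL_INT_Y":        // low integral powers, -5 <= Y < 30
                    xs[k] = r.nextDouble(-1000.0, 1000.0);
                    ys[k] = r.nextInt(-5, 30);
                    break;
                case "SUBNORMAL_X":        // subnormal X, modest Y
                    xs[k] = Double.MIN_VALUE * r.nextInt(1, 1 << 20);
                    ys[k] = r.nextDouble(-2.0, 2.0);
                    break;
                case "LARGE_X_TINY_Y":     // very large X, very small Y
                    xs[k] = r.nextDouble(1.0e300, 1.0e308);
                    ys[k] = r.nextDouble(1.0e-300, 1.0e-290);
                    break;
                case "X_NEAR_ONE_HUGE_Y":  // X close to 1, very large Y
                    xs[k] = 1.0 + r.nextDouble(-1.0e-9, 1.0e-9);
                    ys[k] = r.nextDouble(1.0e15, 1.0e18);
                    break;
                }
            }
        }

        @Benchmark
        public double pow() {
            int k = (next++) & (xs.length - 1);
            return Math.pow(xs[k], ys[k]);
        }
    }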

Why? The performance gain reported by the current benchmark is not
necessarily revealing of what happens in the most interesting cases and
is not necessarily representative of what we might expect in real cases.
An argument that derives from clearer benefits in some common, special
cases (e.g. relatively low integer powers) or in less common cases where
performance might still really matter for certain specific algorithms
(e.g. subnormal handling) might make a more compelling case for taking on
the maintenance burden. Of course, a much smaller gain for those cases
might sway the argument the other way.

4) The comments in the code explain some of what is happening but they
do not explain the algorithm being used clearly enough for it to be
possible to understand and maintain the code as is.

This problem is mitigated to a small degree by the inclusion of the
original C code but, unfortunately, that code is not really good enough.
It presents a specific encoding of the mathematics which justifies this
algorithm but one which has significant differences to the encoding
employed in the generated asm. There are several things at play here so
let me articulate them separately.

Firstly, the C code is operating on a different data representation to
the generated asm. It uses two pairs of 32-bit signed words hx, hy and
ix, iy for the high 32 bits of X and Y and high 31 bits of X and Y,
respectively and uses a pair of unsigned words lx, ly for the low 32
bits of X and Y. The generated code operates on 64 bit values in 64 bit
registers.
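
For reference, the correspondence between the two representations is
straightforward (a throwaway Java sketch, with variable names following
the C code):

    static void splitWords(double x) {
        long bits = Double.doubleToRawLongBits(x); // what the asm keeps in one register
        int hx = (int) (bits >>> 32);   // signed high word: sign, exponent, top 20 fraction bits
        int ix = hx & 0x7fffffff;       // high word with the sign bit cleared
        int lx = (int) bits;            // low 32 fraction bits (used as unsigned in the C code)
        System.out.printf("hx=%08x ix=%08x lx=%08x%n", hx, ix, lx);
    }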

That does not in itself invalidate comparisons of the two algorithms
given that all the relevant bits contained in ix, iy, hx, hy, lx and ly
are contained in the 64 bit X and Y registers. However, since the asm
instructions in the generated code operate on the whole, signed
enchilada and the C code operates independently on the upper and lower
halves in signed or absolute versions, there is a significant
divergence in how individual mathematical steps are encoded as C vs asm
instructions. If nothing else, comparisons differ as to their signedness
and often operate in two stages in the C code instead of one in the asm.
Tests also get performed in different orders as a result of this
different representation. That makes the C code very far from clear as
an explanation of what the asm is doing.

Secondly, the decision tree used to handle the various corner cases
mentioned in the header has been significantly re-arranged in the
generated code. The C code tests for and handles several of the corner
cases in a very specific order that matches the declaration of the
corner cases, i.e. it first checks for

  y == +/-0 ==> return special result
  y == NaN || x == NaN ==> return special result

then it computes yisint in { non_integer, odd_integer, even_integer }
then it tests for and handles these corner cases

  y == +/-Inf, x == 1 ==> return special result
  y == +/-Inf, x < 1 ==> return special result
  y == +/-Inf, x > 1 ==> return special result

  y == +/-1 ==> return special result
  y == 2 ==> return special result
  y == 0.5 ==> return special result

and so on

The generated code uses an alternative, branching decision tree which, I
admit, looks to be an improvement. However, as a result the original C
code now fails to provide an explanatory algorithm for what has been
implemented.

I'll grant that the block comments at each branch point do make it clear
what cases are being considered in each subtree of the generated code.
However, it would help enormously if the original C code was
restructured to provide a new algorithm that reflected the alternative
control flow that has actually been implemented (if preferred, the
alternative algorithm could be provided as pseudo-code).

This would have two benefits. Firstly, it would provide a much better
summary of what the assembly code is doing. Secondly, it would allow the
new control flow to be compared with the old one (both versions are
needed), making it clearer whether the revised control flow really
catches and handles the various corner cases in a way that is equivalent
to the original C code.

Providing a new C or pseudo-code algorithm would also help with the
remaining serious difficulty: clarifying the asm in cases where the
tests in the original C code are very different to those made in the
asm. Let's consider an example that makes this very clear.

In the generated code a branch to X_IS_SPECIAL gets taken when |X| is
NaN or Inf and Y is not +/-0, 1, NaN or +/-Inf.

    // Assume Y == NaN/INF already checked
    bind(X_IS_SPECIAL);
      cmp(zr, X, LSL, 12);
      br(NE, RET_NAN);
      // X is +/- INF
      // Already covered: Y == +/-0, 1, NaN and INF
      // Y > 0 => return +INF
      // Y < 0 => return +0
      tbnz(X, 63, X_IS_MINF_Y_IS_NUMBER);
      . . .

If the tbnz branch to X_IS_MINF_Y_IS_NUMBER is taken we reach the
following code:

    bind(X_IS_MINF_Y_IS_NUMBER);
      // tmp2 = unbiased exponent
      rbit(tmp4, rscratch2);
      orr(tmp4, tmp4, 0x800);
      mov(tmp5, 52);
      clz(tmp4, tmp4);
      sub(tmp10, tmp2, 1023);             // biased exponent
      sub(tmp4, tmp5, tmp4);
      cmp(tmp10, tmp4);
      br(EQ, X_IS_MINF_Y_IS_ODD);
      // X == -INF && "Y < 0 but not a finite odd integer" => return +0
      // X == -INF && "Y > 0 but not a finite odd integer" => return +INF
      tbz(rscratch2, 63, RETURN_PINF);
      fmovd(v0, 0.0d);
      ret(lr);

Now the C code for handling this case has already computed whether or
not Y is an even int, an odd int or not integral using a very different
test. First it compares iy against 0x43400000 to see if the Y value has
a biased exponent greater than 1023 + 52 (i.e. at least 0x434) or,
equivalently, an unbiased exponent > 52. This guarantees that Y is
even, whatever is in the 52
fractional bits. Failing that it compares iy against 0x3ff00000 i.e.
sees if the exponent of Y is >= 0 (obviously, necessary for the value to
be integral) and then does some clever bit shifting on ly or iy using
that exponent to see if there are any bits in the lower or upper word
that would make y non-integral or odd.
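
Transcribed into Java purely for illustration (this mirrors the shape
of the C code's test rather than copying it verbatim), that computation
amounts to:

    // 0 = y is not an integer, 1 = y is an odd integer, 2 = an even one;
    // as in the C code, y == 0, NaN and Inf are assumed to have been
    // handled before this is reached
    static int yIsInt(double y) {
        long bits = Double.doubleToRawLongBits(y);
        int iy = (int) (bits >>> 32) & 0x7fffffff;  // high word, sign cleared
        int ly = (int) bits;                        // low word, raw bits
        if (iy >= 0x43400000) return 2;  // |y| >= 2^53: ulp >= 2, always even
        if (iy < 0x3ff00000) return 0;   // |y| < 1: cannot be a non-zero integer
        int k = (iy >> 20) - 0x3ff;      // unbiased exponent, 0..52
        if (k > 20) {                    // units bit lies in the low word
            int shift = 52 - k;
            int j = ly >>> shift;
            if ((j << shift) != ly) return 0;  // fraction bits below the units bit
            return 2 - (j & 1);                // parity of the units bit
        } else {                         // units bit lies in the high word
            if (ly != 0) return 0;
            int shift = 20 - k;
            int j = iy >>> shift;
            if ((j << shift) != iy) return 0;
            return 2 - (j & 1);
        }
    }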

Clearly, the generated asm above is using a somewhat different
algorithm even though it can be correlated with some of the C code (for
one thing, it
is only checking for y both odd and integral).

The rbit is used to allow trailing zeroes in the fractional part to be
counted using clz (count leading zeroes). The orr, which sets bit 11 of
the reversed value of Y, ensures that clz never counts leading zeroes
beyond bit 11 -- equivalently, it only counts trailing zeroes in the
52-bit fractional part of Y, capping the count at 52. So, if the value
returned by clz, let's call it T, is the number of trailing zero bits
in the fractional part (capped at 52) then, clearly, for Y to be odd
the unbiased exponent has to be exactly (52 - T), i.e. to verify that Y
is odd and integral we need to test that:

  (biased_exp - 1023) == 52 - T                       // T in range 0..52
                      == 52 - min(trailing_zeroes(Y), 52)
                      == 52 - trailing_zeroes(Y | (1 << 52))
                      == 52 - leading_zeroes(reverse(Y) | (1 << 11))

Obviously, a minor variant of this can be used to detect even integer
values of Y and non-integer Y.
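
Again only as an illustrative sketch (not code from the patch), the
64-bit test the asm performs is, in Java:

    // true iff y is an odd integer; the exponent comparison also comes
    // out false for the zero, subnormal, NaN and infinite values that
    // the asm has already peeled off before it reaches this test
    static boolean isOddInteger(double y) {
        long bits = Double.doubleToRawLongBits(y);
        int biasedExp = (int) ((bits >>> 52) & 0x7ff);
        // T = number of trailing zeroes in the 52-bit fraction, capped
        // at 52; setting bit 52 provides the cap, just as
        // orr(tmp4, tmp4, 0x800) does on the bit-reversed value before the clz
        int t = Long.numberOfTrailingZeros(bits | (1L << 52));
        // equivalently: Long.numberOfLeadingZeros(Long.reverse(bits) | (1L << 11))
        return (biasedExp - 1023) == (52 - t);
    }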

Why? An explanation of what code like the above example is doing and why
it is correct is critical to allowing whoever has to maintain it to
understand how it works without having to put days of analysis in
whenever they have to look at it. Without it this will be a fly trap
(exactly as I explained wrt the sin and cos code). The original C code
provides almost no help here. A pseudo-code algorithm for the asm is the
minimum that is needed. Tricks like the one above require an explanatory
comment that provides the arithmetical/mathematical background
explaining what the pseudo-code is doing and why.

