RFR: 8189107 - AARCH64: create intrinsic for pow
Andrew Dinn
adinn at redhat.com
Fri Sep 21 15:01:57 UTC 2018
Hi Dmitrij,
As promised here is the review (after sig) based on webrev.02. The
review first describes the problems I have identified. It then continues
with recommendations for (extensive) rework. Since it is based on
webrev.02 you will have to make some allowance for the changes you
introduced in your webrev.03.
I have revised your webrev to include fixes for the two errors I
identified and a new version is available here
http://cr.openjdk.java.net/~adinn/8189107/webrev.03/
The webrev includes my recommended fix to the scalbn code in
macroAssembler_pow.cpp and a correction to the declaration of table
_pow_coef1 in stubRoutines_aarch64.cpp. I explain these changes below.
I have also uploaded a commented version of the original algorithm and a
revised algorithm based on your code here
http://cr.openjdk.java.net/~adinn/8189107/original_algorithm.txt
http://cr.openjdk.java.net/~adinn/8189107/revised_algorithm.txt
You have seen the original algorithm modulo a few tweaks that emerged in
creating the revised version. The revised algorithm /faithfully/ models
your webrev.02 code (with a couple of notable exceptions that relate to
problems described below). That faithful modelling of the code includes
retaining the order of computation of all values. In particular, it
models early computation of some data that you appear to have introduced
in order to pipeline certain operations.
At the same time, the algorithm also introduces a much more coherent
control structure (inserting 'if then else' in place of GOTO everywhere
it is possible) and a nested /block structure/ (none of this required
reordering btw). It profits from this to introduce block local
declarations which scope the values computed and used at successive
steps. As far as possible the revised algorithm employs the same naming
convention as the original algorithm (I'll explain more on that in the
detailed feedback below).
Why does this matter? Well, once the errors get fixed, by far the
biggest remaining problems with the existing code are 1) its unclear
control structure and 2) its incoherent allocation of data to registers.
The intention of providing block structure and block local use of
variables in the revised algorithm is to enable introduction of similar
block structuring and block local declarations for registers and branch
labels in the generator code. In particular, that will allow the
generator code to be rewritten to use symbolic names for registers
consistently throughout the function.
So, a small number of register mappings for variables that are global to
the algorithm will need to be retained at the start of the function.
However, they will use meaningful names like x, exp_mask, one_d, result
etc instead of the completely meaningless aliases tmp1, tmp2 etc that
you provided. Similarly, some of the label declarations (mostly for exit
cases) will need to be retained at the top level.
However, most register mappings will be for variables that are local to
a block. So, they will need to be declared at the start of that block
making it clear where they are used and where they are no longer in use.
Similarly, most label declarations will only need to be made at
the start of the immediately enclosing block that generates the code
identified by the label. This will ensure declarations are close to
their point of use and are not in scope after they become redundant (or
at least not for very long).
Register mappings for variables declared in an outer block that are live
across nested blocks will then be able to be used consistently in those
inner blocks while clearly identifying precisely what values are being
generated, used or updated. The same applies for label declarations.
They can be used as branch targets in nested blocks but will not be
visible in outer scopes or sibling scopes.
Where possible the revised algorithm employs the same naming convention
as the original algorithm for the values it operates on -- with one
important modification. A suffix convention has been adopted to clarify
some ambiguities present in the original. For example, this allows us to
distinguish the double value x from its long bits representation x_r,
its 32 bit high word x_h or its 32 bit positive high word ix_h. The
algorithm also employs block local declarations to scope intermediate
values, employing names starting with 'tmp'. These are often introduced
in small, local blocks, allowing the same names tmp1, tmp2 etc to be
reused without ambiguity.
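In generator-code terms the same idea looks something like this (a
minimal sketch; the particular registers are hypothetical):

  {
    Register tmp1 = r10; // scratch values local to this step
    Register tmp2 = r11;
    // ... instructions for this step ...
  }
  {
    Register tmp1 = r12; // same name reused in a sibling block --
    // ... no ambiguity, since the two scopes do not overlap ...
  }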
So, I hope it is clear how you can use this algorithm to rewrite the
generator code to be much more readable and maintainable -- that being
the ultimate goal here. I'm not willing to let the code be committed
without this restructuring taking place.
regards,
Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
Problems
--------
1) Error in Y_IS_HUGE case
The vector maths that computes the polynomial for this case expects to
load the first 6 coefficients in table _pow_coef1 in the permuted order
(0.25, -ivln2, -0.333333333333, 0.5, ivln2_l, ivln2_h). However, the
table declaration in stubRoutines_aarch64.cpp declares them in the order
(-0.333333333333, 0.25, 0.5, -ivln2, ivln2_l, ivln2_h).
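For concreteness, here is a sketch of the first six entries in the order
the vector loads expect (ivln2 etc name the usual fdlibm constants; the
actual declaration in stubRoutines_aarch64.cpp will differ in type and
alignment):

  static const double pow_coef1_prefix[] = {
      0.25,             // loaded as the first 2d pair ...
      -ivln2,           // ... together with -1/ln2
      -0.333333333333,  // loaded as the second 2d pair ...
      0.5,
      ivln2_l,          // lo part of 1/ln2
      ivln2_h           // hi part of 1/ln2
  };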
2) scalbn implementation is wrong
The original code used this algorithm:
1  if (0 >= (j_h >> 20))
2    tmp1 = n' - 1023             // bias n as an exponent
3    tmp2 = (tmp1 & 0x3ff) << 52  // install as high bits
4    two_n_d = LONG_TO_DOUBLE(tmp2)
5    z = z * two_n_d              // rescale
6    b(CHECK_RESULT_NEGATION);
The problems are:
line 1: as you spotted the test was wrongly implemented using cmp not
cmpw (j_h is computed using an int add so sign extending as a long fails
to retain the overflow bit).
line 2: n is the unbiased exponent -- rebiasing it requires adding 1023,
not subtracting it.
line 3: when we hit underflow the unbiased exponent is in the range
[-1075, -1024]. So, after correcting the sub to an add the exponent in
tmp1 will be negative (that is precisely the case the if test looks
for). Installing the exponent bits of this negative value into the high
bits of a double gives a double with unbiased exponent in range
[970, 1023] i.e. a very high positive power of 2 rather than a very low
negative one. The result is off by about 300 orders of magnitude!
line 5: as you spotted, the multiply updates the register storing z
rather than installing the result in v0.
I explain all this not to point out why it is wrong but to show how your
original version can be salvaged with a few small changes. Basically we
want to multiply z by 2^n to get the result, where n lies between -1075
and -1023. Depending on the values of z and n the result will be either
a subnormal double or +0. So, the simplest solution is to do the
multiply in two stages. Here is a revised algorithm:
if (0 >= (j_h >> 20)) {
    long two_n_r    // power of 2 as long bits, mapped to r2
    long biased_n   // n + 1023, mapped to r2
    double two_n_d  // used to rescale z to underflow value,
                    // mapped to v17
    // split the rescale into two steps: 2^-512 then the rest
    n = n + 512
    two_n_r = 0x1ff << 52               // high bits for 2^-512
    two_n_d = LONG_TO_DOUBLE(two_n_r)
    z' = z' * two_n_d
    biased_n = n + 1023                 // bias remainder -- will now be positive!
    two_n_r = (biased_n & 0x3ff) << 52  // high bits for 2^n
    two_n_d = LONG_TO_DOUBLE(two_n_r)
    result = z' * two_n_d
} else {
    ...
The code for this is:

  cmpw(zr, tmp10, ASR, 20);
  br(LT, NO_SCALBN);
  // n = tmp1
  // rescale first by 2^-512 and then by the rest
  addw(tmp1, tmp1, 512);    // n -> n + 512
  movz(tmp2, 0x1FF0, 48);
  fmovd(v17, tmp2);         // 2^-512
  fmuld(v18, v18, v17);     // z = z * 2^-512
  addw(tmp2, tmp1, 1023);   // biased exponent
  ubfm(tmp2, tmp2, 12, 10);
  fmovd(v17, tmp2);         // 2^n
  fmuld(v0, v18, v17);      // result = z * 2^n
  b(CHECK_RESULT_NEGATION);
  bind(NO_SCALBN);
  . . .
I think this is simpler than your alternative. I checked it on several
test cases and it agrees with the existing StrictMath code.
3) Use of Estrin form in place of Horner form
I would prefer not to use this without a proof that the re-ordered
computation does not introduce rounding errors. I doubt that it will and
I suspect that even if it did the error will be very small, certainly
small enough that the leeway between what is expected of StrictMath.pow
and what is allowed for in Math.pow will not cause anyone's expectations
to be challenged. However, even so I'd prefer not to surprise users.
Anyway, if Andrew Haley really wants this to be adopted then I'll accept
his override and you can leave it in Estrin form.
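To make the rounding concern concrete, here is an illustrative sketch
(not the intrinsic's actual polynomial) of the two evaluation orders for
a cubic:

  // Horner form: strictly sequential, a single rounding order
  double horner(double x, double c0, double c1, double c2, double c3) {
      return ((c3 * x + c2) * x + c1) * x + c0;
  }

  // Estrin form: reassociated so the independent multiplies can overlap
  // in the pipeline; intermediate roundings may differ from Horner
  double estrin(double x, double c0, double c1, double c2, double c3) {
      double x2 = x * x;
      return (c3 * x + c2) * x2 + (c1 * x + c0);
  }

Algebraically the two agree; the question is whether the different
intermediate roundings can ever change the final result.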
4) Repetition of code in K_IS_0/K_IS_1 branches
In the Y_IS_NOT_HUGE block, 15 and 17 instructions respectively are
generated in the if and else branches for k == 0 and k == 1,
implementing almost exactly the same computation. The two generated sequences differ very
slightly. In the k == 1 case dp_h and dp_l need to be folded into the
computation to subtract ln2(1.5) from the result while in the k == 0
case dp_h and dp_l are both 0.0 and can be ignored.
The most important difference is the need to load dp_l/dp_h from the
coefficient table in one branch while the other merely moves forward the
cursor which points at the table. The other differences consist of two
extra operations in the k == 1 branch, an extra fmlavs and an extra
faddd, which fold in the dp_l and dp_h values.
An alternative would be to use common code for the computation which
always performs the extra fmlavs and faddd. The revised algorithm
describes precisely this alternative. To make it work the k == 0 branch
needs to install 0.0 in the registers used for dp_h and dp_l (not
necessarily by loading from the table). This shortens the branches,
relocating 15 common instructions after the branch.
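A minimal sketch of that shape in generator code (the register and label
names here are hypothetical):

  Label K_IS_1, K_DONE;
  cbnz(k, K_IS_1);
  // k == 0: neutral dp values; just step the cursor past dp_l/dp_h
  fmovd(dp_l, 0.0);
  fmovd(dp_h, 0.0);
  add(coef_ptr, coef_ptr, 16);
  b(K_DONE);
  bind(K_IS_1);
  // k == 1: load dp_l/dp_h from the coefficient table
  ldpd(dp_l, dp_h, post(coef_ptr, 16));
  bind(K_DONE);
  // the 15 common instructions are then generated exactly once here,
  // including the fmlavs/faddd that fold in dp_l/dp_h (harmless when
  // both are 0.0)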
As far as code clarity is concerned it is easier to understand and
maintain if the common code is generated only once. As for performance,
I believe this trade off of a few more cycles against code size is also
a better choice. Shaving instructions may give a small improvement in a
benchmark, especially if the benchmark repeatedly runs with values that
exercise only one of the paths. However, in real use the extra code size
from the duplicated code is likely to push more code out of cache. Since
this is main path code, that is actually quite likely to happen.
So, I would like to see the duplication removed unless you can make a
really strong case for keeping it. If you can, then the revised
algorithm needs to explain why the duplication is required, and both the
code and the algorithm need to be updated to specify both paths.
5) Repetition of odd/even tests in exit cases.
The original algorithm takes the hit of computing the even/odd/fraction
status of y inline, i.e. in the main path, during special case checks.
That happens even if the result is not used later. You have chosen to do
it at the relevant exits, resulting in more repeated code.
These cases will likely be a rare path so the issue of extra code size
is not going to be very important relative to the saved cycles. However,
the replication of inline code is a maintenance issue.
It would be better to use a separate function to generate the common
code that computes the test values (lowest non-zero bit index and
exponent of y), avoiding any need to read, check and fix the generator
different places. Please update the code accordingly.
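Something along these lines would do (a sketch only; the name, registers
and exact extraction are assumptions):

  // emit the common code the special-case exits need: the biased
  // exponent of y and the index of the lowest set bit of its long bits
  static void generate_y_parity_values(MacroAssembler *masm, Register y_r,
                                       Register y_exp, Register y_lsb) {
    masm->ubfx(y_exp, y_r, 52, 11); // extract the biased exponent field
    masm->rbit(y_lsb, y_r);         // bit-reverse so clz can find ...
    masm->clz(y_lsb, y_lsb);        // ... the lowest non-zero bit index
  }

Each exit can then branch on y_exp and y_lsb as it requires.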
6) Test code
You need to include a test as part of this patch which checks the
outputs are correct for a few selected inputs that exercise the
underflow and Y_IS_HUGE branches. I adapted the existing Math tests to
include these extra checks:
  public static int testUnderflow() {
      int failures = 0;
      failures += testPowCase(1.5E-2D, 170D, 8.6201344461973E-311D);
      failures += testPowCase(1.55E-2D, 171.3D, 1.00883443217485E-310D);
      // failures += testPowCase(150D, 141.6D, 1.3630829399139085E308);
      return failures;
  }

  public static int testHugeY() {
      int failures = 0;
      long yl = 0x1_1000_0000L;
      failures += testPowCase(1.0000000001, (double)yl,
                              1.5782873649891997D);
      failures += testPowCase(1.0000000001, (double)yl + 0.3,
                              1.5782873650365483D);
      return failures;
  }
You don't have to add this to the existing math test suite. A simple
standalone test which checks that the StrictMath and Math outputs agree
on these inputs is all that is needed.
Rework
------
1) Please check that the revised algorithm I have provided accurately
reflects the computations performed by your code. That will require
changing the code to deal with error cases 1, 2, 4, 5 and 6 above. If
you stick with the Estrin form in case 3 then ensure the algorithm is
correct. If you stick with Horner form then update the algorithm so it
is consistent.
The algorithm currently details all mappings of variable to registers.
That is provided as a heuristic which i) allowed me to track usages when
writing up the algorithm and ii) will allow you to analyze the current
register mappings and rationalize them. Once you have a more coherent
strategy for allocating variables to registers, details of the mapping
can and should be removed.
As mentioned, the algorithm uses a suffix notation to distinguish
different values where there is some scope for ambiguity. The suffixes
are used as follows:
1) '_d' double values (mapped to d registers)
2) '_hd' and '_ld' hi/lo double pairs used to retain accuracy (mapped
to independent d registers)
3) '_d2' double vectors (mapped to 2d v registers)
4) '_r' long values that represent the long version of a double value
(mapped to general x registers)
5) '_h' int values that represent the high word of a double value
(mapped to general w registers)
In many unambiguous cases a suffix is omitted e.g. x, y, k, n.
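Mapped onto register declarations the convention reads like this (a
sketch; the particular register choices are illustrative):

  FloatRegister z_d = v18;  // '_d'  : a double value
  FloatRegister t_d2 = v19; // '_d2' : a 2d vector of doubles
  Register x_r = r0;        // '_r'  : the long bits of double x
  Register x_h = r2;        // '_h'  : the high word of double x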
2) Sort out inconsistent mappings and unnecessary remappings
One of the problems I faced in writing up the algorithm is that some of
your register use appears to be inconsistent -- the same 'logical'
variable is mapped to different registers at different points in the code.
In some cases, this reflects different use of the same name for
different quantities calculated at different stages in the algorithm
(for example, z is used to name both log2(x) and then reused later for
the fractional part of log2(x)). Most of those reuses are actually
catered for by declaring the var in one block and then redeclaring it in
a different sibling block. If this block structure is replicated in the
code then it will enable z to be mapped differently for each different
scope. However, that's not the preferred outcome. It would make the
code easier to follow if every variable named in the algorithm was
always mapped to the same register unless there was a clear need to
relocate it.
There are also cases where a remapping is performed (without any
discernible reason) within the same scope or within nested scopes! For
example, after the sign of x_r has been noted (and, possibly, its sign
bit removed) the resulting absolute value of x_r is moved from r0 to r1
(using an explicit mov). There doesn't seem to be any need to do this.
Likewise, in the COMPUTE_POW_PROCEED block the value for z stored in v0
is reassigned to v18 in the same block!
I have flagged these remappings with a !!! comment and avoided the
ambiguity that arises by adding a ' to the remapped value (x_r', z') and
using it thereafter. This ensures that uses of the different versions of
the value located in different registers can be distinguished.
An example of remapping in nested scope is provided by p_hd and p_ld. At
the outermost scope these values are mapped to v5 and v6. However, in
the K_IS_SET block they are stored in v17 and v16 (values for v and u,
local to the same block, are placed in v5 and v6 so there is no reason
why the outer scope mapping could not have been retained).
I'd like to see a much more obvious and obviously consistent plan
adopted for mappings before the code is committed.
3) Insert my original and revised algorithms into your code in place of
the current ones.
4) Change the code to introduce local blocks as per the revised algorithm
This should mostly be a simple matter of introducing braces into the
code at strategic locations (although please re-indent).
5) Change the code to use symbolic names for register arguments and
declare those names as Register or FloatRegister in the appropriate
local scope
The main point of employing consistent, logical names for values defined
in the algorithm is to allow registers employed in the code to be
identified using meaningful names rather than using r0, r1, v0, v1,
tmp1, tmp2 etc.
So, for example, at the top level of the routine you need to declare
register mappings for the function global variables and then use them in
all the generated instructions, e.g. the code should look like:
  // double x // input arg
  // double y // input arg
  FloatRegister x = v0;
  FloatRegister y = v1;
  // long y_r // y as long bits
  Register y_r = rscratch2;
  // long x_r // x as long bits
  Register x_r = r0;
  // double one_d // 1.0d
  FloatRegister one_d = v2;
  . . .
  // y_r = DOUBLE_TO_LONG(y)
  fmovd(y_r, y);
  // x_r = DOUBLE_TO_LONG(x)
  fmovd(x_r, x);
  . . .
Similarly, in nested blocks you need to introduce block local names for
registers and then use them in the code. For example, in the
SUBNORMAL_HANDLED block:
  bind(SUBNORMAL_HANDLED);
  // block introduced to scope vars/labels in this region
  {
    Label K_IS_SET;
    // int ix_h // |x| with exponent rescaled so 1 <= ix < 2
    Register ix_h = r2;
    // int k // range 0 or 1 for poly expansion
    Register k = r7;
    // block introduced to scope vars/labels in this subregion
    {
      // int x_h // high word of x
      Register x_h = r2;
      // int mant_x_h // mantissa bits of high word of x
      Register mant_x_h = r10;
      . . .
      // x_h = (x_r' >> 32)
      lsr(x_h, x_r, 32);
      // mant_x_h = (x_r >> 32) & 0x000FFFFF
      ubfx(mant_x_h, x_r, 32, 20); // i.e. top 20 fraction bits of x
      . . .
    }
    bind(K_IS_SET);
    . . .
You should be able to hang the code directly off the algorithm as shown
above, making it clear that it matches the revised algorithm and
allowing meaningful comparison with the original.
If you have changed the code in your latest revisions then please update
the algorithm accordingly to ensure they continue to correspond with
each other.