RFR: 8189107 - AARCH64: create intrinsic for pow

Fri Oct 26 12:21:36 UTC 2018

Hi Andrew,

I prepared new patch with all your comments addressed.

Changes:

- fixed huge y case

- updated scalbn block to use approach you suggested

- updated optimized algorithm pseudo code to reflect all changes, also 
fixed few minor pseudo code issues

- rearranged assembly code:

-- renamed registers variables to have names correlating to C-code where 
applicable

-- introduced local blocks with local labels and register mappings where 
applicable

-- removed unnecessary register remappings

-- moved odd/even check code generation to separate function

-- removed K_IS_0 and K_IS_1 branches code repetition

- added simple jtreg test

Regarding testing:

I found that one of tests I was using ( 
http://hg.openjdk.java.net/jdk/jdk/file/tip/test/jdk/java/lang/Math/PowTests.java) 
actually skip non-corner-case testing. Still, this test has very good 
set of input values, which can be re-used. I created enhancement for 
this test to actually test non-corner-case values ( 
https://bugs.openjdk.java.net/browse/JDK-8211002 ). I modified this test 
locally to have everything testing and it passed.

I also slightly modified bruteforce test I've sent before to iterate 
through X values of range (0, 1E300) using ~10^8 ulps step with several 
Y values(tiny/normal/large/huge), which took several days to run. All 
seems fine.

updated patch: http://cr.openjdk.java.net/~dpochepk/8189107/webrev.04

Thanks,

Dmitrij

On 21/09/2018 6:01 PM, Andrew Dinn wrote:
> Hi Dmitrij,
>
> As promised here is the review (after sig) based on webrev.02. The
> review first describes the problems I have identified. It then continues
> with recommendations for (extensive) rework. Since it is based on
> webrev.02 you will have to make some allowance for the changes you
> introduced in your webrev.03
>
> I have revised you webrev to include fixes for the two errors I
> identified and a new version is available here
>
>    http://cr.openjdk.java.net/~adinn/8189107/webrev.03/
>
> The webrev includes my recommended fix for to the scalbn code in
> macroAssembler_pow.cpp and a correction to the declaration of table
> _pow_coef1 in stubRoutines_aarch64.cpp. I explain these changes below.
>
> I have also uploaded a commented version of the original algorithm and a
> revised algorithm based on your code here
>
>    http://cr.openjdk.java.net/~adinn/8189107/original_algorithm.txt
>    http://cr.openjdk.java.net/~adinn/8189107/revised_algorithm.txt
>
> You have seen the original algorithm modulo a few tweaks that emerged in
>   creating the revised version. The revised algorithm /faithfully/ models
> your webrev.02 code (with a couple of notable exceptions that relate to
> problems described below). That faithful modelling of the code includes
> retaining the order of computation of all values. In particular, it
> models early computation of some data that you appear to have introduced
> in order to pipeline certain operations.
>
> At the same time, the algorithm also introduces a much more coherent
> control structure (inserting 'if then else' in place of GOTO everywhere
> it is possible) and a nested /block structure/ (none of this required
> reordering btw). It profits from this to introduce block local
> declarations which scope the values computed and used at successive
> steps. As far as possible the revised algorithm employs the same naming
> convention as the original algorithm (I'll explain more on that in the
> detailed feedback below).
>
> Why does this matter? Well, once the errors get fixed, by far the
> biggest remaining problems with the existing code are 1) its unclear
> control structure and 2) its incoherent allocation of data to registers.
> The intention of providing block structure and block local use of
> variables in the revised algorithm is to enable introduction of similar
> block structuring and block local declarations for registers and branch
> labels in the generator code. In particular, that will allow the
> generator code to be rewritten to use symbolic names for registers
> consistently throughout the function.
>
> So, a small number of register mappings for variables that are global to
> the algorithm will need to be be retained at the start of the function.
> However, they will use meaningful names like x, exp_mask, one_d, result
> etc instead of the completely meaningless aliases, tm1, tmp2 etc, that
> you provided. Similarly, some of the label declarations (mostly for exit
> cases) will need to be retained at the top level.
>
> However, most register mappings will be for variables that are local to
> a block. So, they will need to be declared at the start of that block
> making it clear where they are used and where they are no longer in use.
> Similarly most label declarations will be only need to be declared at
> the start of the immediately enclosing block that generates the code
> identified by the label. This will ensure declarations are close to
> their point of use and are not in scope after they become redundant (or
> at least not for very long).
>
> Register mappings for variables declared in an outer block that are live
> across nested blocks will then be able to be used consistently in those
> inner blocks while clearly identifying precisely what values are being
> generated, used or updated. The same applies for label declarations.
> They can be used as branch targets in nested blocks but will not be
> visible in outer scopes or sibling scopes.
>
> Where possible the revised algorithm employs the same naming convention
> as the original algorithm for the values it operates on -- with one
> important modification. A suffix convention has been adopted to clarify
> some ambiguities present in the original. For example, this allows us to
> distinguish the double value x from its long bits representation x_r,
> its 32 bit high word x_h or its 32 bit positive high word ix_h. The
> algorithm also employs block local declarations to scope intermediate
> values, employing names starting with 'tmp'. These are often introduced
> in small, local blocks, allowing the same names tmp1, tmp2 etc to be
> reused without ambiguity.
>
> So, I hope it is clear how you can use this algorithm to rewrite the
> generator code to be much more readable and maintainable -- that being
> the ultimate goal here. I'm not willing to let the code be committed
> without this restructuring taking place.
>
> regards,
>
>
> Andrew Dinn
> -----------
> Senior Principal Software Engineer
> Red Hat UK Ltd
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
>
> Problems
> --------
>
> 1) Error in Y_IS_HUGE case
>
> The vector maths that computes the polynomial for this case expects to
> load the first 6 coefficients in table _pow_coef1 in the permuted order
> (0.25, -ivln2, -0.333333333333, 0.5, ivln2_l, ivln2_h). However, the
> table declaration in stubRoutines_aarch64.cpp declares them in the order
> (-0.333333333333, 0.25, 0.5, -ivln2, ivln2_l, ivln2_h).
>
> 2) scalbn implementation is wrong
>
> The original code was using this algorithm
>
> 1  if (0 >= (j_h >> 20))
> 2    tmp1 = n' - 1023               // bias n as an exponent
> 3    tmp2 = (tmp1 & 0x3ff) <<  52   // install as high bits
> 4    two_n_d = LONG_TO_DOUBLE(tmp2)
> 5    z = z * two_n_d              // rescale
> 6    b(CHECK_RESULT_NEGATION);
>
> The problems are:
>
> line 1: as you spotted the test was wrongly implemented using cmp not
> cmpw (j_h is computed using an int add so sign extending as a long fails
> to retain the overflow bit).
>
> line 2: n is the unbiased exponent -- rebiasing it requires adding 1023,
> not subtracting it
>
> line 3: when we hit underflow the unbiased exponent is in the range
> [-1024, -1075]. So, after correcting the sub to an add the exponent in
> tmp1 will be negative (that is precisely the case the if test looks
> for).  Installing the top 12 bits exponent of this negative value into
> the high  bits of a double gives a float with unbiased exponent in range
> [970,  1023] i.e. a very high positive power of 2 rather than a very low
>   negative one. The result is by by about 300 orders of  magnitude!
>
> line 6: As you spotted, the multiply here updates the register storing z
> rather than installing the result in v0
>
> I explain all this not to point out why it is wrong but to show how your
> original version can be salvaged with a few small changes. Basically we
> want to multiply z by 2^n to get the result where n lies between -1023
> and -1075. Depending on the values of z and n the result will be either
> a subnormal double or +0. So, the simplest solution is to do the
> multiply in two stages. Here is a revised algorithm:
>
>    if (0 >= (j_h >> 20)) {
>      double two_n_r  // power of 2 as long bits mapped to r2
>      long biased_n   // n' - 1023 mapped to r2
>      double two_n_d  // used to rescale z to underflow value
>                      // mapped to v17
>      // split the rescale into two steps: 2^-512 then the rest
>      n = n + 512
>      two_n_r = 0x1ff << 52 // high bits for 2^-512
>      two_n_d = LONG_TO_DOUBLE(two_n_r)
>      z' = z' * two_n_d
>      biased_n = n + 1023   // bias remainder -- will now be positive!
>      two_n_r = (biased_n & 0x3ff) << 52 // high bits for 2^n
>      two_n_d = LONG_TO_DOUBLE(two_n_r)
>      result = z' * two_n_d
>     } else {
>       ...
>
> The code for this is:
>
>      cmpw(zr, tmp10, ASR, 20);
>      br(LT, NO_SCALBN);
>      // n = tmp1
>      // rescale first by 2^-512 and then by the rest
>      addw(tmp1, tmp1, 512);        // n -> n + 512
>      movz(tmp2, 0x1FF0, 48);
>      fmovd(v17, tmp2);             // 2^-512
>      fmuld(v18, v18, v17);         // z = z * 2^-512
>      addw(tmp2, tmp1, 1023);       // biased exponent
>      ubfm(tmp2, tmp2, 12, 10);
>      fmovd(v17, tmp2);             // 2^n
>      fmuld(v0, v18, v17);          // result = z * 2^n
>      b(CHECK_RESULT_NEGATION);
>    bind(NO_SCALBN);
>          . . .
>
> I think this is simpler than your alternative. I checked it on several
> test cases and it agrees with the existing StrictMath code.
>
> 3) Use of Estrin form in place of Horner form
>
> I would prefer not to use this without a proof that the re-ordered
> computation does not introduce rounding errors. I doubt that it will and
> I suspect that even if it did the error will be very small, certainly
> small enough that the leeway between what is expected of StrictMath.pow
> and what is allowed for in Math.pow will not cause anyone's expectations
> to be challenged. However, even so I'd prefer not to surprise users.
> Anyway, if Andrew Haley really wants this to be adopted then I'll accept
> his override and you can leave it in Estrin form.
>
> 4) Repetition of code in K_IS_0/K_IS_1 branches
>
> In the Y_IS_NOT_HUGE block 15/17 instructions are generated in the if
> and else branches for k == 0 and k == 1, implementing what is almost
> exactly the same computation. The two generated sequences differ very
> slightly. In the k == 1 case dp_h and dp_l need to be folded into the
> computation to subtract ln2(1.5) from the result while in the k == 0
> case dp_h and dp_l are both 0.0 and can be ignored.
>
> The most important difference is the need to load dp_l/dp_h from the
> coefficient table in one branch while the other merely moves forward the
> cursor which points at the table. The other differences consist of two
> extra operations in the k == 1 branch, an extra fmlavs and an extra
> faddd, which fold in the dp_l and dp_h values.
>
> An alternative would be to use common code for the computation which
> always perform the extra fmlavs and faddd. The revised algorithm
> describes precisely this alternative. To make it work the k = 0 branch
> needs to install 0.0 in the registers used for dp_h and dp_k (not
> necessarily by loading from the table). This shortens the branches,
> relocating 15 common instructions after the branch
>
> As far as code clarity is concerned it is easier to understand and
> maintain if the common code is generated only once. As for performance,
> I believe this trade off of a few more cycles against code size is also
> a better choice. Shaving instructions may give a small improvement in a
> benchmark, especially if the benchmark repeatedly runs with values that
> exercise only one of the paths. However, in real use the extra code size
> from the duplicated code is likely to push more code out of cache. Since
> this is main path code that is actually quite likely to happen.
>
> So, I would like to see the duplication removed unless you can make a
> really strong case for keeping it. If you can provide such a reason then
> an explanation why the duplication is required needs to be provided in
> the revised algorithm and the code and the algorithm need sot be updated
> to specify both paths.
>
> 4) Repetition of odd/even tests in exit cases.
>
> The original algorithm takes the hit of computing the even/odd/fraction
> status of y inline, i.e. in the main path, during special case checks.
> That happens even if the result is not used later. You have chosen to do
> it at the relevant exits, resulting in more repeated code.
>
> These cases will likely be a rare path so the issue of extra code size
> is not going to be very important relative to the saved cycles. However,
> the replication of inline code is a maintenance issue.
>
> It would be better to use a separate function to generate the common
> code that computes the test values (lowest non-zero bit idx and exponent
> of y) avoiding any need to read, check and fix the generator code in
> different places. Please update the code accordingly.
>
> 5) Test code
>
> You need to include a test as part of this patch which checks the
> outputs are correct for a few selected inputs that exercise the
> underflow and Y_IS_HUGE branches. I adapted the existing Math tests to
> include these extra checks:
>
>      public static int testUnderflow() {
>          int failures = 0;
>          failures += testPowCase(1.5E-2D, 170D, 8.6201344461973E-311D);
>          failures += testPowCase(1.55E-2D, 171.3D,  1.00883443217485E-310D);
>          // failures += testPowCase(150D, 141.6D, 1.3630829399139085E308);
>          return failures;
>      }
>
>      public static int testHugeY() {
>          int failures = 0;
>          long yl = 0x1_1000_0000L;
>          failures += testPowCase(1.0000000001, (double)yl,
> 1.5782873649891997D);
>          failures += testPowCase(1.0000000001, (double)yl + 0.3,
> 1.5782873650365483D);
>          return failures;
>
> You don't have to add this to the existing math test suite. A simple
> standalone test which checks that the StrictMath and Math outputs agree
> on these inputs is all that is needed.
>
>
> Rework
> ------
>
> 1) Please check that the revised algorithm I have provided accurately
> reflects the computations performed by your code. That will require
> changing the code to deal with the error cases 1, 2, 4 and 5 above. If
> you stick with the Estrin form in case 3 then ensure the algorithm is
> correct. If you stick with Horner form then update the algorithm so it
> is consistent.
>
> The algorithm currently details all mappings of variable to registers.
> That is provided as a heuristic which i) allowed me to track usages when
> writing up the algorithm and ii) will allow you to analyze the current
> register mappings and rationalize them. Once you have a more coherent
> strategy for allocating variables to registers details of the mapping
> can and should be removed.
>
> As mentioned, the algorithm uses a suffix notation to distinguish
> different values where there is some scope for ambiguity. The suffices
> are used as follows
>
>    1) '_d' double values (mapped to d registers)
>
>    2) '_hd' and '_ld' hi/lo double pairs used to retain accuracy (mapped
> to independent d registers)
>
>    3) '_d2' double vectors (mapped to 2d v registers)
>
>    4) '_r' long values that represent the long version of a double value
> (mapped to general x registers)
>
>    5) '_h' int values that represent the high word of a double value
> (mapped to general w registers)
>
> In many unambiguous cases a suffix is omitted e.g. x, y, k, n.
>
> 2) Sort out inconsistent mappings and unnecessary remappings
>
> One of the problems I faced in writing up the algorithm is that some of
> your register use appears to be inconsistent -- the same 'logical'
> variable is mapped to different registers at different points in the code.
>
> In some cases, this reflects different use of the same name for
> different quantities calculated at different stages in the algorithm
> (for example, z is used to name both log2(x) and then reused later for
> the fractional part of log2(x)). Most of those reuses are actually
> catered for by declaring the var in one block and then redeclaring it in
> a different sibling block. If this block structure is replicated in the
> code then it will enable z to be mapped differently for each different
> scope. However, that's not the preferred outcome.  It would make the
> code easier to follow if every variable named in the algorithm was
> always mapped to the same register unless there was a clear need to
> relocate it.
>
> There are also cases where a remapping is performed (without any
> discernible reason) within the same scope or within nested scopes! For
> example, after the sign of x_r has been noted (and, possibly, it's sign
> bit removed) the resulting absolute value of x_r is moved from r0 to r1
> (using an explicit mov). There doesn't seem to be any need to do this.
> Likewise, in the COMPUTE_POW_PROCEED block the value for z stored in v0
> is reassigned to v18 in the same block!
>
> I have flagged these remappings with a !!! comment and avoided the
> ambiguity that arises by adding a ' to the remapped value (x_r', z') and
> using it thereafter. This ensures that uses of the different versions of
> the value located in different registers can be distinguished.
>
> An example of remapping in nested scope is provided by p_hd and p_ld. At
> the outermost scope these values are mapped to v5 and b6. However, in
> the K_IS_SET block they are stored in v17 and v16 (values for v and u,
> local to the same block, are placed in v5 and v6 so there is no reason
> why the outer scope mapping could not have been retained).
>
> I'd like to see a much more obvious and obviously consistent plan
> adopted for mappings before the code is committed.
>
> 3) Insert my original and revised algorithms into your code in place of
> the current ones.
>
> 4) Change the code to introduce local blocks as per the revised algorithm
>
> This should mostly be a simple matter of introducing braces into the
> code at strategic locations (although please re-indent).
>
> 5) Change the code to use symbolic names for register arguments and
> declare those names as Register or FloatRegister in the appropriate
> local scope
>
> The main point of employing consistent, logical names for values defined
> in the algorithm  is to allow registers employed in the code to be
> identified using meaningful names rather than using r0, r1, v0, v1,
> tmp1, tmp2 etc.
>
> So, for example at the top level of the routine you need to declare
> register mappings for the function global variables and then use them in
> all the generated instructions e.g. the code should look like
>
>    // double x      // input arg
>    // double y      // input arg
>    FloatRegister x = v0;
>    FloatRegister y = v1;
>    // long y_r      // y as long bits
>    Register y_r = rscratch2
>    // long x_r      // x as long bits
>    Register x_r = r0
>    // double one_d  // 1.0d
>    FloatRegister one_d = v2
>
>    . . .
>
>    // y_r = DOUBLE_TO_LONG(y)
>    fmovd(y_r, y);
>    // x_r = DOUBLE_TO_LONG(x)
>    fmovd(x_r, x)
>
>    . . .
>
> Similarly, in nested blocks you need to introduce block local names for
> registers and then use them in the code. For example in the
> SUBNORMAL_HANDLED block
>
>    bind(SUBNORMAL_HANDLED);
>    // block introduced to scope vars/labels in this region
>    {
>      Label K_IS_SET;
>      // int ix_h      // |x| with exponent rescaled so 1 =< ix <
>      Register ix_h = r2;
>      // int k         // range 0 or 1 for poly expansion
>      Register k = r7
>      // block introduced to scope vars/labels in this subregion
>      {
>        // int x_h        // high word of x
>        Register x_h = r2
>        //int mant_x_h   // mantissa bits of high word of
>        Register mant_x_h = r10
>
>        . . .
>
>        // x_h = (x_r' >> 32)
>        lsr(x_h, x_r, 32);
>        // mant_x_h = (x_r >> 32) & 0x000FFFFF
>        ubfx(mant_x_h, x_r, 32, 20); // i.e. top 20 fractions bits of x
>
>        . . .
>
>      }
>      bind(K_IS_SET);
>
>    . . .
>
> You should be able to hang the code directly off the algorithm as shown
> above, making it clear that it matches the revised algorithm and
> allowing meaningful comparison with the original.
>
> If you have changed the code in your latest revisions then please update
> the algorithm accordingly to ensure they continue to correspond with
> each other.