RFR: 8189107 - AARCH64: create intrinsic for pow
Dmitrij Pochepko
dmitrij.pochepko at bell-sw.com
Fri Oct 26 12:21:36 UTC 2018
Hi Andrew,
I prepared a new patch with all your comments addressed.
Changes:
- fixed huge y case
- updated scalbn block to use approach you suggested
- updated optimized algorithm pseudocode to reflect all changes, and
fixed a few minor pseudocode issues
- rearranged assembly code:
-- renamed register variables to names correlating with the C code where
applicable
-- introduced local blocks with local labels and register mappings where
applicable
-- removed unnecessary register remappings
-- moved odd/even check code generation to separate function
-- removed code repetition in the K_IS_0 and K_IS_1 branches
- added a simple jtreg test
Regarding testing:
I found that one of the tests I was using (
http://hg.openjdk.java.net/jdk/jdk/file/tip/test/jdk/java/lang/Math/PowTests.java)
actually skips non-corner-case testing. Still, this test has a very good
set of input values, which can be reused. I created an enhancement for
this test to actually exercise non-corner-case values (
https://bugs.openjdk.java.net/browse/JDK-8211002 ). I modified this test
locally to run everything, and it passed.
I also slightly modified the brute-force test I sent before to iterate
through X values in the range (0, 1E300) using a step of ~10^8 ulps with
several Y values (tiny/normal/large/huge), which took several days to
run. All seems fine.
updated patch: http://cr.openjdk.java.net/~dpochepk/8189107/webrev.04
Thanks,
Dmitrij
On 21/09/2018 6:01 PM, Andrew Dinn wrote:
> Hi Dmitrij,
>
> As promised here is the review (after sig) based on webrev.02. The
> review first describes the problems I have identified. It then continues
> with recommendations for (extensive) rework. Since it is based on
> webrev.02 you will have to make some allowance for the changes you
> introduced in your webrev.03
>
> I have revised your webrev to include fixes for the two errors I
> identified and a new version is available here
>
> http://cr.openjdk.java.net/~adinn/8189107/webrev.03/
>
> The webrev includes my recommended fix to the scalbn code in
> macroAssembler_pow.cpp and a correction to the declaration of table
> _pow_coef1 in stubRoutines_aarch64.cpp. I explain these changes below.
>
> I have also uploaded a commented version of the original algorithm and a
> revised algorithm based on your code here
>
> http://cr.openjdk.java.net/~adinn/8189107/original_algorithm.txt
> http://cr.openjdk.java.net/~adinn/8189107/revised_algorithm.txt
>
> You have seen the original algorithm modulo a few tweaks that emerged in
> creating the revised version. The revised algorithm /faithfully/ models
> your webrev.02 code (with a couple of notable exceptions that relate to
> problems described below). That faithful modelling of the code includes
> retaining the order of computation of all values. In particular, it
> models early computation of some data that you appear to have introduced
> in order to pipeline certain operations.
>
> At the same time, the algorithm also introduces a much more coherent
> control structure (inserting 'if then else' in place of GOTO everywhere
> it is possible) and a nested /block structure/ (none of this required
> reordering btw). It profits from this to introduce block local
> declarations which scope the values computed and used at successive
> steps. As far as possible the revised algorithm employs the same naming
> convention as the original algorithm (I'll explain more on that in the
> detailed feedback below).
>
> Why does this matter? Well, once the errors get fixed, by far the
> biggest remaining problems with the existing code are 1) its unclear
> control structure and 2) its incoherent allocation of data to registers.
> The intention of providing block structure and block local use of
> variables in the revised algorithm is to enable introduction of similar
> block structuring and block local declarations for registers and branch
> labels in the generator code. In particular, that will allow the
> generator code to be rewritten to use symbolic names for registers
> consistently throughout the function.
>
> So, a small number of register mappings for variables that are global to
> the algorithm will need to be retained at the start of the function.
> However, they will use meaningful names like x, exp_mask, one_d, result
> etc instead of the completely meaningless aliases, tm1, tmp2 etc, that
> you provided. Similarly, some of the label declarations (mostly for exit
> cases) will need to be retained at the top level.
>
> However, most register mappings will be for variables that are local to
> a block. So, they will need to be declared at the start of that block
> making it clear where they are used and where they are no longer in use.
> Similarly, most label declarations will only need to be declared at
> the start of the immediately enclosing block that generates the code
> identified by the label. This will ensure declarations are close to
> their point of use and are not in scope after they become redundant (or
> at least not for very long).
>
> Register mappings for variables declared in an outer block that are live
> across nested blocks will then be able to be used consistently in those
> inner blocks while clearly identifying precisely what values are being
> generated, used or updated. The same applies for label declarations.
> They can be used as branch targets in nested blocks but will not be
> visible in outer scopes or sibling scopes.
>
> Where possible the revised algorithm employs the same naming convention
> as the original algorithm for the values it operates on -- with one
> important modification. A suffix convention has been adopted to clarify
> some ambiguities present in the original. For example, this allows us to
> distinguish the double value x from its long bits representation x_r,
> its 32 bit high word x_h or its 32 bit positive high word ix_h. The
> algorithm also employs block local declarations to scope intermediate
> values, employing names starting with 'tmp'. These are often introduced
> in small, local blocks, allowing the same names tmp1, tmp2 etc to be
> reused without ambiguity.
>
> So, I hope it is clear how you can use this algorithm to rewrite the
> generator code to be much more readable and maintainable -- that being
> the ultimate goal here. I'm not willing to let the code be committed
> without this restructuring taking place.
>
> regards,
>
>
> Andrew Dinn
> -----------
> Senior Principal Software Engineer
> Red Hat UK Ltd
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
>
> Problems
> --------
>
> 1) Error in Y_IS_HUGE case
>
> The vector maths that computes the polynomial for this case expects to
> load the first 6 coefficients in table _pow_coef1 in the permuted order
> (0.25, -ivln2, -0.333333333333, 0.5, ivln2_l, ivln2_h). However, the
> table declaration in stubRoutines_aarch64.cpp declares them in the order
> (-0.333333333333, 0.25, 0.5, -ivln2, ivln2_l, ivln2_h).
>
> 2) scalbn implementation is wrong
>
> The original code was using this algorithm
>
> 1 if (0 >= (j_h >> 20))
> 2 tmp1 = n' - 1023 // bias n as an exponent
> 3 tmp2 = (tmp1 & 0x3ff) << 52 // install as high bits
> 4 two_n_d = LONG_TO_DOUBLE(tmp2)
> 5 z = z * two_n_d // rescale
> 6 b(CHECK_RESULT_NEGATION);
>
> The problems are:
>
> line 1: as you spotted the test was wrongly implemented using cmp not
> cmpw (j_h is computed using an int add so sign extending as a long fails
> to retain the overflow bit).
>
> line 2: n is the unbiased exponent -- rebiasing it requires adding 1023,
> not subtracting it
>
> line 3: when we hit underflow the unbiased exponent is in the range
> [-1024, -1075]. So, after correcting the sub to an add the exponent in
> tmp1 will be negative (that is precisely the case the if test looks
> for). Installing the exponent bits of this negative value into the
> high bits of a double gives a double with unbiased exponent in the range
> [970, 1023], i.e. a very high positive power of 2 rather than a very low
> negative one. The result is off by about 300 orders of magnitude!
>
> line 5: As you spotted, the multiply here updates the register storing z
> rather than installing the result in v0
>
> I explain all this not to point out why it is wrong but to show how your
> original version can be salvaged with a few small changes. Basically we
> want to multiply z by 2^n to get the result where n lies between -1023
> and -1075. Depending on the values of z and n the result will be either
> a subnormal double or +0. So, the simplest solution is to do the
> multiply in two stages. Here is a revised algorithm:
>
> if (0 >= (j_h >> 20)) {
> long two_n_r // power of 2 as long bits mapped to r2
> long biased_n // n' + 1023 mapped to r2
> double two_n_d // used to rescale z to underflow value
> // mapped to v17
> // split the rescale into two steps: 2^-512 then the rest
> n = n + 512
> two_n_r = 0x1ff << 52 // high bits for 2^-512
> two_n_d = LONG_TO_DOUBLE(two_n_r)
> z' = z' * two_n_d
> biased_n = n + 1023 // bias remainder -- will now be positive!
> two_n_r = (biased_n & 0x3ff) << 52 // high bits for 2^n
> two_n_d = LONG_TO_DOUBLE(two_n_r)
> result = z' * two_n_d
> } else {
> ...
>
> The code for this is:
>
> cmpw(zr, tmp10, ASR, 20);
> br(LT, NO_SCALBN);
> // n = tmp1
> // rescale first by 2^-512 and then by the rest
> addw(tmp1, tmp1, 512); // n -> n + 512
> movz(tmp2, 0x1FF0, 48);
> fmovd(v17, tmp2); // 2^-512
> fmuld(v18, v18, v17); // z = z * 2^-512
> addw(tmp2, tmp1, 1023); // biased exponent
> ubfm(tmp2, tmp2, 12, 10);
> fmovd(v17, tmp2); // 2^n
> fmuld(v0, v18, v17); // result = z * 2^n
> b(CHECK_RESULT_NEGATION);
> bind(NO_SCALBN);
> . . .
>
> I think this is simpler than your alternative. I checked it on several
> test cases and it agrees with the existing StrictMath code.
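To illustrate the two-stage rescale outside the assembler, here is a minimal Java sketch (class and method names are mine, not part of the patch) mirroring the pseudocode above, with Double.longBitsToDouble standing in for fmovd:

```java
public class ScalbnSketch {
    // Multiply z by 2^n for n in roughly [-1075, -1024], where the result
    // is a subnormal double or +0. A single 2^n factor is not representable
    // as a normal double here, so the rescale is split: 2^-512 first, then
    // the (now representable) remainder 2^(n + 512).
    public static double rescale(double z, int n) {
        // 2^-512 has biased exponent 1023 - 512 = 511 = 0x1FF
        double twoMinus512 = Double.longBitsToDouble(0x1FFL << 52);
        z = z * twoMinus512;
        // remainder 2^(n + 512): its biased exponent n + 512 + 1023 is now
        // positive, so it can be installed directly in the exponent field
        long biased = (long) (n + 512 + 1023) & 0x3FF;
        double rest = Double.longBitsToDouble(biased << 52);
        return z * rest;
    }

    public static void main(String[] args) {
        // agrees with the library rescale
        System.out.println(rescale(1.0, -1060) == Math.scalb(1.0, -1060)); // prints true
    }
}
```

The first multiply keeps z normal; only the second takes it into the subnormal range, so the underflow is handled by a single final rounding.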
>
> 3) Use of Estrin form in place of Horner form
>
> I would prefer not to use this without a proof that the re-ordered
> computation does not introduce rounding errors. I doubt that it will,
> and I suspect that even if it did the error would be very small, certainly
> small enough that the leeway between what is expected of StrictMath.pow
> and what is allowed for in Math.pow will not cause anyone's expectations
> to be challenged. However, even so I'd prefer not to surprise users.
> Anyway, if Andrew Haley really wants this to be adopted then I'll accept
> his override and you can leave it in Estrin form.
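For readers unfamiliar with the distinction, the two forms differ only in association order, which is exactly why the rounding behaviour is in question. A toy Java illustration with made-up coefficients (not the actual _pow_coef1 values):

```java
public class PolyForms {
    // Illustrative coefficients only -- not taken from the pow tables.
    static final double C0 = 0.5, C1 = -0.333333333333, C2 = 0.25, C3 = -0.2;

    // Horner form: one serial chain of dependent multiply-adds.
    public static double horner(double x) {
        return ((C3 * x + C2) * x + C1) * x + C0;
    }

    // Estrin form: the two halves are independent and can be evaluated in
    // parallel (pipelined or vectorized), at the cost of a rounding order
    // different from Horner's.
    public static double estrin(double x) {
        double x2 = x * x;
        return (C0 + C1 * x) + x2 * (C2 + C3 * x);
    }
}
```

For well-conditioned inputs any difference between the two is typically at most a few ulps, but it is not provably zero, which is the reviewer's point.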
>
> 4) Repetition of code in K_IS_0/K_IS_1 branches
>
> In the Y_IS_NOT_HUGE block 15/17 instructions are generated in the if
> and else branches for k == 0 and k == 1, implementing what is almost
> exactly the same computation. The two generated sequences differ very
> slightly. In the k == 1 case dp_h and dp_l need to be folded into the
> computation to subtract ln2(1.5) from the result while in the k == 0
> case dp_h and dp_l are both 0.0 and can be ignored.
>
> The most important difference is the need to load dp_l/dp_h from the
> coefficient table in one branch while the other merely moves forward the
> cursor which points at the table. The other differences consist of two
> extra operations in the k == 1 branch, an extra fmlavs and an extra
> faddd, which fold in the dp_l and dp_h values.
>
> An alternative would be to use common code for the computation which
> always performs the extra fmlavs and faddd. The revised algorithm
> describes precisely this alternative. To make it work the k == 0 branch
> needs to install 0.0 in the registers used for dp_h and dp_l (not
> necessarily by loading from the table). This shortens the branches,
> relocating 15 common instructions after the branch.
>
> As far as code clarity is concerned it is easier to understand and
> maintain if the common code is generated only once. As for performance,
> I believe this trade off of a few more cycles against code size is also
> a better choice. Shaving instructions may give a small improvement in a
> benchmark, especially if the benchmark repeatedly runs with values that
> exercise only one of the paths. However, in real use the extra code size
> from the duplicated code is likely to push more code out of cache. Since
> this is main path code that is actually quite likely to happen.
>
> So, I would like to see the duplication removed unless you can make a
> really strong case for keeping it. If you can provide such a reason then
> an explanation of why the duplication is required needs to be provided in
> the revised algorithm, and the code and the algorithm need to be updated
> to specify both paths.
>
> 5) Repetition of odd/even tests in exit cases.
>
> The original algorithm takes the hit of computing the even/odd/fraction
> status of y inline, i.e. in the main path, during special case checks.
> That happens even if the result is not used later. You have chosen to do
> it at the relevant exits, resulting in more repeated code.
>
> These cases will likely be a rare path so the issue of extra code size
> is not going to be very important relative to the saved cycles. However,
> the replication of inline code is a maintenance issue.
>
> It would be better to use a separate function to generate the common
> code that computes the test values (lowest non-zero bit idx and exponent
> of y) avoiding any need to read, check and fix the generator code in
> different places. Please update the code accordingly.
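The common computation in question is essentially the classic fdlibm "yisint" classification. As a sketch of the logic such a helper would generate (a Java stand-in, names mine; y is assumed finite and non-zero):

```java
public class YParity {
    // Returns 0 if y is not an integer, 1 if it is an odd integer,
    // 2 if it is an even integer -- derived purely from the bit pattern,
    // i.e. from the exponent of y and the lowest non-zero mantissa bit.
    public static int classify(double y) {
        long bits = Double.doubleToRawLongBits(y);
        int exp = (int) ((bits >>> 52) & 0x7FF) - 1023;   // unbiased exponent
        if (exp < 0) return 0;                  // |y| < 1: cannot be an integer
        if (exp >= 53) return 2;                // |y| >= 2^53: always an even integer
        long sig = (1L << 52) | (bits & 0x000FFFFFFFFFFFFFL); // implicit bit + mantissa
        long fracMask = (1L << (52 - exp)) - 1; // bits below the binary point
        if ((sig & fracMask) != 0) return 0;    // fraction bits set: not an integer
        return ((sig >>> (52 - exp)) & 1) == 1 ? 1 : 2; // parity of the units bit
    }
}
```

Generating this once, in a separate function, is what avoids the per-exit repetition described above.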
>
> 6) Test code
>
> You need to include a test as part of this patch which checks the
> outputs are correct for a few selected inputs that exercise the
> underflow and Y_IS_HUGE branches. I adapted the existing Math tests to
> include these extra checks:
>
> public static int testUnderflow() {
>     int failures = 0;
>     failures += testPowCase(1.5E-2D, 170D, 8.6201344461973E-311D);
>     failures += testPowCase(1.55E-2D, 171.3D, 1.00883443217485E-310D);
>     // failures += testPowCase(150D, 141.6D, 1.3630829399139085E308);
>     return failures;
> }
>
> public static int testHugeY() {
>     int failures = 0;
>     long yl = 0x1_1000_0000L;
>     failures += testPowCase(1.0000000001, (double)yl, 1.5782873649891997D);
>     failures += testPowCase(1.0000000001, (double)yl + 0.3, 1.5782873650365483D);
>     return failures;
> }
>
> You don't have to add this to the existing math test suite. A simple
> standalone test which checks that the StrictMath and Math outputs agree
> on these inputs is all that is needed.
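A standalone version of that check might look like the following (class and method names are mine; the inputs are the review's suggested underflow and huge-y cases):

```java
public class PowAgreementTest {
    // Fails if the (possibly intrinsified) Math.pow disagrees with the
    // fdlibm-based StrictMath.pow on a hand-picked input.
    static int checkAgree(double x, double y) {
        double m = Math.pow(x, y);
        double s = StrictMath.pow(x, y);
        if (Double.compare(m, s) != 0) {
            System.err.println("pow(" + x + ", " + y + "): Math=" + m
                               + " StrictMath=" + s);
            return 1;
        }
        return 0;
    }

    public static void main(String[] args) {
        int failures = 0;
        failures += checkAgree(1.5E-2D, 170D);                               // underflow
        failures += checkAgree(1.55E-2D, 171.3D);                            // underflow
        failures += checkAgree(1.0000000001, (double) 0x1_1000_0000L);       // huge y
        failures += checkAgree(1.0000000001, (double) 0x1_1000_0000L + 0.3); // huge y
        if (failures != 0) {
            throw new RuntimeException(failures + " case(s) disagree");
        }
    }
}
```

Note that Math.pow is in general only required to be within 1 ulp of the correctly rounded result, so an exact-equality check like this is only valid for inputs where the intrinsic is known to match StrictMath, as assumed here.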
>
>
> Rework
> ------
>
> 1) Please check that the revised algorithm I have provided accurately
> reflects the computations performed by your code. That will require
> changing the code to deal with the error cases 1, 2, 4 and 5 above. If
> you stick with the Estrin form in case 3 then ensure the algorithm is
> correct. If you stick with Horner form then update the algorithm so it
> is consistent.
>
> The algorithm currently details all mappings of variable to registers.
> That is provided as a heuristic which i) allowed me to track usages when
> writing up the algorithm and ii) will allow you to analyze the current
> register mappings and rationalize them. Once you have a more coherent
> strategy for allocating variables to registers details of the mapping
> can and should be removed.
>
> As mentioned, the algorithm uses a suffix notation to distinguish
> different values where there is some scope for ambiguity. The suffixes
> are used as follows:
>
> 1) '_d' double values (mapped to d registers)
>
> 2) '_hd' and '_ld' hi/lo double pairs used to retain accuracy (mapped
> to independent d registers)
>
> 3) '_d2' double vectors (mapped to 2d v registers)
>
> 4) '_r' long values that represent the long version of a double value
> (mapped to general x registers)
>
> 5) '_h' int values that represent the high word of a double value
> (mapped to general w registers)
>
> In many unambiguous cases a suffix is omitted e.g. x, y, k, n.
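To make the convention concrete, here is how the differently suffixed views of one value relate, expressed in Java terms (an illustration of the naming only; the generator of course works on registers, not Java variables):

```java
public class SuffixDemo {
    public static void main(String[] args) {
        double x    = -2.5;
        long   x_r  = Double.doubleToRawLongBits(x); // '_r': x as long bits
        int    x_h  = (int) (x_r >>> 32);            // '_h': high 32-bit word of x
        int    ix_h = x_h & 0x7FFFFFFF;              // 'i' prefix: sign bit cleared
        // -2.5 is 0xC004000000000000, so x_h is 0xC0040000 and ix_h 0x40040000
        System.out.printf("x_r=%016X x_h=%08X ix_h=%08X%n", x_r, x_h, ix_h);
    }
}
```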
>
> 2) Sort out inconsistent mappings and unnecessary remappings
>
> One of the problems I faced in writing up the algorithm is that some of
> your register use appears to be inconsistent -- the same 'logical'
> variable is mapped to different registers at different points in the code.
>
> In some cases, this reflects different use of the same name for
> different quantities calculated at different stages in the algorithm
> (for example, z is used to name both log2(x) and then reused later for
> the fractional part of log2(x)). Most of those reuses are actually
> catered for by declaring the var in one block and then redeclaring it in
> a different sibling block. If this block structure is replicated in the
> code then it will enable z to be mapped differently for each different
> scope. However, that's not the preferred outcome. It would make the
> code easier to follow if every variable named in the algorithm was
> always mapped to the same register unless there was a clear need to
> relocate it.
>
> There are also cases where a remapping is performed (without any
> discernible reason) within the same scope or within nested scopes! For
> example, after the sign of x_r has been noted (and, possibly, its sign
> bit removed) the resulting absolute value of x_r is moved from r0 to r1
> (using an explicit mov). There doesn't seem to be any need to do this.
> Likewise, in the COMPUTE_POW_PROCEED block the value for z stored in v0
> is reassigned to v18 in the same block!
>
> I have flagged these remappings with a !!! comment and avoided the
> ambiguity that arises by adding a ' to the remapped value (x_r', z') and
> using it thereafter. This ensures that uses of the different versions of
> the value located in different registers can be distinguished.
>
> An example of remapping in nested scope is provided by p_hd and p_ld. At
> the outermost scope these values are mapped to v5 and b6. However, in
> the K_IS_SET block they are stored in v17 and v16 (values for v and u,
> local to the same block, are placed in v5 and v6 so there is no reason
> why the outer scope mapping could not have been retained).
>
> I'd like to see a much more obvious and obviously consistent plan
> adopted for mappings before the code is committed.
>
> 3) Insert my original and revised algorithms into your code in place of
> the current ones.
>
> 4) Change the code to introduce local blocks as per the revised algorithm
>
> This should mostly be a simple matter of introducing braces into the
> code at strategic locations (although please re-indent).
>
> 5) Change the code to use symbolic names for register arguments and
> declare those names as Register or FloatRegister in the appropriate
> local scope
>
> The main point of employing consistent, logical names for values defined
> in the algorithm is to allow registers employed in the code to be
> identified using meaningful names rather than using r0, r1, v0, v1,
> tmp1, tmp2 etc.
>
> So, for example at the top level of the routine you need to declare
> register mappings for the function global variables and then use them in
> all the generated instructions e.g. the code should look like
>
> // double x // input arg
> // double y // input arg
> FloatRegister x = v0;
> FloatRegister y = v1;
> // long y_r // y as long bits
> Register y_r = rscratch2;
> // long x_r // x as long bits
> Register x_r = r0;
> // double one_d // 1.0d
> FloatRegister one_d = v2;
>
> . . .
>
> // y_r = DOUBLE_TO_LONG(y)
> fmovd(y_r, y);
> // x_r = DOUBLE_TO_LONG(x)
> fmovd(x_r, x);
>
> . . .
>
> Similarly, in nested blocks you need to introduce block local names for
> registers and then use them in the code. For example in the
> SUBNORMAL_HANDLED block
>
> bind(SUBNORMAL_HANDLED);
> // block introduced to scope vars/labels in this region
> {
> Label K_IS_SET;
> // int ix_h // |x| with exponent rescaled so 1 <= ix <
> Register ix_h = r2;
> // int k // range 0 or 1 for poly expansion
> Register k = r7;
> // block introduced to scope vars/labels in this subregion
> {
> // int x_h // high word of x
> Register x_h = r2;
> // int mant_x_h // mantissa bits of high word of x
> Register mant_x_h = r10;
>
> . . .
>
> // x_h = (x_r' >> 32)
> lsr(x_h, x_r, 32);
> // mant_x_h = (x_r >> 32) & 0x000FFFFF
> ubfx(mant_x_h, x_r, 32, 20); // i.e. top 20 fraction bits of x
>
> . . .
>
> }
> bind(K_IS_SET);
>
> . . .
>
> You should be able to hang the code directly off the algorithm as shown
> above, making it clear that it matches the revised algorithm and
> allowing meaningful comparison with the original.
>
> If you have changed the code in your latest revisions then please update
> the algorithm accordingly to ensure they continue to correspond with
> each other.
More information about the hotspot-compiler-dev
mailing list