[aarch64-port-dev ] RFR: 8189107 - AARCH64: create intrinsic for pow

Dmitrij Pochepko dmitrij.pochepko at bell-sw.com
Mon Aug 20 16:00:32 UTC 2018


Hi Andrew,

First of all, thank you for such an attentive approach.


I prepared webrev.02, which addresses your comments; see the items
below.

A brief description follows (including the 4 large missing elements you
described):

1. <Validation tests which verify that correct results are obtained in 
each of the corner cases laid out in the special case rules in the C 
code header.>

and

2. <Validation tests which verify that correct results are obtained for 
a representative range of non-special case input pairs that exercise the 
main routine.>


I believe these are already covered by the existing tests I ran while
checking this patch. You can find the existing JDK jtreg test here:
http://hg.openjdk.java.net/jdk/jdk/file/tip/test/jdk/java/lang/Math/PowTests.java

which covers all the corner cases multiple times: 75 * 75 combinations
of special and ordinary values, where a 75-element array is used to form
all combinations of X and Y, so every special value of X and Y is tested
against 75 other arguments.
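
For illustration, the cross-product idea looks roughly like this in C
(the values, the harness and the use of the platform pow() as reference
are mine for the sketch, not the jtreg code itself):

    #include <math.h>
    #include <stdio.h>

    static const double vals[] = {
        0.0, -0.0, 1.0, -1.0, 0.5, 2.0, 3.0, -3.0, 1.5, 2.7e-120,
        1.0e300, 4.9e-324, INFINITY, -INFINITY, NAN /* ...75 in the real test */
    };

    static int check_pairs(double (*pow_under_test)(double, double)) {
        int failures = 0;
        unsigned n = sizeof vals / sizeof vals[0];
        for (unsigned i = 0; i < n; i++)
            for (unsigned j = 0; j < n; j++) {
                double want = pow(vals[i], vals[j]); /* reference result */
                double got  = pow_under_test(vals[i], vals[j]);
                /* NaN == NaN must count as a match; a real harness would
                   also compare the signs of zero results bit-exactly */
                if (got != want && !(isnan(got) && isnan(want))) {
                    printf("pow(%g, %g): got %g, want %g\n",
                           vals[i], vals[j], got, want);
                    failures++;
                }
            }
        return failures;
    }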

I also used the JCK tests. These consist of 2 parts:
api/java_lang/Math/pow1Test.java (30 tests for special cases) and
api/java_lang/Math/powTests.java (1710 tests for non-special values).
These tests already cover the code in the way you described. Please let
me know if you think I should add more tests.


3. <Performance tests which compare performance of this implementation 
against the performance obtained using the C code for a selection of 
different ranges of possible input pairs, each such range benchmarked 
independently>

I upgraded the benchmark so that every special case is measured separately: 
http://cr.openjdk.java.net/~dpochepk/8189107/

Updated results are here (up to +200% improvement): 
http://cr.openjdk.java.net/~dpochepk/8189107/math.pow.full.xls


4. <The comments in the code explain some of what is happening but they 
do not explain the algorithm being used clearly enough for it to be 
possible to understand and maintain the code as is.>

I added C-like pseudo-code with additional comments, including links to 
the labels and explanations of the vectorization and optimizations.


updated webrev: http://cr.openjdk.java.net/~dpochepk/8189107/webrev.02/

Thanks,

Dmitrij


On 17/08/18 12:21, Andrew Dinn wrote:
> On 16/08/18 10:23, Andrew Dinn wrote:
>> On 14/08/18 17:24, Dmitrij Pochepko wrote:
>>> Benchmarks:
>>>
>>> I created benchmark to measure Math.pow:
>>> http://cr.openjdk.java.net/~dpochepk/8189107/MathBench.java
>> Just a preliminary comment before I get back to the code.
>> . . .
>> I am not clear that it will significantly affect the outcome that the
>> two values are equal, but the fact that they are all positive is perhaps
>> important. It might be worth re-running this benchmark using independent
>> ranges for X and Y, including negative values for both:
> Ok, let's revise that, now I have engaged my brain with the maths and,
> to a lesser degree, the code. Clearly, negative X values are only
> interesting for integral Y values (still worth testing) and negative Y
> values are not the most important issue, certainly not as important as
> /small/ Y values. However, rather than considering those cases directly
> I have provided a more general critique below and indicated the range of
> tests that are really needed here.
>
> I suggest you read that critique carefully, especially the Why?
> arguments which motivate it. However, to summarize, the upshot is:
>
> you need to provide more and better testing, both to validate the
> implementation and to better measure performance differences
>
> you need to provide clearer, more complete and more precise
> documentation of the generated asm code, including a new C or pseudocode
> algorithm that accurately matches the asm, making it clear how the asm
> is structured and how it relates to the original C code
>
> Andrew Dinn
> -----------
> Senior Principal Software Engineer
> Red Hat UK Ltd
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
>
> ----- 8< -------- 8< -------- 8< -------- 8< -------- 8< -------- 8< ---
>
> There are 4 major things that I think are missing in this patch that
> really need covering before we can accept it into the code base and
> /commit/ to maintaining it. Without these things being addressed I am
> not happy about including what is a much more complex implementation
> when the performance gain is a few tens of percent. n.b. an order of
> magnitude -- even if it was only for certain important, common cases --
> would be much more compelling, so one of the tasks is to investigate
> that option.
>
> The missing elements are as follows:
>
> 1) Validation tests which verify that correct results are obtained in
> each of the corner cases laid out in the special case rules in the C
> code header.
>
> In cases where there is a specific pair of values which triggers the
> relevant case processing then each such combination needs to be checked
> and the result compared to the one returned by the current algorithm.
>
> In cases where the rule applies to a range of values (e.g. 0 **
> +/-anything or -anything ** integer) then a 'representative' range of
> input pairs should be used.
>
> So, for example, in the case of 0 ** +/-anything it should be tested
> using a few values (e.g. 1.5, 1.7, 255.0, 1000.0, 2.0e-23, 2.7e-120 or
> whatever) not covered by the previous case handling.
>
> n.b. those numbers are chosen because, unlike the numbers you chose to
> test with, they include exact and inexact even and odd integral and
> non-integral representations. This may not matter much here but making a
> suitable selection will matter when it comes to driving code through the
> various different paths in your main routine and the (mostly)
> corresponding paths in the original C code.
>
> Similarly, for -anything ** integer a range of positive and negative
> integer Y values should be used in combination with a range of anything
> X values to ensure the test exercises as many different paths through
> the main body as possible (e.g. large, medium, small and subnormal X
> values). This definitely ought to include cases where overflow and
> underflow occur.
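>
> (Sketched in C, a driver for those pairs might look like this -- the
> values are purely illustrative, chosen so that both overflow and
> underflow occur:)
>
>     #include <math.h>
>     #include <stdio.h>
>
>     int main(void) {
>         const double xs[] = { -1.5, -255.0, -2.0e-23, -1.0e300, -4.9e-324 };
>         const double ys[] = { 3.0, 4.0, -3.0, 25.0, 1000.0, -1001.0 };
>         for (unsigned i = 0; i < sizeof xs / sizeof *xs; i++)
>             for (unsigned j = 0; j < sizeof ys / sizeof *ys; j++)
>                 /* e.g. pow(-1.0e300, 1000.0) overflows to +INF (even Y);
>                    pow(-1.0e300, -1001.0) underflows to -0 (odd Y < 0) */
>                 printf("pow(%g, %g) = %g\n", xs[i], ys[j], pow(xs[i], ys[j]));
>         return 0;
>     }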
>
> Why? Well, we need tests so that we know the new asm is doing the right
> thing now and that it stays that way if we later need to change the
> generated code.
>
> 2) Validation tests which verify that correct results are obtained for a
> representative range of non-special case input pairs that exercise the
> main routine.
>
> The same requirement applies here as for the -anything ** integer case
> above: the tests should attempt to exercise as many paths through the
> code as possible by selecting a range of input pairs that will guarantee
> full coverage.
>
> Why? Well, clearly this is not going to give us any confidence about
> correctness for the whole range of potential inputs, but exercising all
> paths through the code in at least one case will find any obvious
> blunders in the current code and ensure we don't introduce any if we
> later need to change the generated code.
>
> 3) Performance tests which compare performance of this implementation
> against the performance obtained using the C code for a selection of
> different ranges of possible input pairs, each such range benchmarked
> independently.
>
> The problem with the existing 'benchmark' is that it doesn't cover a lot
> of the potential range of inputs and it lumps all computations together
> even when they might be exercising different paths through the main
> routine. This might be worth retaining to compute some sort of average
> case (although the selectivity in the input values means it is unlikely
> to do that very well). What it does not do is identify specific cases
> that exercise individual paths through the two code versions, where the
> new one may perform better (or, maybe, worse) than the old one by some
> amount that varies path by path. i.e. we don't really know the best or
> worst case improvement of this algorithm.
>
> So, for example: it would be useful to know what the performance
> difference is when exercising the algorithm for specific X values with a
> range of low integral Y values (say 0 < Y < 30, or -5 < Y < 5). It would
> probably be useful to test what happens when using subnormal X or very
> large X, paired with very small Y. It might also be interesting to know
> what happens when using values of X close to 1 and very large Y.
> Clearly, there is a limited number of paths through each
> implementation's main routine so the test cases should probably be
> chosen to ensure specific paths are taken, highlighting what happens
> when we exercise that path.
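>
> (As a C sketch of per-range benchmarking -- the ranges, counts and
> harness are illustrative, not part of the patch -- each range gets its
> own timed loop so per-path wins and losses stay visible:)
>
>     #include <math.h>
>     #include <stdio.h>
>     #include <time.h>
>
>     static void bench(const char *label, const double *x, const double *y,
>                       int n, int reps) {
>         volatile double sink = 0.0;          /* keep the pow() calls alive */
>         struct timespec t0, t1;
>         clock_gettime(CLOCK_MONOTONIC, &t0);
>         for (int r = 0; r < reps; r++)
>             for (int i = 0; i < n; i++)
>                 sink += pow(x[i], y[i]);
>         clock_gettime(CLOCK_MONOTONIC, &t1);
>         printf("%-26s %.3f s\n", label,
>                (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9);
>     }
>
>     int main(void) {
>         enum { N = 1024 };
>         static double x[N], y[N];
>         for (int i = 0; i < N; i++) {        /* positive X, low integral Y */
>             x[i] = 0.1 + i;
>             y[i] = (double)(i % 30);
>         }
>         bench("positive X, 0 <= Y < 30", x, y, N, 10000);
>         for (int i = 0; i < N; i++) {        /* X close to 1, very large Y */
>             x[i] = 0.999 + i * (0.002 / N);
>             y[i] = 1.0e15 + i * 1.0e12;
>         }
>         bench("X near 1, very large Y", x, y, N, 10000);
>         return 0;
>     }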
>
> Why? The performance gain reported by the current benchmark is not
> necessarily revealing of what happens in the most interesting cases and
> is not necessarily representative of what we might expect in real cases.
> An argument that derives from clearer benefits in some common, special
> cases (e.g. relatively low integer powers) or in less common cases where
> performance might still really matter for certain specific algorithms
> (e.g. subnormal handling) might make a more compelling case for taking on
> the maintenance burden. Of course, a much smaller gain for those cases
> might sway the argument the other way.
>
> 4) The comments in the code explain some of what is happening but they
> do not explain the algorithm being used clearly enough for it to be
> possible to understand and maintain the code as is.
>
> This problem is mitigated to a small degree by the inclusion of the
> original C code but, unfortunately, that code is not really good enough.
> It presents a specific encoding of the mathematics which justifies this
> algorithm but one which has significant differences to the encoding
> employed in the generated asm. There are several things at play here so
> let me articulate them separately.
>
> Firstly, the C code is operating on a different data representation to
> the generated asm. It uses two pairs of 32-bit signed words, hx, hy and
> ix, iy, for the high 32 bits of X and Y and the high 31 bits of X and Y
> respectively, and a pair of unsigned words, lx, ly, for the low 32 bits
> of X and Y. The generated code operates on 64-bit values in 64-bit
> registers.
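>
> (In C terms the split is essentially the following sketch:)
>
>     #include <stdint.h>
>     #include <string.h>
>
>     /* fdlibm-style decomposition of a double into the words the
>        original C code works with */
>     static void split(double x, int32_t *hx, int32_t *ix, uint32_t *lx) {
>         uint64_t bits;
>         memcpy(&bits, &x, sizeof bits);
>         *hx = (int32_t)(bits >> 32);  /* high 32 bits, signed */
>         *ix = *hx & 0x7fffffff;       /* high 31 bits, i.e. |x|'s high word */
>         *lx = (uint32_t)bits;         /* low 32 bits, unsigned */
>     }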
>
> That does not in itself invalidate comparisons of the two algorithms,
> given that all the relevant bits contained in ix, iy, hx, hy, lx and ly
> are contained in the 64-bit X and Y registers. However, since the asm
> instructions in the generated code operate on the whole, signed
> enchilada while the C code operates independently on the upper and lower
> halves in signed or absolute versions, this causes a significant
> divergence in how individual mathematical steps are encoded as C vs asm
> instructions. If nothing else, comparisons differ as to their signedness
> and often operate in two stages in the C code instead of one in the asm.
> Tests also get performed in different orders as a result of this
> different representation. That makes the C code very far from clear as
> an explanation of what the asm is doing.
>
> Secondly, the decision tree used by the generated asm to handle the
> various corner cases mentioned in the header has been significantly
> re-arranged relative to the original. The C code tests for and handles
> the corner cases in a very specific order that matches their
> declaration, i.e. it first checks for
>
>    y == +/-0 ==> return special result
>    y == NaN || x == NaN ==> return special result
>
> then it computes yisint in { non_integer, odd_integer, even_integer }
> then it tests for and handles these corner cases
>
>    y == +/-Inf, x == 1 ==> return special result
>    y == +/-Inf, x < 1 ==> return special result
>    y == +/-Inf, x > 1 ==> return special result
>
>    y == +/-1 ==> return special result
>    y == 2 ==> return special result
>    y == 0.5 ==> return special result
>
> and so on
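>
> (Condensed into C, that ladder looks roughly like the sketch below; it
> is a paraphrase, not the verbatim source, and the yisint computation it
> elides is the one discussed further down:)
>
>     #include <math.h>
>
>     double pow_corner_cases(double x, double y) {
>         if (y == 0.0) return 1.0;              /* y == +/-0 */
>         if (isnan(x) || isnan(y)) return NAN;  /* NaN operand */
>         /* (the original computes yisint here) */
>         if (isinf(y)) {
>             if (fabs(x) == 1.0) return NAN;    /* |x| == 1, y == +/-Inf */
>             if ((fabs(x) > 1.0) == (y > 0.0)) return INFINITY;
>             return 0.0;
>         }
>         if (y == 1.0)  return x;
>         if (y == -1.0) return 1.0 / x;
>         if (y == 2.0)  return x * x;
>         if (y == 0.5 && x >= 0.0) return sqrt(x);
>         return pow(x, y);                      /* ...and so on, then the
>                                                   main computation */
>     }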
>
> The generated code uses an alternative, branching decision tree which, I
> admit, looks to be an improvement. However, as a result the original C
> code now fails to provide an explanatory algorithm for what has been
> implemented.
>
> I'll grant that the block comments at each branch point do make it clear
> what cases are being considered in each subtree of the generated code.
> However, it would help enormously if the original C code was
> restructured to provide a new algorithm that reflected the alternative
> control flow that has actually been implemented (if preferred, the
> alternative algorithm could be provided as pseudo-code).
>
> This would have two benefits. Firstly, it would provide a much better
> summary of what the assembly code is doing. Secondly, it would allow the
> new control flow to be compared with the old one (both versions are
> needed), making it clearer whether the revised control flow really
> catches and handles the various corner cases in a way that is equivalent
> to the original C code.
>
> Providing a new C or pseudo-code algorithm would also help with the
> remaining serious difficulty: clarifying the asm in cases where the
> tests made in the original C code are very different to those made in
> the asm. Let's consider an example that makes this very clear.
>
> In the generated code a branch to X_IS_SPECIAL gets taken when |X| is
> NaN or Inf and Y is not +/-0, 1, NaN or +/-Inf.
>
>      // Assume Y == NaN/INF already checked
>      bind(X_IS_SPECIAL);
>        cmp(zr, X, LSL, 12);
>        br(NE, RET_NAN);
>        // X is +/- INF
>        // Already covered: Y == +/-0, 1, NaN and INF
>        // Y > 0 => return +INF
>        // Y < 0 => return +0
>        tbnz(X, 63, X_IS_MINF_Y_IS_NUMBER);
>        . . .
>
> If the tbnz branch to X_IS_MINF_Y_IS_NUMBER is taken (X is -INF) we
> reach the following code:
>
>      bind(X_IS_MINF_Y_IS_NUMBER);
>        // tmp2 = biased exponent of Y
>        rbit(tmp4, rscratch2);
>        orr(tmp4, tmp4, 0x800);
>        mov(tmp5, 52);
>        clz(tmp4, tmp4);
>        sub(tmp10, tmp2, 1023);             // unbiased exponent
>        sub(tmp4, tmp5, tmp4);
>        cmp(tmp10, tmp4);
>        br(EQ, X_IS_MINF_Y_IS_ODD);
>        // X == -INF && "Y < 0 but not a finite odd integer" => return +0
>        // X == -INF && "Y > 0 but not a finite odd integer" => return +INF
>        tbz(rscratch2, 63, RETURN_PINF);
>        fmovd(v0, 0.0d);
>        ret(lr);
>
> Now the C code for handling this case has already computed whether or
> not Y is an even int, an odd int or not integral using a very different
> test. First it compares iy against 0x43400000 to see if the Y value has
> a biased exponent > 1023 + 52 (= 0x434) or, equivalently, an unbiased
> exponent > 52. This guarantees that Y is even, whatever is in the 52
> fractional bits. Failing that it compares iy against 0x3ff00000 i.e.
> sees if the exponent of Y is >= 0 (obviously, necessary for the value to
> be integral) and then does some clever bit shifting on ly or iy using
> that exponent to see if there are any bits in the lower or upper word
> that would make y non-integral or odd.
>
> Clearly, the code above is using a somewhat different algorithm even
> though it can be correlated with some of the C code (for one thing, it
> is only checking for y both odd and integral).
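>
> (For reference, the fdlibm-style yisint computation, paraphrased from
> the original C rather than quoted verbatim, is roughly:)
>
>     #include <stdint.h>
>
>     /* 0 = y not integral, 1 = odd integer, 2 = even integer */
>     static int yisint(int32_t hx, int32_t iy, uint32_t ly) {
>         int result = 0;
>         if (hx < 0) {                      /* only needed when x < 0 */
>             if (iy >= 0x43400000) {        /* |y| >= 2^53: always even */
>                 result = 2;
>             } else if (iy >= 0x3ff00000) { /* |y| >= 1 */
>                 int k = (iy >> 20) - 0x3ff;   /* unbiased exponent, 0..52 */
>                 if (k > 20) {              /* last integer bit in low word */
>                     uint32_t j = ly >> (52 - k);
>                     if ((j << (52 - k)) == ly) result = 2 - (int)(j & 1);
>                 } else if (ly == 0) {      /* last integer bit in high word */
>                     int32_t j = iy >> (20 - k);
>                     if ((j << (20 - k)) == iy) result = 2 - (int)(j & 1);
>                 }
>             }
>         }
>         return result;
>     }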
>
> The rbit is used to allow trailing zeroes in the fractional part to be
> counted using clz (count leading zeroes). The orr of bit 11 into the
> reversed value of Y ensures that clz never counts leading zeroes beyond
> bit 11 -- equivalently, it only counts trailing zeroes in the 52-bit
> fractional part of Y. So, if the value returned by clz -- call it T --
> is the number of trailing zero bits in the fractional part then,
> clearly, for Y to be odd the unbiased exponent has to be exactly
> (52 - T), i.e. to verify that Y is odd and integral we need to test that:
>
>    (biased_exp - 1023) == 52 - lowest_fraction_bit_index(Y) // in 0..52
>                        == 52 - min(trailing_zeroes(Y), 52)
>                        == 52 - trailing_zeroes(Y | (1 << 52))
>                        == 52 - leading_zeroes(reverse(Y) | (1 << 11))
>
> Obviously, a minor variant of this can be used to detect even integer
> values of Y and non-integer Y.
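>
> (A minimal C model of the odd-integer check: a count-trailing-zeroes
> builtin (GCC/Clang __builtin_ctzll here) stands in for the rbit/clz
> pair, and or-ing in bit 52 of the unreversed value plays the same
> capping role as the orr of bit 11 after rbit:)
>
>     #include <stdint.h>
>     #include <string.h>
>
>     /* nonzero iff finite y is an odd integer; mirrors the asm test
>        unbiased_exp == 52 - min(trailing_zeroes(fraction), 52) */
>     static int y_is_odd_integer(double y) {
>         uint64_t bits;
>         memcpy(&bits, &y, sizeof bits);
>         int64_t unbiased_exp = (int64_t)((bits >> 52) & 0x7ff) - 1023;
>         int t = __builtin_ctzll(bits | (1ULL << 52)); /* capped at 52 */
>         return unbiased_exp == 52 - t;
>     }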
>
> Why? An explanation of what code like the above example is doing and why
> it is correct is critical to allowing whoever has to maintain it to
> understand how it works without having to put in days of analysis
> whenever they have to look at it. Without that, this will be a fly trap
> (exactly as I explained wrt the sin and cos code). The original C code
> provides almost no help here. A pseudo-code algorithm for the asm is the
> minimum that is needed. Tricks like the one above require an explanatory
> comment that provides the arithmetical/mathematical background explaining
> what the pseudo-code is doing and why.


