RFR: 8189107 - AARCH64: create intrinsic for pow

Fri Sep 7 12:58:59 UTC 2018

Hi Dmitrij

On 22/08/18 11:04, Andrew Dinn wrote:
> Thank you for the revised webrev and new test results. I am now working
> through them. I will post comments as soon as I have given the new code
> a full read and assessed the new results. I am afraid that may take a
> day or two, for which delay advance apologies.

This review has taken a great deal longer than expected. I am sorry but
that is because the documentation for the code you have submitted is
still seriously inadequate and I have had to put a lot of work into
revising it before I can fully review the code.

I am still finishing off that last task but I wanted to start providing
you with some feedback and also to enlist your help in checking that my
revisions are correct. I plan to provide feedback in 3 stages to match
the 3 steps in the review that I am doing as follows:

1) Correct the original 'algorithm' you started from

2) Correct the 'modified algorithm' that is meant to describe the
behaviour of your code

3) Propose any necessary corrections/improvements to the generated code

So, let's start with step 1.

The 'original' algorithm located in file macroAssembler_aarch64_pow.cpp
is really just a fragment of C code with a few missing elements (e.g.
the origin of values P1, P2, ... is not explained, hugeX, tiny are not
defined). Although this code as the virtue that it is known to be
correct (or at least has been verified by long use and the eyes of
experts in numerical computation) it still fails to provide important
information about what the 'algorithm' is supposed to do. That
information is critical for anyone coming to it fresh to be able
understand what is happening.

The first omission is several pieces of background mathematics that are
neither explained in the code nor referenced. The mathematics includes
the formulae on which the algorithm is based and the numerical
approximation to these formulae that is employed to define the
algorithm. This is needed to explain /how/ and /why/ a) the two
different computations of log2(x) and b) the computation of exp(x) are
performed as they are and to justify that the results are valid.

The second omission is detailed descriptions of what most of the more
complex individual steps in the algorithm do. Many of the logic,
floating point and branching operations which compute intermediate
results are extremely opaque. This is particularly so for the steps
which manipulate bit patterns in the long representation of the fp
values being used. However, some of the straight fp arithmetic is also
highly problematic.

The other thing I think needs to be made clearer is the relationship
between the various special case return points in the code and the
special case rules they relate to. This is not so critical for the
original algorithm because the C code at least has a regular and
standardised control flow. However, labelling the exit paths is still
useful here and will be much more useful if used both here and in the
modified algorithm (and we'll come to that later).

I have rewritten the algorithm to achieve what I think is needed to
patch these omissions. The redraft of this part of the code is available
here:

  http://cr.openjdk.java.net/~adinn/8189107/original_algorithm.txt

I assume you are familiar with the relevant mathematics and how it has
been used to derive the algorithm. If so then I would like you to review
this rewrite and ensure that there are nor mathematical errors in it. I
would also like you to check that the explanatory comments for of the
individual steps in the algorithm do not contain any errors.

If you are not familiar with the mathematics then please let me know. I
need to know whether this has been reviewed bu someone competent to do so.

n.b. one little detail you might easily miss. I removed lg2, lg2_h and
lg2_l from the first table of constants as neither log(x) algorithm
needs them (it relies on ivln2). I renamed the entries in the second
table from LG2, etc to /ln2/, etc and change the name accordingly at
point of use. The computation of exp(x) actually does need ln2. One of
the code changes is to remove these redundant entries from your table
pow_coeff1.

Ok, as for the next 2 steps will post a follow-up to deal with them once
I have completed my review. That will include a heavily revised version
of your 'modified algorithm' (which is still in progress) plus
suggestions for changes to the code that I have found along the way.
Just as a preliminary I'll summarize what is wrong below.

Note that I have not yet found any errors in how the generated code
implements the mathematics but I am still not happy with it because it
is extremely unclear. Correcting the 'modified algorithm' is a
necessary, critical step to improvimg the clarity of the code.

So, in overview, what is wrong with your 'modified algorithm'. Well, the
thing that is immediately obvious is that it is /not/ actually the
algorithm you have employed. It is simply a mangled version of the C
code you started from that bears only a tenuous relation to the code
structure it is supposed to summarize. Now, I'm happy for you to use C
to model the generated code if possible and, in fact, am in the process
of writing a proper algorithm that looks as much like C code as possible
/but/ also actually describes what your generated code does. The problem
is that what you have written is not only /not/ C it is also i)
incoherent, ii) retains elements from the original code that don't exist
at all in the generated code and iii) omits important elements of the
generated code.

So, firstly, let's deal with the problem as it relates to control flow.
Your 'modified algorithm' includes various tags mentioning the word
'label' suggesting that some transfer of control is to be effected.
However, these are tacked onto statement blocks connected via 'if
(cond)' tests or 'else' keywords that are meant to imply some
alternative control flow. Essentially, your generated code relies on
gotos which do not fit a standard if/else flow model and you have tried
to bodge some sort of goto model on top of the original valid, gotoless
C control flow with no clear definition of how that is meant to work.
Honestly, if your generated code uses a goto control flow then your C
algorithm is going to have to do the same in order to clearly summarize
what the code actually does.

The second major problem is one I pointed out in my earlier note, i.e.
that the data values described in the 'modified algorithm' do not
correctly match the ones operated on in your generated code. Your
algorithm lists many redundant values used in the original algorithm
(e.g. ix, iy, ax, yisint) even though your code doesn't ever explicitly
construct most of those values (n.b. this but not just limited to the 32
bit half-word quantities). Instead the code frequently pulls the
relevant value, as needed, out of other data that it does construct and
holds in registers -- sometimes across control branches. At other times
it performs an equivalent operation on a different, related data value.
Your response to my request was to add comments which labelled some of
these on-the-fly created values or alternative values with the original
names but that ignores the fact that the names and the values referenced
in the comments do not actually match.

Contrariwise, a lot of the values the code does actually operate on are
not mentioned in the algorithm. Indeed, it is worse than that because
they are not coherently identified even in the generated code. Data
items stored in registers are referred to using the utterly redundant
symbolic aliases tmp0, tmp2, etc for registers r0, r1 etc. What is worse
the same meaningless symbolic names get reused for completely different
data items.

For example, at one point tmp2 identifies the exponent of y stored in r2
and later it identifies the absolute value of y also stored in r2,
overwriting the exponent. Your algorithm really ought to mention values
like exp_y or ay (or even |y|) for these cases and the code should
correspondingly define exp_y and ay as an alias for register r2. These
meaningful names should then be used when loading the constructed value
into a register and at every subsequent point of use where that
constructed value is valid.

This is not all that is wrong with the 'modified algorithm' but it is
enough to make it not just useless but worse than useless. What you have
written provides a hand-wave towards what the code does that fails to
summarize it with any accuracy or clarity and equally fails to clarify
the difference between it and the C code you started from. That only
makes the whole picture less clear not more so.

As I said, I will provide a better version of the 'modified algorithm'
in a follow-up and then discuss possible code changes. Please review the
linked file above while I prepare that.

regards,

Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander