RFR: AArch64: Implementing address lowering to make use of immediate address mode

Tue Feb 28 11:44:26 UTC 2017

I have patched (a slightly out of date version of) Graal to allow
AArch64 to make use of scaled and unscaled immediate addressing. This
generates much better code but the implementation is 'wrong' in several
important respects. A pull request including the change is available here

  https://github.com/graalvm/graal-core/pull/258

I have no wish for this change to be included as is - I posted it merely
to enable discussion of what is really needed.

The patch deliberately bypasses the normal AddressLowering phase (I'll
explain why in a second). Instead it intercepts translation of addresses
during the generate phase. Basically, it attempts to improve addresses
initially generated as an OffsetAddress (i.e. supplied with 2 Values,
base and index) when the index is an integer constant whcih can be
embedded as an immediate displacement (I guess it ought to handle the
reverse case where base is a constant and index a general Value but
nothing appears to be generating addresses with constants in that
order). It does so by 'nulling' the index saving the constant as a
displacement and resetting the addressing mode from register offset to
scaled or unscaled immediate.

So, before applying the patch addresses with constant index were
'improved' by replacing the constant node with a constant load and
relying on the use of register_offset addressing mode to do the actual
memory operation. The result is code like this:

  movz x3, #0xc           # constant load of offset value
  movk x3, #0x0, lsl #16
  ldr w1 [x2, x3]         # (1) load via register_offset addressing

The patch tries instead to transform the node to use either scaled or
unscaled immediate addressing resulting, respectively, in code like this:

  ldr w1, [x2,#12]        # (2) load via scaled immediate addressing

or (let's assume we had a negative offset mandating an unscaled load)

  ldr w1, [x2, #-12]      # (3) load via unscaled immediate addressing

There are two related reasons why this has been done in the Generate
phase rather than in the AddressLowering phase. I present them both in
turn below in order to pose some questions about why AddressLowering is
currently done as is and to suggest what might be needed to fix it.

I have also include a critique of the way a derived LIRKind is computed
for the AddressValue returned by method generate(). I am unclear why it
is currently being computed as is and also unclear what would be an
appropriate derived value when immediates are encoded. Advice or
comments regarding the patch and the following critique would be very
welcome.

Scaling depends on the transfer size
------------------------------------

The first problem is that for scaled memory addressing the decision as
to whether or not a given constant index can be embedded as an immediate
constant in the address node depends upon how the address is used by one
or more memory ops.

In code example (2) which uses scaled addressing above the load
instruction is transferring a 32 bit datum (as indicated in the assembly
code by the target register being named w1). The load address is the 64
bit value in r1 (as indicated in the assembly code by the base register
being named x1) plus constant offset 12.

A scaled immediate memory op can embed a 12 bit unsigned displacement.
However, the unit size for the embedded value is determined by the size
of the accessed datum. So, for a 32 bit load/store the set of offsets
which are valid is {4, 8, ... 4 * (2^12 - 1)}. For a byte load the set
of valid offsets is {1, 2, ... (2^12-1)}.

Obviously, as the name clearly indicates, there is no dependency on
transfer size when using unscaled addressing. Memory ops employing this
latter addressing may embed any signed 9-bit displacement which is
always counted in a unit size of bytes. So, translation to use unscaled
addressing in the AddressLowering phase doesn't present any great
problem. However, generating the best code requires implementing both modes.

The upshot is that in order to determine whether a constant index can be
replaced by an immediate node the lowering code needs to investigate
usages of the address node and establish that all usages employ a unique
transfer size.

As you can see in the pull request, the patch includes a method
getUsageSize() which does that job. It traverses all usages and returns
either a single platform kind which defines a transfer size common to
all usages or null if no such unique common kind exists. This leads to
the second problem. (n.b. I have finessed Prefetch usages for now
because the current AArch64 Graal does not generate code for prefetch.
That will need fixing when it is implemented).

Identification of the platform kind requires a generator
--------------------------------------------------------

The above requirement explains why the lowering code is implemented as
part of the Generate phase instead of in the AddressLowering phase.
Identifying the PlatformKind associated with a specific usage of the
AddressNode requires translating the stamp of each usage to an LIRKind.
That mandates availability of a suitable instance of LIRKindTool
(actually an AArch64LIRKindTool) which parameterizes the call to
getUsageSize(). The KindTool must be obtained from the
AArch64LIRGenerator tool which is obtained form the NodeLIRBuilderTool
passed to the AArch64AddressNode generate() method.

I don't doubt that here is some sneaky way of ensuring that the
AArch64AddressLowering instance obtains access to an AArch64LIRGenerator
(and, thence, an AArch64LIRKindTool) behind the scenes. However,
implementing a solution like that does not really seem to me to be the
correct solution. There is an assumption in the current code that
address lowering can be done prior to the Generate phase without needing
to identify whatever transfer datum and/or transfer size is at stake
when the address is being used. That's a broken assumption and I would
prefer to fix it.

One solution would be to have the generic code precompute the common
usage LIRKind and pass it to the implementation class which extends
AddressLowering. Alternatively, the abstract parent class could provide
a convenience method allowing a suitable PlatformKind to be computed.
Either way this means that the parent needs to know about the Arch and
know how to obtain an arch-specific KindTool. I'm happy to look into
proposing a solution along these lines but I'd like to canvas here for
comments before doing so.

Another alternative would be to perform address lowering at a later
stage, perhaps in the back end even. This would perhaps require doing
some of the transforms done in e.g. AMD64AddressLowering as generic
transforms and having a generic AddressNode which allowed for a
non-constant index and/or constant displacement. I'm also happy to
consider how this might work but I'd probably need more advice about how
to go about this.

The LIRKind of the resulting AddressValue -- is it correct?
-----------------------------------------------------------

The patch in the pull request attempts to follow the logic of the
preceding code in deriving an LIRKind for whatever AddressNode it
constructs, whether using register offset or immediate addressing.
However, I am not sure what the original logic is.

In the original code for AArch64AddressNode.generate() the case where
the either base or index was null is treated by setting, respectively,
baseValue or indexValue to Value.ILLEGAL. Later in that original code
this implies that indexReference is also set to Value.ILLEGAL. The
LIRKind for the final resulting AArch64AddressValue is computed by executing

  LIRKind kind = LIRKind.combineDerived(tool.getLIRKind(stamp()),
                                        baseReference, indexReference);

So, when base is, say, a register and index is null this represents a
direct reference through a register with no offset. I would have
expected that in this circumstance there was some coherent way of
deriving the LIRKind of the address from the LIRKind associated with
base. However, because index is null, so indexValue is Value.ILLEGAL and
hence indexReference is also Value.ILLEGAL. combineDerived handles this
case by returning an unknown reference derived from the stamp(). By
contrast, if indexReference was null then combineDerived would compute
the combined reference by calling makeDerivedReference(base).

My patch follows this code by starting off with indexValue = null then
resetting it to Value.ILLEGAL either if it is null or if it is possible
to intervene and replace a constant index with a displacement.
Otherwise, when we have a non-null index, indexValue is computed as
before by executing

  indexValue = tool.asAllocatable(gen.operand(index));

I have also preserved the original special case processing for
AddressingMode.IMMEDIATE_UNSCALED where the indexReference is derived by
executing

  indexReference = LIRKind.derivedBaseFromValue(indexValue);

although I'll note that in cases where that addressing mode is
introduced by replacing a constant index with an unscaled displacement
indexValue will be Value.ILLEGAL and hence the indexReference will be
returned as  Value.ILLEGAL which seems to negate the purpose of that
special case handling.

I am suspicious that all of this computation seems to be rather
redundant anyway. At the point of use of an Address to encode a Load or
Store (or possibly a Prefetch) the LIRKind of the address appears to be
ignored and instead the LIRKind of the transfer datum is used to
generate the load. Is this computation of the derived actually achieving
anything important?

regards,

Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander