RFR: AArch64: Implementing address lowering to make use of immediate address mode

Thomas Wuerthinger thomas.wuerthinger at oracle.com
Mon Mar 20 10:42:27 UTC 2017


Make sure to be on the latest version. A few days ago the address lowering phase was moved to run after the new fix reads phase, which means there is no need to deal with pi nodes.

- thomas 

> On 20 Mar 2017, at 09:43, Andrew Dinn <adinn at redhat.com> wrote:
> 
> Hi Tom,
> 
> Thanks for the advice and apologies for not replying sooner -- I have
> been chasing a very nasty bug in the (non-graal) jdk8u x86 JVM code for
> the last few weeks.
> 
> I'll attempt to follow your advice and write an alternative lowering
> phase for use by AArch64. However, I'd also like to see if there is a
> better approach that either postpones lowering or, at least, unifies it
> for the different architectures. I'll let you know of progress on both
> fronts as/when available.
> 
> regards,
> 
> 
> Andrew Dinn
> -----------
> Senior Principal Software Engineer
> Red Hat UK Ltd
> Registered in England and Wales under Company Registration No. 03798903
> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
> 
>> On 06/03/17 23:31, Tom Rodriguez wrote:
>> 
>> 
>> Andrew Dinn wrote:
>>> Hi Tom,
>>> 
>>> 
>>>> On 01/03/17 00:39, Tom Rodriguez wrote:
>>>> There's nothing about the AddressLoweringPhase that says the mapping of
>>>> the old OffsetAddressNode to a machine-dependent address mode must be
>>>> one to one.  You are free to examine the usages and generate a distinct
>>>> address node for each use.  GVN will common any duplicates back
>>>> together.  You will still need to select the right forms for the
>>>> backend, so it doesn't address all the problems you mention.
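>>>> 
>>>> In other words, the per-usage rewrite can look roughly like this inside
>>>> the lowering code (a sketch only -- the AArch64AddressNode constructor
>>>> shown here is made up for illustration, not the node from your patch):
>>>> 
>>>>    // Snapshot the usages first, since the loop mutates them.
>>>>    for (Node usage : offsetAddress.usages().snapshot()) {
>>>>        // Pick the addressing mode appropriate for this particular usage,
>>>>        // then build a lowered address node just for it.
>>>>        AArch64AddressNode lowered = graph.unique(
>>>>            new AArch64AddressNode(offsetAddress.getBase(),
>>>>                                   offsetAddress.getOffset()));
>>>>        // graph.unique() folds identical nodes together, so usages that
>>>>        // end up with the same lowered form share a single node.
>>>>        usage.replaceFirstInput(offsetAddress, lowered);
>>>>    }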
>>> 
>>> Thanks very much for replying. I'm not quite sure I follow what you are
>>> saying but let me try to confirm that I am on the right track.
>>> 
>>> I do understand that each usage of some given OffsetAddress can be
>>> substituted /per usage/ by an appropriate AArch64AddressNode without
>>> fear of introducing unnecessary replicas. The thrust of your comment
>>> seems to be that my current patch doesn't actually need to detect a
>>> single, unique LIRKind for all usages in order to be able to embed a
>>> constant index as displacement. It could instead traverse all usages
>>> and, where possible, replace the OffsetAddress referenced from the usage
>>> with a lowered AddressNode employing a displacement constructed
>>> according to the LIRKind of that usage.
>>> 
>>> Is that what you are suggesting?
>> 
>> Yes.
>> 
>>> If so then I agree that I might perhaps
>>> be able to tweak the patch to do this. However, that's really somewhat
>>> incidental to the concerns I originally raised, perhaps even providing
>>> an a fortiori argument for addressing them. The AddressLowering phase is
>>> given an AddressNode to lower, a node which may be shared by multiple
>>> usages. It is not given a usage (i.e. a node using the AddressNode) and
>>> asked whether it wants to provide a lowered address node specific to
>>> that usage.
>>> 
>>> So, the problem remains that the lowering model AddressLowering has
>>> adopted is based on a misguided notion that lowering can be scheduled
>>> per AddressNode without reference to usage. I really want to try to fix
>>> the model not the current patch.
>> 
>> Well the model was adequate for the platforms we supported at the time
>> which is why I think it was chosen.  The current
>> AddressLoweringPhase/AddressLowering logic is simple enough that I'm not
>> sure it really makes sense to make it complex just to support AArch64.
>> I'd probably just write a custom phase to do what needs to be done, and
>> factor out the creation of the address lowering phase in
>> HotSpotSuitesProvider so that an AArch64HotSpotSuitesProvider could
>> provide its own completely custom implementation.  I'd probably
>> directly pass in the required AArch64HotSpotLIRKindTool as well, which
>> you can just construct by hand.  It could be that the kind tool should
>> be part of the HotSpotProviders instead of buried in the LIRGenerator,
>> but it hasn't needed to be more visible until now.  It seems like you
>> would have everything required for proper code generation then.
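>> 
>> Sketched very loosely, the wiring could end up looking something like
>> this (the hook name and the phase class here are placeholders for
>> whatever gets factored out, not existing Graal API):
>> 
>>    public class AArch64HotSpotSuitesProvider extends HotSpotSuitesProvider {
>> 
>>        // Assumes the creation of the address lowering phase has been
>>        // pulled out into an overridable hook, as described above.
>>        @Override
>>        protected Phase createAddressLoweringPhase() {
>>            // Construct the kind tool by hand and hand it directly to a
>>            // completely custom AArch64 lowering phase.
>>            return new AArch64AddressLoweringPhase(new AArch64HotSpotLIRKindTool());
>>        }
>>    }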
>> 
>>> 
>>> If your point is that AddressLowering ought, perhaps, to be rewritten to
>>> operate by considering each usage in turn then that would indeed be a
>>> better model.
>>> 
>>> Apologies if I am barking up the wrong tree entirely here and have not
>>> grasped what you are getting at.
>>> 
>>> Of course, my comments about the need to have access to the usage kind
>>> (and, hence, to a KindTool) and my failure to understand how a derived
>>> LIRKind should be computed for any lowered AddressNode also still stand
>>> in the hope of some comment or explanation which would, of course, be
>>> gratefully received.
>> 
>> As far as the LIRKind of an AddressValue goes, most of the time it is
>> actually completely unused.  The purpose of the derived information is
>> for tracking when a pointer into the middle of a Java object has been
>> materialized as a value so that it can be described to the garbage
>> collector, or for error checking that a derived value isn't live across
>> a safepoint.  The LIRKind of an AddressValue is the LIRKind of the
>> address as a computation, but the LIRKind of the memory operation is the
>> LIRKind of the datum being transferred, so they are only loosely
>> related.  So the only case where the LIRKind of an address value will
>> ever appear as a value is for an LEA-style operation.
>> 
>> I suspect I haven't addressed all your concerns about the LIRKind for
>> addresses so please follow up with any clarifying questions.
>> 
>> tom
>> 
>>> 
>>> regards,
>>> 
>>> 
>>> Andrew Dinn
>>> -----------
>>> 
>>>> Andrew Dinn wrote:
>>>>> I have patched (a slightly out of date version of) Graal to allow
>>>>> AArch64 to make use of scaled and unscaled immediate addressing. This
>>>>> generates much better code but the implementation is 'wrong' in several
>>>>> important respects. A pull request including the change is available
>>>>> here
>>>>> 
>>>>>    https://github.com/graalvm/graal-core/pull/258
>>>>> 
>>>>> I have no wish for this change to be included as is - I posted it
>>>>> merely
>>>>> to enable discussion of what is really needed.
>>>>> 
>>>>> The patch deliberately bypasses the normal AddressLowering phase (I'll
>>>>> explain why in a second). Instead it intercepts translation of
>>>>> addresses
>>>>> during the generate phase. Basically, it attempts to improve addresses
>>>>> initially generated as an OffsetAddress (i.e. supplied with 2 Values,
>>>>> base and index) when the index is an integer constant which can be
>>>>> embedded as an immediate displacement (I guess it ought to handle the
>>>>> reverse case where base is a constant and index a general Value but
>>>>> nothing appears to be generating addresses with constants in that
>>>>> order). It does so by 'nulling' the index, saving the constant as a
>>>>> displacement, and resetting the addressing mode from register offset to
>>>>> scaled or unscaled immediate.
>>>>> 
>>>>> So, before applying the patch, addresses with a constant index were
>>>>> 'improved' by replacing the constant node with a constant load and
>>>>> relying on the use of register_offset addressing mode to do the actual
>>>>> memory operation. The result is code like this:
>>>>> 
>>>>>    movz x3, #0xc           # constant load of offset value
>>>>>    movk x3, #0x0, lsl #16
>>>>>    ldr w1, [x2, x3]        # (1) load via register_offset addressing
>>>>> 
>>>>> The patch tries instead to transform the node to use either scaled or
>>>>> unscaled immediate addressing resulting, respectively, in code like
>>>>> this:
>>>>> 
>>>>>    ldr w1, [x2,#12]        # (2) load via scaled immediate addressing
>>>>> 
>>>>> or (let's assume we had a negative offset mandating an unscaled load)
>>>>> 
>>>>>    ldr w1, [x2, #-12]      # (3) load via unscaled immediate addressing
>>>>> 
>>>>> There are two related reasons why this has been done in the Generate
>>>>> phase rather than in the AddressLowering phase. I present them both in
>>>>> turn below in order to pose some questions about why AddressLowering is
>>>>> currently done as is and to suggest what might be needed to fix it.
>>>>> 
>>>>> I have also included a critique of the way a derived LIRKind is computed
>>>>> for the AddressValue returned by method generate(). I am unclear why it
>>>>> is currently being computed as is and also unclear what would be an
>>>>> appropriate derived value when immediates are encoded. Advice or
>>>>> comments regarding the patch and the following critique would be very
>>>>> welcome.
>>>>> 
>>>>> Scaling depends on the transfer size
>>>>> ------------------------------------
>>>>> 
>>>>> The first problem is that for scaled memory addressing the decision as
>>>>> to whether or not a given constant index can be embedded as an
>>>>> immediate constant in the address node depends upon how the address is
>>>>> used by one or more memory ops.
>>>>> 
>>>>> In code example (2) above, which uses scaled addressing, the load
>>>>> instruction is transferring a 32 bit datum (as indicated in the
>>>>> assembly code by the target register being named w1). The load address
>>>>> is the 64 bit value in r2 (as indicated in the assembly code by the
>>>>> base register being named x2) plus constant offset 12.
>>>>> 
>>>>> A scaled immediate memory op can embed a 12 bit unsigned displacement.
>>>>> However, the unit size for the embedded value is determined by the size
>>>>> of the accessed datum. So, for a 32 bit load/store the set of valid
>>>>> offsets is {0, 4, 8, ... 4 * (2^12 - 1)}. For a byte load the set of
>>>>> valid offsets is {0, 1, 2, ... (2^12 - 1)}.
>>>>> 
>>>>> Obviously, as the name clearly indicates, there is no dependency on
>>>>> transfer size when using unscaled addressing. Memory ops employing this
>>>>> latter addressing may embed any signed 9-bit displacement which is
>>>>> always counted in a unit size of bytes. So, translation to use unscaled
>>>>> addressing in the AddressLowering phase doesn't present any great
>>>>> problem. However, generating the best code requires implementing both
>>>>> modes.
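>>>>> 
>>>>> To make the two ranges concrete, the admissibility checks amount to
>>>>> something like the following (a standalone sketch of the constraints
>>>>> just described, not the code in the pull request):
>>>>> 
>>>>>    // Scaled: unsigned 12 bit immediate counted in units of the transfer
>>>>>    // size, so the offset must be non-negative, size-aligned and in range.
>>>>>    static boolean fitsScaled(long offset, int transferSize) {
>>>>>        if (offset < 0 || offset % transferSize != 0) {
>>>>>            return false;
>>>>>        }
>>>>>        return (offset / transferSize) < (1 << 12);   // 0 .. 4095 units
>>>>>    }
>>>>> 
>>>>>    // Unscaled: signed 9 bit immediate, always counted in bytes.
>>>>>    static boolean fitsUnscaled(long offset) {
>>>>>        return offset >= -(1 << 8) && offset < (1 << 8);   // -256 .. 255
>>>>>    }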
>>>>> 
>>>>> The upshot is that in order to determine whether a constant index can
>>>>> be replaced by an immediate node the lowering code needs to investigate
>>>>> the usages of the address node and establish that all usages employ a
>>>>> single, common transfer size.
>>>>> 
>>>>> As you can see in the pull request, the patch includes a method
>>>>> getUsageSize() which does that job. It traverses all usages and returns
>>>>> either a single platform kind which defines a transfer size common to
>>>>> all usages or null if no such unique common kind exists. This leads to
>>>>> the second problem. (n.b. I have finessed Prefetch usages for now
>>>>> because the current AArch64 Graal does not generate code for prefetch.
>>>>> That will need fixing when it is implemented).
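>>>>> 
>>>>> The shape of that check is roughly as follows (a sketch rather than the
>>>>> actual patch code; kindOfAccess() stands in for whatever mapping from a
>>>>> usage's stamp to a PlatformKind the LIRKindTool gives us):
>>>>> 
>>>>>    static PlatformKind getUsageSize(AddressNode address, LIRKindTool tool) {
>>>>>        PlatformKind common = null;
>>>>>        for (Node usage : address.usages()) {
>>>>>            PlatformKind kind = kindOfAccess(usage, tool); // hypothetical helper
>>>>>            if (kind == null) {
>>>>>                return null;          // a usage we cannot classify
>>>>>            } else if (common == null) {
>>>>>                common = kind;
>>>>>            } else if (!common.equals(kind)) {
>>>>>                return null;          // no single common transfer size
>>>>>            }
>>>>>        }
>>>>>        return common;
>>>>>    }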
>>>>> 
>>>>> Identification of the platform kind requires a generator
>>>>> --------------------------------------------------------
>>>>> 
>>>>> The above requirement explains why the lowering code is implemented as
>>>>> part of the Generate phase instead of in the AddressLowering phase.
>>>>> Identifying the PlatformKind associated with a specific usage of the
>>>>> AddressNode requires translating the stamp of each usage to an LIRKind.
>>>>> That mandates availability of a suitable instance of LIRKindTool
>>>>> (actually an AArch64LIRKindTool) which parameterizes the call to
>>>>> getUsageSize(). The KindTool must be obtained from the
>>>>> AArch64LIRGenerator tool which is obtained from the NodeLIRBuilderTool
>>>>> passed to the AArch64AddressNode generate() method.
>>>>> 
>>>>> I don't doubt that there is some sneaky way of ensuring that the
>>>>> AArch64AddressLowering instance obtains access to an
>>>>> AArch64LIRGenerator
>>>>> (and, thence, an AArch64LIRKindTool) behind the scenes. However,
>>>>> implementing a solution like that does not really seem to me to be the
>>>>> correct solution. There is an assumption in the current code that
>>>>> address lowering can be done prior to the Generate phase without
>>>>> needing
>>>>> to identify whatever transfer datum and/or transfer size is at stake
>>>>> when the address is being used. That's a broken assumption and I would
>>>>> prefer to fix it.
>>>>> 
>>>>> One solution would be to have the generic code precompute the common
>>>>> usage LIRKind and pass it to the implementation class which extends
>>>>> AddressLowering. Alternatively, the abstract parent class could provide
>>>>> a convenience method allowing a suitable PlatformKind to be computed.
>>>>> Either way this means that the parent needs to know about the Arch and
>>>>> know how to obtain an arch-specific KindTool. I'm happy to look into
>>>>> proposing a solution along these lines but I'd like to canvass here for
>>>>> comments before doing so.
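>>>>> 
>>>>> For instance, the per-architecture hook might simply grow an extra
>>>>> parameter along these lines (purely a sketch of the proposal, not an
>>>>> existing Graal signature):
>>>>> 
>>>>>    public abstract class AddressLowering {
>>>>>        /**
>>>>>         * Lower an address all of whose usages transfer data of the given
>>>>>         * platform kind; usageKind is null when the usages disagree.
>>>>>         */
>>>>>        public abstract AddressNode lower(ValueNode base, ValueNode offset,
>>>>>                                          PlatformKind usageKind);
>>>>>    }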
>>>>> 
>>>>> Another alternative would be to perform address lowering at a later
>>>>> stage, perhaps in the back end even. This would perhaps require doing
>>>>> some of the transforms done in e.g. AMD64AddressLowering as generic
>>>>> transforms and having a generic AddressNode which allowed for a
>>>>> non-constant index and/or constant displacement. I'm also happy to
>>>>> consider how this might work but I'd probably need more advice about
>>>>> how to go about this.
>>>>> 
>>>>> 
>>>>> The LIRKind of the resulting AddressValue -- is it correct?
>>>>> -----------------------------------------------------------
>>>>> 
>>>>> The patch in the pull request attempts to follow the logic of the
>>>>> preceding code in deriving an LIRKind for whatever AddressNode it
>>>>> constructs, whether using register offset or immediate addressing.
>>>>> However, I am not sure what the original logic is intended to achieve.
>>>>> 
>>>>> In the original code for AArch64AddressNode.generate() the case where
>>>>> either base or index was null is treated by setting, respectively,
>>>>> baseValue or indexValue to Value.ILLEGAL. Later in that original code
>>>>> this implies that indexReference is also set to Value.ILLEGAL. The
>>>>> LIRKind for the final resulting AArch64AddressValue is computed by
>>>>> executing
>>>>> 
>>>>>    LIRKind kind = LIRKind.combineDerived(tool.getLIRKind(stamp()),
>>>>>                                          baseReference,
>>>>>                                          indexReference);
>>>>> 
>>>>> So, when base is, say, a register and index is null this represents a
>>>>> direct reference through a register with no offset. I would have
>>>>> expected that in this circumstance there was some coherent way of
>>>>> deriving the LIRKind of the address from the LIRKind associated with
>>>>> base. However, because index is null, indexValue is Value.ILLEGAL and
>>>>> hence indexReference is also Value.ILLEGAL. combineDerived handles this
>>>>> case by returning an unknown reference derived from the stamp(). By
>>>>> contrast, if indexReference was null then combineDerived would compute
>>>>> the combined reference by calling makeDerivedReference(base).
>>>>> 
>>>>> My patch follows this code by starting off with indexValue = null, then
>>>>> resetting it to Value.ILLEGAL either if index is null or if it is
>>>>> possible to intervene and replace a constant index with a displacement.
>>>>> Otherwise, when we have a non-null index, indexValue is computed as
>>>>> before by executing
>>>>> 
>>>>>    indexValue = tool.asAllocatable(gen.operand(index));
>>>>> 
>>>>> I have also preserved the original special case processing for
>>>>> AddressingMode.IMMEDIATE_UNSCALED where the indexReference is
>>>>> derived by executing
>>>>> 
>>>>>    indexReference = LIRKind.derivedBaseFromValue(indexValue);
>>>>> 
>>>>> although I'll note that in cases where that addressing mode is
>>>>> introduced by replacing a constant index with an unscaled displacement,
>>>>> indexValue will be Value.ILLEGAL and hence the indexReference will be
>>>>> returned as Value.ILLEGAL, which seems to negate the purpose of that
>>>>> special case handling.
>>>>> 
>>>>> I am suspicious that all of this computation seems to be rather
>>>>> redundant anyway. At the point of use of an Address to encode a Load or
>>>>> Store (or possibly a Prefetch) the LIRKind of the address appears to be
>>>>> ignored and instead the LIRKind of the transfer datum is used to
>>>>> generate the load. Is this computation of the derived kind actually
>>>>> achieving anything important?
>>>>> 
>>>>> regards,
>>>>> 
>>>>> 
>>>>> Andrew Dinn
>>>>> -----------
>>>>> Senior Principal Software Engineer
>>>>> Red Hat UK Ltd
>>>>> Registered in England and Wales under Company Registration No. 03798903
>>>>> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
>> 


