RFR: AArch64: Implementing address lowering to make use of immediate address mode

Mon Mar 20 08:43:16 UTC 2017

Hi Tom,

Thanks for the advice and apologies for not replying sooner -- I have
been chasing a very nasty bug in the (non-graal) jdk8u x86 JVM code for
the last few weeks.

I'll attempt to follow your advice and write an alternative lowering
phase for use by AArch64. However, I'd also like to see if there is a
better approach that either postpones lowering or, at least, unifies it
for the different architectures. I'll let you know of progress on both
fronts as/when available.

regards,

Andrew Dinn
-----------
Senior Principal Software Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander

On 06/03/17 23:31, Tom Rodriguez wrote:
> 
> 
> Andrew Dinn wrote:
>> Hi Tom,
>>
>>
>> On 01/03/17 00:39, Tom Rodriguez wrote:
>>> There's nothing about the AddressLoweringPhase that says the mapping of
>>> old the OffsetAddressNode to machine dependent address mode must be one
>>> to one.  You are free to examine the usages and generate a distinct
>>> address node for each use.  GVN will common any duplicates back
>>> together.  You still will need to select the right forms for the backend
>>> so it doesn't address all the problems you mention.
>>
>> Thanks very much for replying. I'm not quite sure I follow what you are
>> saying but let me try to confirm that I am on the right track.
>>
>> I do understand that each usage of some given OffsetAddress can be
>> substituted /per usage/ by an appropriate AArch64AddressNode without
>> fear of introducing unnecessary replicas. The thrust of your comment
>> seems to be that my current patch doesn't actually need to detect a
>> single, unique LIRKind for all usages in order to be able to embed a
>> constant index as displacement. It could instead traverse all usages
>> and, where possible, replace the OffsetAddress referenced from the usage
>> with a lowered AddressNode employing a displacement constructed
>> according to the LIRKind of that usage.
>>
>> Is that what you are suggesting?
> 
> Yes.
> 
>> If so then I agree that I might perhaps
>> be able tweak the patch to do this. However, that's really somewhat
>> incidental to the concerns I originally raised, perhaps even providing
>> an a fortiori argument for addressing them. The AddressLowering phase is
>> given an AddressNode to lower, a node which may be shared by multiple
>> usages. It is not given a usage (i.e. a node using the AddressNode) and
>> asked whether it wants to provide a lowered address node specific to
>> that usage.
>>
>> So, the problem remains that the lowering model AddressLowering has
>> adopted is based on a misguided notion that lowering can be scheduled
>> per AddressNode without reference to usage. I really want to try to fix
>> the model not the current patch.
> 
> Well the model was adequate for the platforms we supported at the time
> which is why I think it was chosen.  The current
> AddressLoweringPhase/AddressLowering logic is simple enough that I'm not
> sure it really makes sense to make it complex just to support AArch64.
> I'd probably just write a custom phase to do what needs to be done, and
> factor out the creation of the address lowering phase in
> HotSpotSuitesProvider so that an AArch64HotSpotSuitesProvider could
> provide it's own completely custom implementation.  I'd probably
> directly pass in the required AArch64HotSpotLIRKindTool as well which
> you can just construct by hand.  It could that the the kind tool should
> be part of the HotSpotProviders instead of buried in the LIRGenerator
> but it hasn't needed to be more visible until now.  It seems like you
> would have everything required for proper code generation then.
> 
>>
>> If your point is that AddressLowering ought, perhaps, to be rewritten to
>> operate by considering each usage in turn then that would indeed be a
>> better model.
>>
>> Apologies if I am barking up the wrong tree entirely here and have not
>> grasped what you are getting at.
>>
>> Of course, m comments about the need to have access to the usage kind
>> (and, hence, to a KindTool) and my failure to understand how a derived
>> LIRKind should be computed for any lowered AddressNode also still stand
>> in the hope of some comment or explanation which would, of course, be
>> gratefully received.
> 
> As far as the LIRKind of an AddressValue most of the time it is actually
> completely unused.  The purpose of the derived information is for
> tracking when a pointer into the middle of a Java object has been
> materialized as a value so that it can be described to the garbage
> collector or for error checking that a derived value isn't live across a
> safepoint.  The LIRKind of an AddressValue is the LIRKind of the address
> as a computation but the LIRKind of the memory operation is LIRKind of
> the datum being transferred so they are only loosely related.  So the
> only case where the LIRKind of an address value will ever appear as a
> value is for an LEA style operation.
> 
> I suspect I haven't addressed all your concerns about the LIRKind for
> addresses so please follow up with any clarifying questions.
> 
> tom
> 
>>
>> regards,
>>
>>
>> Andrew Dinn
>> -----------
>>
>>> Andrew Dinn wrote:
>>>> I have patched (a slightly out of date version of) Graal to allow
>>>> AArch64 to make use of scaled and unscaled immediate addressing. This
>>>> generates much better code but the implementation is 'wrong' in several
>>>> important respects. A pull request including the change is available
>>>> here
>>>>
>>>>     https://github.com/graalvm/graal-core/pull/258
>>>>
>>>> I have no wish for this change to be included as is - I posted it
>>>> merely
>>>> to enable discussion of what is really needed.
>>>>
>>>> The patch deliberately bypasses the normal AddressLowering phase (I'll
>>>> explain why in a second). Instead it intercepts translation of
>>>> addresses
>>>> during the generate phase. Basically, it attempts to improve addresses
>>>> initially generated as an OffsetAddress (i.e. supplied with 2 Values,
>>>> base and index) when the index is an integer constant whcih can be
>>>> embedded as an immediate displacement (I guess it ought to handle the
>>>> reverse case where base is a constant and index a general Value but
>>>> nothing appears to be generating addresses with constants in that
>>>> order). It does so by 'nulling' the index saving the constant as a
>>>> displacement and resetting the addressing mode from register offset to
>>>> scaled or unscaled immediate.
>>>>
>>>> So, before applying the patch addresses with constant index were
>>>> 'improved' by replacing the constant node with a constant load and
>>>> relying on the use of register_offset addressing mode to do the actual
>>>> memory operation. The result is code like this:
>>>>
>>>>     movz x3, #0xc           # constant load of offset value
>>>>     movk x3, #0x0, lsl #16
>>>>     ldr w1 [x2, x3]         # (1) load via register_offset addressing
>>>>
>>>> The patch tries instead to transform the node to use either scaled or
>>>> unscaled immediate addressing resulting, respectively, in code like
>>>> this:
>>>>
>>>>     ldr w1, [x2,#12]        # (2) load via scaled immediate addressing
>>>>
>>>> or (let's assume we had a negative offset mandating an unscaled load)
>>>>
>>>>     ldr w1, [x2, #-12]      # (3) load via unscaled immediate
>>>> addressing
>>>>
>>>> There are two related reasons why this has been done in the Generate
>>>> phase rather than in the AddressLowering phase. I present them both in
>>>> turn below in order to pose some questions about why AddressLowering is
>>>> currently done as is and to suggest what might be needed to fix it.
>>>>
>>>> I have also include a critique of the way a derived LIRKind is computed
>>>> for the AddressValue returned by method generate(). I am unclear why it
>>>> is currently being computed as is and also unclear what would be an
>>>> appropriate derived value when immediates are encoded. Advice or
>>>> comments regarding the patch and the following critique would be very
>>>> welcome.
>>>>
>>>> Scaling depends on the transfer size
>>>> ------------------------------------
>>>>
>>>> The first problem is that for scaled memory addressing the decision as
>>>> to whether or not a given constant index can be embedded as an
>>>> immediate
>>>> constant in the address node depends upon how the address is used by
>>>> one
>>>> or more memory ops.
>>>>
>>>> In code example (2) which uses scaled addressing above the load
>>>> instruction is transferring a 32 bit datum (as indicated in the
>>>> assembly
>>>> code by the target register being named w1). The load address is the 64
>>>> bit value in r1 (as indicated in the assembly code by the base register
>>>> being named x1) plus constant offset 12.
>>>>
>>>> A scaled immediate memory op can embed a 12 bit unsigned displacement.
>>>> However, the unit size for the embedded value is determined by the size
>>>> of the accessed datum. So, for a 32 bit load/store the set of offsets
>>>> which are valid is {4, 8, ... 4 * (2^12 - 1)}. For a byte load the set
>>>> of valid offsets is {1, 2, ... (2^12-1)}.
>>>>
>>>> Obviously, as the name clearly indicates, there is no dependency on
>>>> transfer size when using unscaled addressing. Memory ops employing this
>>>> latter addressing may embed any signed 9-bit displacement which is
>>>> always counted in a unit size of bytes. So, translation to use unscaled
>>>> addressing in the AddressLowering phase doesn't present any great
>>>> problem. However, generating the best code requires implementing both
>>>> modes.
>>>>
>>>> The upshot is that in order to determine whether a constant index
>>>> can be
>>>> replaced by an immediate node the lowering code needs to investigate
>>>> usages of the address node and establish that all usages employ a
>>>> unique
>>>> transfer size.
>>>>
>>>> As you can see in the pull request, the patch includes a method
>>>> getUsageSize() which does that job. It traverses all usages and returns
>>>> either a single platform kind which defines a transfer size common to
>>>> all usages or null if no such unique common kind exists. This leads to
>>>> the second problem. (n.b. I have finessed Prefetch usages for now
>>>> because the current AArch64 Graal does not generate code for prefetch.
>>>> That will need fixing when it is implemented).
>>>>
>>>> Identification of the platform kind requires a generator
>>>> --------------------------------------------------------
>>>>
>>>> The above requirement explains why the lowering code is implemented as
>>>> part of the Generate phase instead of in the AddressLowering phase.
>>>> Identifying the PlatformKind associated with a specific usage of the
>>>> AddressNode requires translating the stamp of each usage to an LIRKind.
>>>> That mandates availability of a suitable instance of LIRKindTool
>>>> (actually an AArch64LIRKindTool) which parameterizes the call to
>>>> getUsageSize(). The KindTool must be obtained from the
>>>> AArch64LIRGenerator tool which is obtained form the NodeLIRBuilderTool
>>>> passed to the AArch64AddressNode generate() method.
>>>>
>>>> I don't doubt that here is some sneaky way of ensuring that the
>>>> AArch64AddressLowering instance obtains access to an
>>>> AArch64LIRGenerator
>>>> (and, thence, an AArch64LIRKindTool) behind the scenes. However,
>>>> implementing a solution like that does not really seem to me to be the
>>>> correct solution. There is an assumption in the current code that
>>>> address lowering can be done prior to the Generate phase without
>>>> needing
>>>> to identify whatever transfer datum and/or transfer size is at stake
>>>> when the address is being used. That's a broken assumption and I would
>>>> prefer to fix it.
>>>>
>>>> One solution would be to have the generic code precompute the common
>>>> usage LIRKind and pass it to the implementation class which extends
>>>> AddressLowering. Alternatively, the abstract parent class could provide
>>>> a convenience method allowing a suitable PlatformKind to be computed.
>>>> Either way this means that the parent needs to know about the Arch and
>>>> know how to obtain an arch-specific KindTool. I'm happy to look into
>>>> proposing a solution along these lines but I'd like to canvas here for
>>>> comments before doing so.
>>>>
>>>> Another alternative would be to perform address lowering at a later
>>>> stage, perhaps in the back end even. This would perhaps require doing
>>>> some of the transforms done in e.g. AMD64AddressLowering as generic
>>>> transforms and having a generic AddressNode which allowed for a
>>>> non-constant index and/or constant displacement. I'm also happy to
>>>> consider how this might work but I'd probably need more advice about
>>>> how
>>>> to go about this.
>>>>
>>>>
>>>> The LIRKind of the resulting AddressValue -- is it correct?
>>>> -----------------------------------------------------------
>>>>
>>>> The patch in the pull request attempts to follow the logic of the
>>>> preceding code in deriving an LIRKind for whatever AddressNode it
>>>> constructs, whether using register offset or immediate addressing.
>>>> However, I am not sure what the original logic is.
>>>>
>>>> In the original code for AArch64AddressNode.generate() the case where
>>>> the either base or index was null is treated by setting, respectively,
>>>> baseValue or indexValue to Value.ILLEGAL. Later in that original code
>>>> this implies that indexReference is also set to Value.ILLEGAL. The
>>>> LIRKind for the final resulting AArch64AddressValue is computed by
>>>> executing
>>>>
>>>>     LIRKind kind = LIRKind.combineDerived(tool.getLIRKind(stamp()),
>>>>                                           baseReference,
>>>> indexReference);
>>>>
>>>> So, when base is, say, a register and index is null this represents a
>>>> direct reference through a register with no offset. I would have
>>>> expected that in this circumstance there was some coherent way of
>>>> deriving the LIRKind of the address from the LIRKind associated with
>>>> base. However, because index is null, so indexValue is Value.ILLEGAL
>>>> and
>>>> hence indexReference is also Value.ILLEGAL. combineDerived handles this
>>>> case by returning an unknown reference derived from the stamp(). By
>>>> contrast, if indexReference was null then combineDerived would compute
>>>> the combined reference by calling makeDerivedReference(base).
>>>>
>>>> My patch follows this code by starting off with indexValue = null then
>>>> resetting it to Value.ILLEGAL either if it is null or if it is possible
>>>> to intervene and replace a constant index with a displacement.
>>>> Otherwise, when we have a non-null index, indexValue is computed as
>>>> before by executing
>>>>
>>>>     indexValue = tool.asAllocatable(gen.operand(index));
>>>>
>>>> I have also preserved the original special case processing for
>>>> AddressingMode.IMMEDIATE_UNSCALED where the indexReference is
>>>> derived by
>>>> executing
>>>>
>>>>     indexReference = LIRKind.derivedBaseFromValue(indexValue);
>>>>
>>>> although I'll note that in cases where that addressing mode is
>>>> introduced by replacing a constant index with an unscaled displacement
>>>> indexValue will be Value.ILLEGAL and hence the indexReference will be
>>>> returned as  Value.ILLEGAL which seems to negate the purpose of that
>>>> special case handling.
>>>>
>>>> I am suspicious that all of this computation seems to be rather
>>>> redundant anyway. At the point of use of an Address to encode a Load or
>>>> Store (or possibly a Prefetch) the LIRKind of the address appears to be
>>>> ignored and instead the LIRKind of the transfer datum is used to
>>>> generate the load. Is this computation of the derived actually
>>>> achieving
>>>> anything important?
>>>>
>>>> regards,
>>>>
>>>>
>>>> Andrew Dinn
>>>> -----------
>>>> Senior Principal Software Engineer
>>>> Red Hat UK Ltd
>>>> Registered in England and Wales under Company Registration No. 03798903
>>>> Directors: Michael Cunningham, Michael ("Mike") O'Neill, Eric Shander
>