[aarch64-port-dev ] [RFC] ldp/stp peephole optimizations

Fri Dec 22 11:45:37 UTC 2017

Hi Zhongwei,

I'm not a reviewer. Thank you, it is a great idea to merge loads or 
stores that are issued sequentially. Can you share some more data on that?

Is there a micro-benchmark or some sample program that shows better 
performance on some hardware?
What are the numbers observed?
You mention SPECjbb, is it 2005 or 2015? Which configuration?
In SPECjbb, do you see that most sequential series relate only to stack work?

-Dmitry

On 12/22/2017 11:02 AM, Zhongwei Yao wrote:
> Hi,
>
> We are planning to add AArch64 LDP/STP (load/store pair of registers)
> support in C2 code-gen for better performance. I think LDP/STP can
> be used in the following cases:
> A). For register spill/unspill. We've observed many sequential single
> stack load/store patterns in SPECjbb C2 generated code.
> B). Besides spilling, LDP is also generally not generated for multiple
> LoadI/LoadL nodes. Is there any risk (e.g. implicit check?) in
> combining them, apart from the alignment issue?
>
> I think a peephole is the best fit for the above optimization (gcc/llvm
> also have such peephole optimizations). However, the current peephole
> rules in the C2 compiler are very limited, and I doubt whether they
> really take effect: AArch64 has disabled peephole optimizations. x86
> has enabled them, but the instruction sequences matched by its rules
> seem to be very uncommon.
>
> To address issue A): since spill/unspill is currently handled by the
> common MachSpillCopyNode, I considered adding a peephole rule to
> match MachSpillCopyNode. But MachSpillCopyNode has no operands (e.g.
> mem, src, dst) like an ordinary instruct defined in aarch64.ad. Even
> if we extract them (mem, src, dst) as in
> MachSpillCopyNode::implementation(), and even if we extend the current
> peephole rule grammar, expressing such extraction in the peephole
> grammar is complex.
> So I prefer adding the following manually defined method peephole() to
> MachSpillCopyNode:
>
>      virtual MachNode *peephole(Block *block, int block_index,
>                                 PhaseRegAlloc *ra_, int &deleted);
>
> This makes the patch relatively simple. My prototype patch for A) (it
> still has some TODOs and hard-coded values, but it works fine):
>      http://cr.openjdk.java.net/~zyao/RFC_A/
>
> Addressing issue B) is somewhat more complicated: we need to extend the
> current peephole rule syntax, as I don't think the current simple
> syntax works for any useful peephole optimization such as the ldp/stp
> one.
>
> My extended syntax, which at least works for the ldp/stp optimizations:
>
> ------
>    peepmatch ( loadI loadI );
>    peepconstraint (0.mem$base == 1.mem$base, 0.mem$scale == 1.mem$scale,
>                    0.mem$disp - 4 == 1.mem$disp, 0.dst != 1.dst);
>    // The new grammar is described below.
>    peepreplace (loadPairI(1.mem 1.mem))
> ------
>
> But loadPairI is hard to express with the current instruct semantics,
> because an instruct in aarch64.ad is defined by a match rule. The
> match rule is an expression tree made of Ideal nodes.
> However, the LDP instruction has no Ideal node (say, LoadPair) to
> match, and adding a load-pair node to the arch-independent Ideal nodes
> seems strange.
>
> My proposed solution is to add a special arch-dependent operand like
> iRegIpair:
>
> ------
>    operand iRegIpair(iRegI reg1, iRegI reg2)
>    %{
>     constraint(ALLOC_IN_RC(any_reg32));
>     op_cost(0);
>     format %{ "pair: reg1, reg2"%}; // hard coded format for now.
>     interface(REG_INTER);
>    %}
> ------
>
> This requires updating ADLC to support the iRegIpair operand, because
> unlike current operands, which have one register, iRegIpair has two.
>
> Then use it as loadPairI's operand type like:
>
> ------
> instruct loadPairI(indOffI mem, iRegIpair dst)
> %{
>    match(Set dst mem); //no Ideal Node in match rule.
>    ...
>
> %}
> ------
>
> Then we can use loadPairI in peephole rule's "peepreplace".
>
> Only constraints between operands are supported in peephole rules.
> But to check whether adjacent loads read from adjacent memory
> addresses, we need to check an operand's members, as in
> (0.mem$disp - 4 == 1.mem$disp). My solution is to add new grammar such
> as 0.mem$disp to extract an operand's member in ADLC
> (peep_constraint_parse()).
>
> Another issue for peephole optimization is that it only matches
> adjacent instructions in the same basic block. This leads to many
> missed matches when loads are not scheduled next to each other.
> So I propose delaying the peephole phase to the point just before
> final code emission (the fill_buffer() function). This point is after
> instruction scheduling, so we could match more adjacent loads.
>
> My draft patch to address B) is at:
>    http://cr.openjdk.java.net/~zyao/RFC_B/
>
> What do you think? Any feedback is welcome!
>
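
For readers following the thread, the adjacency test expressed by the
proposed peepconstraint (same base, same scale, displacements 4 bytes
apart, distinct destination registers) can be sketched in plain C++.
This is only an illustration under stated assumptions: Load32,
can_pair() and count_pairs() are hypothetical names, not HotSpot or
ADLC types, and the two loads are taken in program order, with the
second one reading the 4 bytes directly after the first. It does not
model alignment, implicit checks, or LDP offset encoding limits.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a 32-bit load instruction.
// None of these names come from HotSpot; they exist only for this sketch.
struct Load32 {
    int base;   // base register number of the memory operand
    int scale;  // scale of the memory operand
    int disp;   // byte displacement of the memory operand
    int dst;    // destination register number
};

// Mirrors the four conditions of the proposed peepconstraint, written in
// program order: 'second' must read the word immediately after 'first'.
bool can_pair(const Load32& first, const Load32& second) {
    return first.base == second.base &&
           first.scale == second.scale &&
           first.disp + 4 == second.disp &&
           first.dst != second.dst;
}

// Scans one already-scheduled sequence of loads and counts how many
// non-overlapping adjacent pairs could be merged. A load that has been
// paired is not considered again.
std::size_t count_pairs(const std::vector<Load32>& loads) {
    std::size_t pairs = 0;
    std::size_t i = 0;
    while (i + 1 < loads.size()) {
        if (can_pair(loads[i], loads[i + 1])) {
            ++pairs;
            i += 2;  // skip both halves of the merged pair
        } else {
            ++i;
        }
    }
    return pairs;
}
```

Because the scan only looks at neighbouring entries, it also shows why
running the peephole after instruction scheduling matters: two loads
that satisfy can_pair() but are separated by another instruction are
never merged.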