[RFC] ldp/stp peephole optimizations

Zhongwei Yao zhongwei.yao at linaro.org
Fri Dec 22 08:02:39 UTC 2017


We are planning to add AArch64 LDP/STP (load/store pair of registers)
support to C2 code generation for better performance. I think LDP/STP can
be used in the following cases:
A). Register spill/unspill. We've observed many sequential single
stack load/store patterns in C2-generated code for SPECjbb.
B). Besides spilling, LDP is generally not generated for multiple
LoadI/LoadL nodes either. Is there any risk (e.g. implicit checks?) in
combining them, apart from alignment issues?
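
As a rough illustration of case A), the sketch below models a run of 64-bit
spill stores and counts how many adjacent ones could be merged into an STP.
SpillStore, fits_pair_offset() and count_stp_candidates() are hypothetical
names used only for this example and are not HotSpot code. The offset check
reflects the LDP/STP encoding (a signed 7-bit immediate scaled by the access
size); only ascending offsets are considered, to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model of a 64-bit spill store:
//   str xN, [sp, #offset]
struct SpillStore {
    int reg;     // source register number
    int offset;  // byte offset from sp
};

// LDP/STP encode a signed 7-bit immediate scaled by the access size, so the
// offset must be a multiple of `size` in the range [-64 * size, 63 * size].
bool fits_pair_offset(int offset, int size) {
    assert(size == 4 || size == 8);
    return offset % size == 0 && offset >= -64 * size && offset <= 63 * size;
}

// Count how many adjacent store pairs could become a single STP.
int count_stp_candidates(const std::vector<SpillStore>& stores) {
    int pairs = 0;
    std::size_t i = 0;
    while (i + 1 < stores.size()) {
        const SpillStore& a = stores[i];
        const SpillStore& b = stores[i + 1];
        if (b.offset == a.offset + 8 && fits_pair_offset(a.offset, 8)) {
            ++pairs;
            i += 2;  // both stores are consumed by the pair
        } else {
            ++i;
        }
    }
    return pairs;
}
```

A real implementation would also have to handle unspills, 32-bit slots and
descending offsets, which this sketch leaves out.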

I think a peephole pass is the best fit for the above optimizations (gcc/llvm
also have such peephole optimizations). However, the current peephole rules in
the C2 compiler are very limited and I doubt whether they really take effect:
AArch64 has disabled peephole optimizations. x86 has enabled them, but
the instruction sequences to be matched by the rules seem to be very

To address issue A): since spills/unspills are currently handled by the common
MachSpillCopyNode, I was thinking about adding a peephole rule to
match MachSpillCopyNode, but MachSpillCopyNode has no operands (e.g.
mem, src, dst) like the ordinary instructs defined in aarch64.ad. Even if we
could extract them (mem, src, dst) as in
MachSpillCopyNode::implementation(), and even if we could extend the current
peephole rule grammar, expressing such extraction in the peephole
grammar is complex.
So I prefer adding the following manually defined method peephole() to

    virtual MachNode *peephole(Block *block, int block_index,
                               PhaseRegAlloc *ra_, int &deleted);

This makes the patch relatively simple. My prototype patch for A) (it still
has some TODOs and hard-coded values, but it works fine):

Addressing issue B) is somewhat more complicated: we need to extend the current
peephole rule syntax, as I don't think the current simple syntax works for
any useful peephole optimization like the ldp/stp one.

My extended syntax, which works at least for the ldp/stp optimizations:

  peepmatch ( loadI loadI );
  // new grammar (N.mem$field) is described below.
  peepconstraint ( 0.mem$base == 1.mem$base, 0.mem$scale == 1.mem$scale,
                   0.mem$disp - 4 == 1.mem$disp, 0.dst != 1.dst );
  peepreplace ( loadPairI( 1.mem 1.mem ) );
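
To make the constraint concrete, here is a minimal sketch of the same four
conditions written as a plain C++ predicate. MemOperand, LoadI and
satisfies_pair_constraint() are hypothetical stand-ins for the matched
instructions and their operand members (mem$base, mem$scale, mem$disp, dst);
they are not ADLC-generated code.

```cpp
// Hypothetical stand-in for a memory operand and its members.
struct MemOperand {
    int base;   // base register number   (mem$base)
    int scale;  // index scale            (mem$scale)
    int disp;   // displacement in bytes  (mem$disp)
};

// Hypothetical stand-in for a matched loadI instruction.
struct LoadI {
    MemOperand mem;
    int dst;  // destination register number
};

// Mirrors the peepconstraint: i0 and i1 are instructions 0 and 1 of the
// peepmatch. All four conditions must hold for the pair to be replaced.
bool satisfies_pair_constraint(const LoadI& i0, const LoadI& i1) {
    return i0.mem.base == i1.mem.base
        && i0.mem.scale == i1.mem.scale
        && i0.mem.disp - 4 == i1.mem.disp  // 4 bytes apart (32-bit loads)
        && i0.dst != i1.dst;               // distinct destination registers
}
```

The last condition matters because an LDP whose two destination registers are
the same is not a usable encoding.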

But loadPairI is hard to express with the current instruct semantics,
because an instruct in aarch64.ad is currently defined by a match rule. The
match rule is an expression tree made of Ideal nodes.
However, the LDP instruction has no Ideal node (say LoadPair) to
match. And adding a load pair node to the arch-independent Ideal nodes seems

My proposed solution is to add a special arch-dependent operand like iRegIpair:

  operand iRegIpair(iRegI reg1, iRegI reg2)
   format %{ "pair: reg1, reg2"%}; // hard coded format for now.

This requires updating ADLC to support the iRegIpair operand, because unlike
the current operands, which have 1 register, iRegIpair has 2.

Then use it as loadPairI's operand type like:

instruct loadPairI(indOffI mem, iRegIpair dst)
  match(Set dst mem); //no Ideal Node in match rule.


Then we can use loadPairI in peephole rule's "peepreplace".

Only constraints between operands are supported in peephole
rules. But to check whether adjacent loads read from adjacent
memory addresses, we need to check an operand's members, like (0.mem$disp -
4 == 1.mem$disp). My solution is to add new grammar like 0.mem$disp to
extract a member of an operand in ADLC (peep_constraint_parse()).

Another issue with the peephole optimization is that it only matches
adjacent instructions in the same basic block. This leads to many
missed matches when loads are not scheduled next to each other.
So I propose to delay the peephole phase until just before final
code emission (the fill_buffer() function), which is after
instruction scheduling. After instruction scheduling, we could
match more adjacent loads.
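
The effect of instruction order on an adjacency-only matcher can be sketched
with a toy model. Instr and count_adjacent_load_pairs() are hypothetical and
not part of C2. For simplicity the sketch accepts either displacement order,
which is an assumption of the example and not part of the rule above.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical, simplified instruction: a 32-bit load or "anything else".
struct Instr {
    bool is_load;
    int base;  // base register number (loads only)
    int disp;  // displacement in bytes (loads only)
    int dst;   // destination register number
};

// Count the load pairs that a matcher looking only at adjacent instructions
// in one basic block would find.
int count_adjacent_load_pairs(const std::vector<Instr>& block) {
    int pairs = 0;
    std::size_t i = 0;
    while (i + 1 < block.size()) {
        const Instr& a = block[i];
        const Instr& b = block[i + 1];
        bool pairable = a.is_load && b.is_load
                     && a.base == b.base
                     && std::abs(a.disp - b.disp) == 4
                     && a.dst != b.dst;
        if (pairable) {
            ++pairs;
            i += 2;  // both loads are consumed by the pair
        } else {
            ++i;
        }
    }
    return pairs;
}
```

With two pairable loads separated by an unrelated instruction the function
finds no pair; once the two loads sit next to each other it finds one.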

My draft patch to address B) is at:

What do you think? Any feedback is welcome!

Best regards,
