[aarch64-port-dev ] [RFC] ldp/stp peephole optimizations

Fri Jan 12 22:28:45 UTC 2018

Hi Zhongwei,

Great idea!

Here are some small comments on part A, stack spilling. I know you're thinking about a different approach to part "B", so I'm not sure if that means you're redoing "A" as well. These comments may only partially apply.

In aarch64.ad, MachSpillCopyNode::peephole(), around lines 3416-3419:
+  if (((src_lo_rc == rc_stack && dst_lo_rc == rc_int)
+      || (dst_lo_rc == rc_stack && src_lo_rc == rc_int))
+      && ((inst1_src_lo_rc == rc_stack && inst1_dst_lo_rc == rc_int)
+          || (inst1_dst_lo_rc == rc_stack && inst1_src_lo_rc == rc_int)))

This seems to match a lo-spill and a hi-unspill (or the reverse), instead of only both being spills or both unspills.

Also, I think it would be good to factor out the ldp/stp offset range checks into a separate function (inlined). Then use that predicate, instead of inlined checks such as the new ones you added in spill() and unspill() in macroAssembler_aarch64.hpp, as well as the slightly odd pre-existing checks in spill_copy128(). 

And although these checks are only testing the upper-bound (which is OK for stack offsets), more generally we'd want to check both lower and upper bounds, and you could probably make use of the predicate in the part-B patch too.

 - Derek

> -----Original Message-----
> From: aarch64-port-dev [mailto:aarch64-port-dev-
> bounces at openjdk.java.net] On Behalf Of Zhongwei Yao
> Sent: Friday, December 22, 2017 3:03 AM
> To: hotspot-compiler-dev at openjdk.java.net; aarch64-port-
> dev at openjdk.java.net
> Subject: [aarch64-port-dev ] [RFC] ldp/stp peephole optimizations
> 
> Hi,
> 
> We are planning to add AArch64 LDP/STP (load/store pai[Derek] lines 3416034-19r of registers)
> support in C2 code-gen for better performance. I think the LDP/STP can be
> used in following cases:
> A). For register spill/unspill. We've observed many sequential single stack
> load/store patterns in SPECjbb C2 generated code.
> B). Besides spilling, LDP is also not generated generally for multiple
> LoadI/LoadL nodes. Is there any risk (e.g. implicit check?) for combing them
> together, apart from alignment issue?
> 
> I think peephole is the best fit for above optimization (gcc/llvm also has such
> peephole optimization). However, current peephole rules in C2 compiler is
> very limited and I doubt whether it really takes effect -
> AArch64 has disabled peephole optimizations. x86 has enabled it, but the
> instruction sequences to be matched by the rules seems to be very
> uncommon.
> 
> To address issue A), since current spill/unspill are handled by common
> MachSpillCopyNode, I was thinking if we could add peephole rule to match
> MachSpillCopyNode, but MachSpillCopyNode has no operands (e.g.
> mem, src, dst) like ordinary instruct defined in aarch64.ad. Even we may
> extract them (mem, src, dst) like in MachSpillCopyNode::implementation(),
> and even we can extend current peephole rule grammar, expressing such
> extraction in peephole's grammar is complex.
> So I prefer adding following manually defined method peephole() to
> MachSpillCopyNode:
> 
>     virtual MachNode *peephole(Block *block, int block_index, PhaseRegAlloc
> *ra_, int &deleted);
> 
> This makes the patch relative simple. My prototype patch for A) (still some
> TODOs and hardcodes, but it works fine):
>     http://cr.openjdk.java.net/~zyao/RFC_A/
> 
> To address issue B) is somewhat complicated, we need to extend current
> peephole rule syntax, as I don't think current simple syntax works for any
> useful peephole optimizations like ldp/stp opt.
> 
> My extended syntax - at least works for ldp/stp optimizations:
> 
> ------
>   peepmatch ( loadI loadI );
>   peepconstraint (0.mem$base == 1.mem$base, 0.mem$scale ==
> 1.mem$scale, 0.mem$disp - 4 == 1.mem$disp, 0.dst != 1.dst); // new
> grammar is described below.
>   peepreplace (loadPairI(1.mem 1.mem))
> ------
> 
> But for loadPairI, it is hard to express in current instruct semantic.
> Because current instruct in aarch64.ad is defined by a match rule. The match
> rule is an expression tree and made of Ideal Node.
> However, LDP instruction doesn't have Ideal Node (say LoadPair) to match.
> And adding load pair node to arch-independent Ideal node seems strange.
> 
> My proposed solution is: add a special arch dependent operand like
> iRegIpair:
> 
> ------
>   operand iRegIpair(iRegI reg1, iRegI reg2)
>   %{
>    constraint(ALLOC_IN_RC(any_reg32));
>    op_cost(0);
>    format %{ "pair: reg1, reg2"%}; // hard coded format for now.
>    interface(REG_INTER);
>   %}
> ------
> 
> This needs to update ADLC to support iRegIpair operand. Because unlike
> current operand which has 1 register, iRegIpair has 2.
> 
> Then use it as loadPairI's operand type like:
> 
> ------
> instruct loadPairI(indOffI mem, iRegIpair dst) %{
>   match(Set dst mem); //no Ideal Node in match rule.
>   ...
> 
> %}
> ------
> 
> Then we can use loadPairI in peephole rule's "peepreplace".
> 
> Since only constraints between operands are supported in peephole rule. But
> to check whether the adjacent loads are loaded from adjacent memory
> address, we need to check operand's member, like (0.mem$disp -
> 4 == 1.mem$disp), My solution is: add new grammar like 0.mem$disp to
> extract member in operand in ADLC (peep_constraint_parse()).
> 
> Another issue for peephole optimization is that it only matches adjacent
> instructions in the same basic block. This leads to many missing matches
> when loads are not scheduled to adjacent.
> So I propose to delay peephole phase to the place just before final code emit
> (the fill_buffer() function). This place is after instruction scheduling. So after
> instruction scheduling, we could match more adjacent loads.
> 
> My draft patch to address B) is at:
>   http://cr.openjdk.java.net/~zyao/RFC_B/
> 
> What do you think? Welcome any feedback!
> 
> --
> Best regards,
> Zhongwei