ppc C2 optimization possibilities

Mon Oct 10 21:09:11 UTC 2016

Hi Gustavo,

I post this to hotspot-compiler-dev, I guess that's where it belongs :)
find my comment inline:

> 1. Use indexed loads efficiently (no addi before a ld, for instance);
Indexed loads change two registers, index and loaded value, right?
This is almost impossible to express in adl, and the matcher can not handle
it, as it only matches DAGs.  There is a special case for DivMod in x86.
Load might be more complicated because it has memory edges.

> 2. Use of dot instructions to avoid an additional compare instruction;
I'm not familiar with dot instructions, I assume it's an arithmetic instruction
that sets a condition code?
Similarly as 1.) the dot instruction sets two (condition) registers. It would be possible to match 
branch, compare and add into one MachNode.  But you need all combinations ...

An alternative would be to add peephole rules for this, see ad file. We did not 
explore them at all.  They are run as a very last step, and you can rewrite
assembler sequences.

> 3. Improve Hoisting and CSE (Common Sub-expression Elimination) in some
>    particular cases;

Yep, there is room for improvement. Probably a shared change.

> 4. Avoid unnecessary reload of constants;
The reload is probably the result of rematerialization during register 
allocation. You might see that with PrintOptoAssembly.  This is hard to 
fix in the register allocation, might be possible in a pass thereafter.
Other of the cases you mention below are 'different' constants, i.e., 
different in type.  Could be only removed by peephole optimization.

> 5. Eliminate some redundant register moves;
They might be due to different types, see below.

> 6. Eliminate some redundant compare sequences.
> 
> I've just started to have a deeper look at it. A partner (Eldorado) is also
> helping to address these issues.
> 
> I'm trying to get more familiar with the AD language since I believe most
> changes are related to the match rule in the IR and the instruction
> emission determined in the ppc.ad file. But I could not find a much
> comprehensive documentation on it, even on the Wiki (maybe on some
> book?).
> 
> If you have any advice or any comment on were/what I should look firstly in
> the code (which phase on the compilation chain) I would very glad.
> 
> Thank you!
> 
> 
> Kind regards,
> Gustavo
> 
> 
> [From MunninPageCursor]
> === For addressing, omit addis with 0, use the existing base register (r22
> here).
> 2244306  0.3227 :    3fff60240acc:      addis   r17,r22,0
>  361236  0.0519 :    3fff60240ad0:      ld      r17,0(r17)

You need to look at -XX:+PrintOptoAssembly.
I would assume that you are running in 32-bit compressed oop mode 
(check -XX:+PrintCompressedOopsMode) and 
r22 is a compressed oop. Then DecodeN is matched into AddP, 
and r17 holds an oop.  The IR doesn't grok that it's the same value
if it has different types.  This should mostly be an issue in 32-bit oop mode, 
with works only up to 4G heap.  Run with -XX:HeapBaseMinAddress = 4G / 32G / 33G 
to get the different compressed oop modes, which should be better with this.
The copies will then be replaced by the decoding.
It might also be affected by settings/transformations for NullCheckElimination.

We tried to implement an optimization for 32-bit compressed oop mode, 
part of which are in openJdk.  I never contributed the shared changes because
it showed no effect at all.  It removed only 30% of the useless copies, and introduced
additional MachSpillCopies.
One should not issue DecodeN / EncodeP in the frontend at all, and add new 
Ideal nodes to the frontend that do LoadN2P and StoreP2N. Then all 
oop->cOop and cOop->oop copies would be gone.  This is quite a shared
change, though.  jdk10?

> === Use dot for extend sign operation instead of following with compare
> against 0.
>  390635  0.0562 :    3fff60240af0:      extsb   r17,r18
>  923383  0.1328 :    3fff60240af4:      cmpwi   cr6,r17,0
Must see the MachNodes for this. Might be possible to 
match.

> === Use cmpdi instead of li/cmpd (r14 killed after this).
>  240944  0.0346 :    3fff60240bec:      li      r14,0
>    6943 1.0e-03 :    3fff60240bf0:      cmpld   cr6,r3,r14
Also must see PrintOptoAssembly. This should not happen. There are rules
for this, for example cmpI_reg_imm16
which matches a register (iRegIsrc src1) and a 16-bit immediate (immI16 src2), which itself
matches a ConI node (see decl of operand immI16) as inputs. The node matched is CmpI. It sets
a condition register (flagsReg crx).  Look for CmpL, CmpP, CmpN etc in match() statement.
Those match ideal "class CmpLNode" etc.
The resulting node is "class cmpI_reg_imm16Node" in ad_ppc64.hpp.  The match rule is 
hard to read/find in the generated code.

> === Omit sign extension when comparing to 0, compare with value from zero
> extended load.
>  179491  0.0258 :    3fff60240aec:      lbz     r18,0(r15)
>  390635  0.0562 :    3fff60240af0:      extsb   r17,r18
>  923383  0.1328 :    3fff60240af4:      cmpwi   cr6,r17,0

> === Hoist CSE of constant load.
>    (Original Sequence)
>  852288  0.1225 :    3fff60240c20:      cmpwi   cr6,r15,0
>        140 2.0e-05 :    3fff60240c24:      beq     cr6,3fff60240c30
>            1 1.4e-07 :    3fff60240c28:      li      r15,0
>                              :    3fff60240c2c:      stb     r15,0(r14)
> 1374509  0.1976 :    3fff60240c30:      li      r15,0
>    4224 6.1e-04 :    3fff60240c34:      stw     r15,60(r23)
> 
>    (Optimized Sequence):
>  852288  0.1225 :    3fff60240c20:      cmpwi   cr6,r15,0
>                 :    3fff60240c24:      li      r15,0		<if register selection can
> avoid collision with cmpwi's registers, other placement possible>
>                 :    3fff60240c28:      beq     cr6,3fff60240c34
>                 :    3fff60240c30:      stb     r15,0(r14)
>                 :    3fff60240c34:      stw     r15,60(r23)
The problem is that the first is Byte, the second is Int. Two different
constants for C2.  You would have to transform 

    ConI(0)     ConB(0)
        |               |
        |               |
      useI          useB

to 
                 ConB(0)
               /            \
              /              \
      ConvB2I        useB
           |
           |
         useI

But this will hinder other optimizations as matching ConI as immediate operand.

Basically, we implemented a scheduler on ia64. It did a lot, but the hoisting payed
off the most. But it required a lot of shared changes. So I think this would 
have an effect on ppc, too.  You can not express hoisting as match rule, though.

We also have a cse phase during gcm, that's pretty stand-alone, but
I can't tell whether it pays off.

Basically, the placement is done during gcm, and then register allocation
runs. After gcm, mostly only rematerialization moves code.
Final code 'movement' can be done by a intra-block scheduler (-XX:+OptoScheduling) 
which is off as I encountered a wrong schedule, see c2_globals_ppc.hpp.
Also, for this the pipeline specifications in the ad file must be improved.
OptoScheduling and Peephole optimizations are exclusive.

> [From StoreNodeRelationshipCursor;nextChainStart]
> === Avoid reload of constant (no branches into the following code)
>  728853  0.1048 :    3fff60260f5c:      li      r15,0		### Load constant
> into r17 instead of r15, since r15 is clobbered prior to last stw (...0f84)
>    6864 9.9e-04 :    3fff60260f60:      stb     r15,64(r17)
>       1 1.4e-07 :    3fff60260f64:      li      r14,0		### Delete
>  539571  0.0776 :    3fff60260f68:      std     r14,72(r17)	###    (Use r17)
>    4224 6.1e-04 :    3fff60260f6c:      lis     r14,0		### Delete
>  262344  0.0377 :    3fff60260f70:      std     r14,48(r17)	###    (use r17)
>  665350  0.0957 :    3fff60260f74:      ld      r14,40(r1)
>    4792 6.9e-04 :    3fff60260f78:      lbz     r15,93(r14)
>  199625  0.0287 :    3fff60260f7c:      std     r22,112(r1)
>  852368  0.1225 :    3fff60260f80:      li      r17,0		### Delete
>    9512  0.0014 :    3fff60260f84:      stw     r17,88(r14)
>  232750  0.0335 :    3fff60260f88:      ld      r17,112(r1)
> 
> 
> [From RelationshipGroupStore;readRecord]
> === Use indexed load in place of add followed by load with 0 offset.
> 131733  0.0189 :    3fff6025b684:       add     r14,r20,r14	### delete
> 359529  0.0517 :    3fff6025b688:       lbz     r14,0(r14)	### do lbz
> r14,r20,r14
> 
> 
> === Eliminate redundant move.
>  111964  0.0161 :    3fff6025b63c:      ld      r14,48(r22)	### load into r20 in
> the first place
>  266837  0.0384 :    3fff6025b640:      extsw   r18,r31
> 3965285  0.5701 :    3fff6025b644:      mr      r20,r14		### delete
> 1104317  0.1588 :    3fff6025b648:      add     r14,r20,r18	### (r14 killed here)
> 
> 
> [From StorePropertyCursor;next]
> === Eliminate redundant compare (note cmpwi below for values 7, 3, 1, with
> a repetition each)
>   18451  0.0027 :    3fff602f1b70:      cmpwi   cr5,r15,7
>                 :    3fff602f1b74:      beq     cr5,3fff602f1cbc
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1bc>
>   70972  0.0102 :    3fff602f1b78:      cmpwi   cr6,r15,7
>   14754  0.0021 :    3fff602f1b7c:      bgt     cr6,3fff602f1c30
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x130>
>   97107  0.0140 :    3fff602f1b80:      cmpwi   cr5,r15,3
>   33755  0.0049 :    3fff602f1b84:      beq     cr5,3fff602f1c1c
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x11c>
>      51 7.3e-06 :    3fff602f1b88:      cmpwi   cr6,r15,3
>   68819  0.0099 :    3fff602f1b8c:      bgt     cr6,3fff602f1bd0
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0xd0>
>                 :    3fff602f1b90:      cmpwi   cr5,r15,1
>                 :    3fff602f1b94:      beq     cr5,3fff602f1bbc
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0xbc>
>                 :    3fff602f1b98:      cmpwi   cr6,r15,1
>                 :    3fff602f1b9c:       ble     cr6,3fff602f1bb4
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0xb4>
> 
> === Two things here:
> ===   1) Redundant move "mr r3,r14" can be eliminated if load goes into r3 to
> begin with.
> ===   2) the "ld r14,56(r1) repeated below can be treated as a CSE and done
> prior to the first compare (r14 is dead @ ...f1cc8)
>          (The case at 3fff602f1bb4 doesn't require it, but since it is a load from
> the stack-frame its ok to perform)
>  18451  0.0027 :    3fff602f1b70:       cmpwi   cr5,r15,7
>                :    3fff602f1b74:       beq     cr5,3fff602f1cbc
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1bc>
>  70972  0.0102 :    3fff602f1b78:       cmpwi   cr6,r15,7
>  14754  0.0021 :    3fff602f1b7c:       bgt     cr6,3fff602f1c30
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x130>
>  97107  0.0140 :    3fff602f1b80:       cmpwi   cr5,r15,3
>  33755  0.0049 :    3fff602f1b84:       beq     cr5,3fff602f1c1c
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x11c>
>     51 7.3e-06 :    3fff602f1b88:       cmpwi   cr6,r15,3
>  68819  0.0099 :    3fff602f1b8c:       bgt     cr6,3fff602f1bd0
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0xd0>
>                :    3fff602f1b90:       cmpwi   cr5,r15,1
>                :    3fff602f1b94:       beq     cr5,3fff602f1bbc
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0xbc>
>                :    3fff602f1b98:       cmpwi   cr6,r15,1
>                :    3fff602f1b9c:       ble     cr6,3fff602f1bb4
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0xb4>
>                :    3fff602f1ba0:       ld      r14,56(r1)	###(2) CSE
>                :    3fff602f1ba4:       addis   r14,r14,0	### Delete
>                :    3fff602f1ba8:       ld      r14,0(r14)
>                :    3fff602f1bac:       mr      r3,r14		### (1) Redundant
> move; load into r3 instead of r14
>                :    3fff602f1bb0:       b       3fff602f1cc8
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1c8>
>                :    3fff602f1bb4:       li      r3,0
>                :    3fff602f1bb8:       b       3fff602f1cc8
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1c8>
>                :    3fff602f1bbc:       ld      r14,56(r1)	###(2) CSE
>                :    3fff602f1bc0:       addis   r14,r14,0	### Delete
>                :    3fff602f1bc4:       ld      r14,8(r14)
>                :    3fff602f1bc8:       mr      r3,r14		### (1) Redundant
> move; load into r3 instead of r14
>                :    3fff602f1bcc:       b       3fff602f1cc8
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1c8>
>  21303  0.0031 :    3fff602f1bd0:       cmpwi   cr5,r15,5
>  66514  0.0096 :    3fff602f1bd4:       beq     cr5,3fff602f1c08
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x108>
>     67 9.6e-06 :    3fff602f1bd8:       cmpwi   cr6,r15,5
> 183659  0.0264 :    3fff602f1bdc:       ble     cr6,3fff602f1bf4
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0xf4>
>    248 3.6e-05 :    3fff602f1be0:       ld      r14,56(r1)	###(2) CSE
>                :    3fff602f1be4:       addis   r14,r14,0	### Delete
>     22 3.2e-06 :    3fff602f1be8:       ld      r14,16(r14)
>  38188  0.0055 :    3fff602f1bec:       mr      r3,r14		### (1)
> Redundant move; load into r3 instead of r14
>   6576 9.5e-04 :    3fff602f1bf0:       b       3fff602f1cc8
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1c8>
>                :    3fff602f1bf4:       ld      r14,56(r1)	###(2) CSE
>                :    3fff602f1bf8:       addis   r14,r14,0	### Delete
>                :    3fff602f1bfc:       ld      r14,24(r14)
>                :    3fff602f1c00:       mr      r3,r14		### (1) Redundant
> move; load into r3 instead of r14
>                :    3fff602f1c04:       b       3fff602f1cc8
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1c8>
>   4754 6.8e-04 :    3fff602f1c08:       ld      r14,56(r1)	###(2) CSE
>                :    3fff602f1c0c:       addis   r14,r14,0	### Delete
>      6 8.6e-07 :    3fff602f1c10:       ld      r14,32(r14)
>   1011 1.5e-04 :    3fff602f1c14:       mr      r3,r14		### (1) Redundant
> move; load into r3 instead of r14
>      6 8.6e-07 :    3fff602f1c18:       b       3fff602f1cc8
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1c8>
>                :    3fff602f1c1c:       ld      r14,56(r1)	###(2) CSE
>                :    3fff602f1c20:       addis   r14,r14,0	### Delete
>                :    3fff602f1c24:       ld      r14,40(r14)
>                :    3fff602f1c28:       mr      r3,r14		### (1) Redundant
> move; load into r3 instead of r14
>                :    3fff602f1c2c:       b       3fff602f1cc8

I don't really remember the issue with compares ... one thing is that
the  register allocation can not spill the condition code. Therefore  it must be able to 
rematerialize the compare.  Maybe I will remember ... I saw it 
happen that Cmp results were reused somewhere.

> [RelationshipStore;readRecord]
> === Use andis. in place of following sequence.
>    845 1.2e-04 :    3fff6027e134:       lis     r18,7
>  36054  0.0052 :    3fff6027e138:       and     r18,r17,r18
Should be matched  by andI_reg_uimm16 or similar.

> === Another redundant move case, slightly more spread.
>     21 3.0e-06 :    3fff6027de44:       ld      r14,48(r11)	### This load should
> target r28
> 201938  0.0290 :    3fff6027de48:       addis   r15,r24,0
>  56239  0.0081 :    3fff6027de4c:       ld      r15,16(r15)
> 698530  0.1004 :    3fff6027de50:       lwa     r15,32(r15)
>     51 7.3e-06 :    3fff6027de54:       mr      r28,r14		### Delete
> 12361531  1.7772 :    3fff6027de58:     extsw   r14,r25		### (Kills r14)
Yes, bad register allocation.

> === Use dot variant and eliminate compare with 0, must be ok with using cr7
> 301138  0.0433 :    3fff6027e06c:       rldic   r15,r15,10,22	### Use dot
> variant into cr7
>  17978  0.0026 :    3fff6027e070:       cmpdi   cr5,r15,0	### Delete - branch
> must use cr7 though

> === lis of 0 is same as li 0, eliminate redundancy.
>  40511  0.0058 :    3fff60285d9c:       lis     r15,0		### Move "li r5,0"
> here (replacing lis)
> 123000  0.0177 :    3fff60285da0:       std     r15,48(r14)	### Change to use r5
>  42789  0.0062 :    3fff60285da4:       ld      r3,72(r1)
>  16111  0.0023 :    3fff60285da8:       li      r5,0		### r5 must be set
> for call, so set it above and eliminate lis.
>     36 5.2e-06 :    3fff60285dac:       bl      3fff60240a90
Again, different types of constants I assume.  Candidate for Peephole.

General advice:

To understand adl: Look at the 'instruct' declarations.

Their names are pretty standardized, trying to say what they match.
The 'match' statement is the match tree, other parts are in 
the operand specifications, which in general are trivial on ppc 
(register or immediate). 

Don't start with 32-bit compressed oops mode.  It's anyways 
only for small applications, as it's limited to < 4G heap.  
On the other side, look at 32-bit cOops mode, it's not 1:1 
what we run in SAP JVM, so it might be suboptimal :)

Look at IR dumps of the different phases. Oracle has a graph viewer, 
(we have a different one, though) which dumps every phase.
Here you can see who messed it up.  Was it there before register
allocation? before matching?  This gives hints where to improve.
PrintOptoAssembly and PrintAssembly help too. (do you have the
disassembler library? You can download it here 
http://cr.openjdk.java.net/~simonis/ppc-aix-port/index.html)

Phases currently not supported:
peephole optimization and OptoScheduling.
You can just switch on OptoScheduling, it should work for measurements
to see whether it's a useful target.

This is a lot of stuff, feel free to pick an issue and 
continue on it, but not all at the same time please :)

Best regards,
  Goetz.