ppc C2 optimization possibilities
Lindenmaier, Goetz
goetz.lindenmaier at sap.com
Mon Oct 10 21:09:11 UTC 2016
Hi Gustavo,
I post this to hotspot-compiler-dev, I guess that's where it belongs :)
find my comment inline:
> 1. Use indexed loads efficiently (no addi before a ld, for instance);
Indexed loads change two registers, index and loaded value, right?
This is almost impossible to express in adl, and the matcher can not handle
it, as it only matches DAGs. There is a special case for DivMod in x86.
Load might be more complicated because it has memory edges.
> 2. Use of dot instructions to avoid an additional compare instruction;
I'm not familiar with dot instructions, I assume it's an arithmetic instruction
that sets a condition code?
Similarly as 1.) the dot instruction sets two (condition) registers. It would be possible to match
branch, compare and add into one MachNode. But you need all combinations ...
An alternative would be to add peephole rules for this, see ad file. We did not
explore them at all. They are run as a very last step, and you can rewrite
assembler sequences.
> 3. Improve Hoisting and CSE (Common Sub-expression Elimination) in some
> particular cases;
Yep, there is room for improvement. Probably a shared change.
> 4. Avoid unnecessary reload of constants;
The reload is probably the result of rematerialization during register
allocation. You might see that with PrintOptoAssembly. This is hard to
fix in the register allocation, might be possible in a pass thereafter.
Other of the cases you mention below are 'different' constants, i.e.,
different in type. Could be only removed by peephole optimization.
> 5. Eliminate some redundant register moves;
They might be due to different types, see below.
> 6. Eliminate some redundant compare sequences.
>
> I've just started to have a deeper look at it. A partner (Eldorado) is also
> helping to address these issues.
>
> I'm trying to get more familiar with the AD language since I believe most
> changes are related to the match rule in the IR and the instruction
> emission determined in the ppc.ad file. But I could not find a much
> comprehensive documentation on it, even on the Wiki (maybe on some
> book?).
>
> If you have any advice or any comment on were/what I should look firstly in
> the code (which phase on the compilation chain) I would very glad.
>
> Thank you!
>
>
> Kind regards,
> Gustavo
>
>
> [From MunninPageCursor]
> === For addressing, omit addis with 0, use the existing base register (r22
> here).
> 2244306 0.3227 : 3fff60240acc: addis r17,r22,0
> 361236 0.0519 : 3fff60240ad0: ld r17,0(r17)
You need to look at -XX:+PrintOptoAssembly.
I would assume that you are running in 32-bit compressed oop mode
(check -XX:+PrintCompressedOopsMode) and
r22 is a compressed oop. Then DecodeN is matched into AddP,
and r17 holds an oop. The IR doesn't grok that it's the same value
if it has different types. This should mostly be an issue in 32-bit oop mode,
with works only up to 4G heap. Run with -XX:HeapBaseMinAddress = 4G / 32G / 33G
to get the different compressed oop modes, which should be better with this.
The copies will then be replaced by the decoding.
It might also be affected by settings/transformations for NullCheckElimination.
We tried to implement an optimization for 32-bit compressed oop mode,
part of which are in openJdk. I never contributed the shared changes because
it showed no effect at all. It removed only 30% of the useless copies, and introduced
additional MachSpillCopies.
One should not issue DecodeN / EncodeP in the frontend at all, and add new
Ideal nodes to the frontend that do LoadN2P and StoreP2N. Then all
oop->cOop and cOop->oop copies would be gone. This is quite a shared
change, though. jdk10?
> === Use dot for extend sign operation instead of following with compare
> against 0.
> 390635 0.0562 : 3fff60240af0: extsb r17,r18
> 923383 0.1328 : 3fff60240af4: cmpwi cr6,r17,0
Must see the MachNodes for this. Might be possible to
match.
> === Use cmpdi instead of li/cmpd (r14 killed after this).
> 240944 0.0346 : 3fff60240bec: li r14,0
> 6943 1.0e-03 : 3fff60240bf0: cmpld cr6,r3,r14
Also must see PrintOptoAssembly. This should not happen. There are rules
for this, for example cmpI_reg_imm16
which matches a register (iRegIsrc src1) and a 16-bit immediate (immI16 src2), which itself
matches a ConI node (see decl of operand immI16) as inputs. The node matched is CmpI. It sets
a condition register (flagsReg crx). Look for CmpL, CmpP, CmpN etc in match() statement.
Those match ideal "class CmpLNode" etc.
The resulting node is "class cmpI_reg_imm16Node" in ad_ppc64.hpp. The match rule is
hard to read/find in the generated code.
> === Omit sign extension when comparing to 0, compare with value from zero
> extended load.
> 179491 0.0258 : 3fff60240aec: lbz r18,0(r15)
> 390635 0.0562 : 3fff60240af0: extsb r17,r18
> 923383 0.1328 : 3fff60240af4: cmpwi cr6,r17,0
> === Hoist CSE of constant load.
> (Original Sequence)
> 852288 0.1225 : 3fff60240c20: cmpwi cr6,r15,0
> 140 2.0e-05 : 3fff60240c24: beq cr6,3fff60240c30
> 1 1.4e-07 : 3fff60240c28: li r15,0
> : 3fff60240c2c: stb r15,0(r14)
> 1374509 0.1976 : 3fff60240c30: li r15,0
> 4224 6.1e-04 : 3fff60240c34: stw r15,60(r23)
>
> (Optimized Sequence):
> 852288 0.1225 : 3fff60240c20: cmpwi cr6,r15,0
> : 3fff60240c24: li r15,0 <if register selection can
> avoid collision with cmpwi's registers, other placement possible>
> : 3fff60240c28: beq cr6,3fff60240c34
> : 3fff60240c30: stb r15,0(r14)
> : 3fff60240c34: stw r15,60(r23)
The problem is that the first is Byte, the second is Int. Two different
constants for C2. You would have to transform
ConI(0) ConB(0)
| |
| |
useI useB
to
ConB(0)
/ \
/ \
ConvB2I useB
|
|
useI
But this will hinder other optimizations as matching ConI as immediate operand.
Basically, we implemented a scheduler on ia64. It did a lot, but the hoisting payed
off the most. But it required a lot of shared changes. So I think this would
have an effect on ppc, too. You can not express hoisting as match rule, though.
We also have a cse phase during gcm, that's pretty stand-alone, but
I can't tell whether it pays off.
Basically, the placement is done during gcm, and then register allocation
runs. After gcm, mostly only rematerialization moves code.
Final code 'movement' can be done by a intra-block scheduler (-XX:+OptoScheduling)
which is off as I encountered a wrong schedule, see c2_globals_ppc.hpp.
Also, for this the pipeline specifications in the ad file must be improved.
OptoScheduling and Peephole optimizations are exclusive.
> [From StoreNodeRelationshipCursor;nextChainStart]
> === Avoid reload of constant (no branches into the following code)
> 728853 0.1048 : 3fff60260f5c: li r15,0 ### Load constant
> into r17 instead of r15, since r15 is clobbered prior to last stw (...0f84)
> 6864 9.9e-04 : 3fff60260f60: stb r15,64(r17)
> 1 1.4e-07 : 3fff60260f64: li r14,0 ### Delete
> 539571 0.0776 : 3fff60260f68: std r14,72(r17) ### (Use r17)
> 4224 6.1e-04 : 3fff60260f6c: lis r14,0 ### Delete
> 262344 0.0377 : 3fff60260f70: std r14,48(r17) ### (use r17)
> 665350 0.0957 : 3fff60260f74: ld r14,40(r1)
> 4792 6.9e-04 : 3fff60260f78: lbz r15,93(r14)
> 199625 0.0287 : 3fff60260f7c: std r22,112(r1)
> 852368 0.1225 : 3fff60260f80: li r17,0 ### Delete
> 9512 0.0014 : 3fff60260f84: stw r17,88(r14)
> 232750 0.0335 : 3fff60260f88: ld r17,112(r1)
>
>
> [From RelationshipGroupStore;readRecord]
> === Use indexed load in place of add followed by load with 0 offset.
> 131733 0.0189 : 3fff6025b684: add r14,r20,r14 ### delete
> 359529 0.0517 : 3fff6025b688: lbz r14,0(r14) ### do lbz
> r14,r20,r14
>
>
> === Eliminate redundant move.
> 111964 0.0161 : 3fff6025b63c: ld r14,48(r22) ### load into r20 in
> the first place
> 266837 0.0384 : 3fff6025b640: extsw r18,r31
> 3965285 0.5701 : 3fff6025b644: mr r20,r14 ### delete
> 1104317 0.1588 : 3fff6025b648: add r14,r20,r18 ### (r14 killed here)
>
>
> [From StorePropertyCursor;next]
> === Eliminate redundant compare (note cmpwi below for values 7, 3, 1, with
> a repetition each)
> 18451 0.0027 : 3fff602f1b70: cmpwi cr5,r15,7
> : 3fff602f1b74: beq cr5,3fff602f1cbc
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1bc>
> 70972 0.0102 : 3fff602f1b78: cmpwi cr6,r15,7
> 14754 0.0021 : 3fff602f1b7c: bgt cr6,3fff602f1c30
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x130>
> 97107 0.0140 : 3fff602f1b80: cmpwi cr5,r15,3
> 33755 0.0049 : 3fff602f1b84: beq cr5,3fff602f1c1c
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x11c>
> 51 7.3e-06 : 3fff602f1b88: cmpwi cr6,r15,3
> 68819 0.0099 : 3fff602f1b8c: bgt cr6,3fff602f1bd0
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0xd0>
> : 3fff602f1b90: cmpwi cr5,r15,1
> : 3fff602f1b94: beq cr5,3fff602f1bbc
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0xbc>
> : 3fff602f1b98: cmpwi cr6,r15,1
> : 3fff602f1b9c: ble cr6,3fff602f1bb4
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0xb4>
>
> === Two things here:
> === 1) Redundant move "mr r3,r14" can be eliminated if load goes into r3 to
> begin with.
> === 2) the "ld r14,56(r1) repeated below can be treated as a CSE and done
> prior to the first compare (r14 is dead @ ...f1cc8)
> (The case at 3fff602f1bb4 doesn't require it, but since it is a load from
> the stack-frame its ok to perform)
> 18451 0.0027 : 3fff602f1b70: cmpwi cr5,r15,7
> : 3fff602f1b74: beq cr5,3fff602f1cbc
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1bc>
> 70972 0.0102 : 3fff602f1b78: cmpwi cr6,r15,7
> 14754 0.0021 : 3fff602f1b7c: bgt cr6,3fff602f1c30
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x130>
> 97107 0.0140 : 3fff602f1b80: cmpwi cr5,r15,3
> 33755 0.0049 : 3fff602f1b84: beq cr5,3fff602f1c1c
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x11c>
> 51 7.3e-06 : 3fff602f1b88: cmpwi cr6,r15,3
> 68819 0.0099 : 3fff602f1b8c: bgt cr6,3fff602f1bd0
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0xd0>
> : 3fff602f1b90: cmpwi cr5,r15,1
> : 3fff602f1b94: beq cr5,3fff602f1bbc
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0xbc>
> : 3fff602f1b98: cmpwi cr6,r15,1
> : 3fff602f1b9c: ble cr6,3fff602f1bb4
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0xb4>
> : 3fff602f1ba0: ld r14,56(r1) ###(2) CSE
> : 3fff602f1ba4: addis r14,r14,0 ### Delete
> : 3fff602f1ba8: ld r14,0(r14)
> : 3fff602f1bac: mr r3,r14 ### (1) Redundant
> move; load into r3 instead of r14
> : 3fff602f1bb0: b 3fff602f1cc8
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1c8>
> : 3fff602f1bb4: li r3,0
> : 3fff602f1bb8: b 3fff602f1cc8
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1c8>
> : 3fff602f1bbc: ld r14,56(r1) ###(2) CSE
> : 3fff602f1bc0: addis r14,r14,0 ### Delete
> : 3fff602f1bc4: ld r14,8(r14)
> : 3fff602f1bc8: mr r3,r14 ### (1) Redundant
> move; load into r3 instead of r14
> : 3fff602f1bcc: b 3fff602f1cc8
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1c8>
> 21303 0.0031 : 3fff602f1bd0: cmpwi cr5,r15,5
> 66514 0.0096 : 3fff602f1bd4: beq cr5,3fff602f1c08
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x108>
> 67 9.6e-06 : 3fff602f1bd8: cmpwi cr6,r15,5
> 183659 0.0264 : 3fff602f1bdc: ble cr6,3fff602f1bf4
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0xf4>
> 248 3.6e-05 : 3fff602f1be0: ld r14,56(r1) ###(2) CSE
> : 3fff602f1be4: addis r14,r14,0 ### Delete
> 22 3.2e-06 : 3fff602f1be8: ld r14,16(r14)
> 38188 0.0055 : 3fff602f1bec: mr r3,r14 ### (1)
> Redundant move; load into r3 instead of r14
> 6576 9.5e-04 : 3fff602f1bf0: b 3fff602f1cc8
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1c8>
> : 3fff602f1bf4: ld r14,56(r1) ###(2) CSE
> : 3fff602f1bf8: addis r14,r14,0 ### Delete
> : 3fff602f1bfc: ld r14,24(r14)
> : 3fff602f1c00: mr r3,r14 ### (1) Redundant
> move; load into r3 instead of r14
> : 3fff602f1c04: b 3fff602f1cc8
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1c8>
> 4754 6.8e-04 : 3fff602f1c08: ld r14,56(r1) ###(2) CSE
> : 3fff602f1c0c: addis r14,r14,0 ### Delete
> 6 8.6e-07 : 3fff602f1c10: ld r14,32(r14)
> 1011 1.5e-04 : 3fff602f1c14: mr r3,r14 ### (1) Redundant
> move; load into r3 instead of r14
> 6 8.6e-07 : 3fff602f1c18: b 3fff602f1cc8
> <Lorg/neo4j/kernel/impl/api/store/StorePropertyCursor;next()Z+0x1c8>
> : 3fff602f1c1c: ld r14,56(r1) ###(2) CSE
> : 3fff602f1c20: addis r14,r14,0 ### Delete
> : 3fff602f1c24: ld r14,40(r14)
> : 3fff602f1c28: mr r3,r14 ### (1) Redundant
> move; load into r3 instead of r14
> : 3fff602f1c2c: b 3fff602f1cc8
I don't really remember the issue with compares ... one thing is that
the register allocation can not spill the condition code. Therefore it must be able to
rematerialize the compare. Maybe I will remember ... I saw it
happen that Cmp results were reused somewhere.
> [RelationshipStore;readRecord]
> === Use andis. in place of following sequence.
> 845 1.2e-04 : 3fff6027e134: lis r18,7
> 36054 0.0052 : 3fff6027e138: and r18,r17,r18
Should be matched by andI_reg_uimm16 or similar.
> === Another redundant move case, slightly more spread.
> 21 3.0e-06 : 3fff6027de44: ld r14,48(r11) ### This load should
> target r28
> 201938 0.0290 : 3fff6027de48: addis r15,r24,0
> 56239 0.0081 : 3fff6027de4c: ld r15,16(r15)
> 698530 0.1004 : 3fff6027de50: lwa r15,32(r15)
> 51 7.3e-06 : 3fff6027de54: mr r28,r14 ### Delete
> 12361531 1.7772 : 3fff6027de58: extsw r14,r25 ### (Kills r14)
Yes, bad register allocation.
> === Use dot variant and eliminate compare with 0, must be ok with using cr7
> 301138 0.0433 : 3fff6027e06c: rldic r15,r15,10,22 ### Use dot
> variant into cr7
> 17978 0.0026 : 3fff6027e070: cmpdi cr5,r15,0 ### Delete - branch
> must use cr7 though
> === lis of 0 is same as li 0, eliminate redundancy.
> 40511 0.0058 : 3fff60285d9c: lis r15,0 ### Move "li r5,0"
> here (replacing lis)
> 123000 0.0177 : 3fff60285da0: std r15,48(r14) ### Change to use r5
> 42789 0.0062 : 3fff60285da4: ld r3,72(r1)
> 16111 0.0023 : 3fff60285da8: li r5,0 ### r5 must be set
> for call, so set it above and eliminate lis.
> 36 5.2e-06 : 3fff60285dac: bl 3fff60240a90
Again, different types of constants I assume. Candidate for Peephole.
General advice:
To understand adl: Look at the 'instruct' declarations.
Their names are pretty standardized, trying to say what they match.
The 'match' statement is the match tree, other parts are in
the operand specifications, which in general are trivial on ppc
(register or immediate).
Don't start with 32-bit compressed oops mode. It's anyways
only for small applications, as it's limited to < 4G heap.
On the other side, look at 32-bit cOops mode, it's not 1:1
what we run in SAP JVM, so it might be suboptimal :)
Look at IR dumps of the different phases. Oracle has a graph viewer,
(we have a different one, though) which dumps every phase.
Here you can see who messed it up. Was it there before register
allocation? before matching? This gives hints where to improve.
PrintOptoAssembly and PrintAssembly help too. (do you have the
disassembler library? You can download it here
http://cr.openjdk.java.net/~simonis/ppc-aix-port/index.html)
Phases currently not supported:
peephole optimization and OptoScheduling.
You can just switch on OptoScheduling, it should work for measurements
to see whether it's a useful target.
This is a lot of stuff, feel free to pick an issue and
continue on it, but not all at the same time please :)
Best regards,
Goetz.
More information about the hotspot-compiler-dev
mailing list