RFR: 8255553: [PPC64] Introduce and use setbc and setnbc P10 instructions [v7]
Corey Ashford
github.com+51754783+coreyashford at openjdk.java.net
Mon Nov 16 21:06:09 UTC 2020
On Thu, 5 Nov 2020 13:30:11 GMT, Ziviani <github.com+670087+jrziviani at openjdk.org> wrote:
>> - setbc RT,BI: sets RT to 1 if CR(BI) is 1, otherwise 0.
>> - setnbc RT,BI: sets RT to -1 if CR(BI) is 1, otherwise 0.
>> Ref: PowerISA 3.1, page 129.
>>
>> These instructions are particularly interesting to improve the following
>> pattern `(src1<src2)? -1: ((src1>src2)? 1: 0)`, which can be found in
>> `instruct cmpL3_reg_reg_ExEx()@ppc.ad`, by removing its branches.
>>
>> Long.toString, that generate such pattern in getChars, has showed a
>> good performance gain by using these new instructions.
>>
>> Example:
>> for (int i = 0; i < 200_000; i++)
>> res = Long.toString((long)i);
>>
>> java -Xcomp -XX:CompileThreshold=1 -XX:-TieredCompilation TestToString
>>
>> Without setbc (average): 0.1178 seconds
>> With setbc (average): 0.0396 seconds
>
> Ziviani has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:
>
> 8255553: [PPC64] Introduce and use setbc and setnbc P10 instructions
>
> - setbc RT,BI: sets RT to 1 if CR(BI) is 1, otherwise 0.
> - setnbc RT,BI: sets RT to -1 if CR(BI) is 1, otherwise 0.
> Ref: PowerISA 3.1, page 129.
>
> These instructions are particularly interesting to improve the following
> pattern `(src1<src2)? -1: ((src1>src2)? 1: 0)`, which can be found in
> `instruct cmpL3_reg_reg_ExEx()@ppc.ad`, by removing its branches.
>
> Long.toString, that generate such pattern in getChars, has showed a
> good performance gain by using these new instructions.
>
> Example:
> for (int i = 0; i < 200_000; i++)
> res = Long.toString((long)i);
>
> java -Xcomp -XX:CompileThreshold=1 -XX:-TieredCompilation TestToString
>
> Without setbc (average): 0.1178 seconds
> With setbc (average): 0.0396 seconds
Looks great overall! The removal of branches is a big win.
src/hotspot/cpu/ppc/assembler_ppc.inline.hpp line 387:
> 385: { emit_int32(SETBC_OPCODE | rt(d) | bi(biint)); }
> 386: inline void Assembler::setbc(Register d, ConditionRegister cr, Condition cc) {
> 387: setbc(d, bi0(cr, cc));
Indentation here should be 2 spaces.
src/hotspot/cpu/ppc/assembler_ppc.inline.hpp line 392:
> 390: { emit_int32(SETNBC_OPCODE | rt(d) | bi(biint)); }
> 391: inline void Assembler::setnbc(Register d, ConditionRegister cr, Condition cc) {
> 392: setnbc(d, bi0(cr, cc));
Indentation here should be 2 spaces.
src/hotspot/cpu/ppc/macroAssembler_ppc.hpp line 164:
> 162: // branch, jump
> 163: //
> 164: // set dst to -1, 0, +1
Comment should be something like:
set dst to -1, 0, +1, as follows: (some description)
src/hotspot/cpu/ppc/macroAssembler_ppc.inline.hpp line 239:
> 237: }
> 238:
> 239: // set dst to -1, 0, +1
Comment should be something like:
set dst to -1, 0, +1, as follows: (some description)
src/hotspot/cpu/ppc/ppc.ad line 11425:
> 11423: match(Set dst (CmpL3 src1 src2));
> 11424: effect(KILL cr0);
> 11425: ins_cost(DEFAULT_COST * 5);
Should this depend on P10 vs. P9 since the instruction cost changes by 1 ?
src/hotspot/cpu/ppc/ppc.ad line 11760:
> 11758: match(Set dst (CmpF3 src1 src2));
> 11759: effect(KILL cr0);
> 11760: ins_cost(DEFAULT_COST * 6);
Should this depend on P10 vs. P9 because of the different number of instructions needed? Maybe an approx. value is enough when other paths can't come close to competing.
src/hotspot/cpu/ppc/ppc.ad line 11844:
> 11842: match(Set dst (CmpD3 src1 src2));
> 11843: effect(KILL cr0);
> 11844: ins_cost(DEFAULT_COST * 6);
Same question here about P10 vs. P9 regarding cost
src/hotspot/cpu/ppc/templateTable_ppc_64.cpp line 1613:
> 1611: // if unordered_result is 1, treat unordered_result like 'greater than'
> 1612: assert(unordered_result == 1 || unordered_result == -1, "only supported");
> 1613: __ set_cmpu3(R17_tos, (unordered_result == 1) ? false : true);
instead of `(unordered_result == 1) ? false : true`
how about `unordered_result != 1`
src/hotspot/cpu/ppc/templateTable_ppc_64.cpp line 1612:
> 1610: __ fcmpu(CCR0, Rfirst, Rsecond); // compare
> 1611: // if unordered_result is 1, treat unordered_result like 'greater than'
> 1612: assert(unordered_result == 1 || unordered_result == -1, "only supported");
The assertion error "only supported" is unclear to me. Is there precedent for this kind of message?
src/hotspot/cpu/ppc/macroAssembler_ppc.inline.hpp line 251:
> 249: srawi(R0, R0, 31);
> 250: }
> 251: orr(dst, dst, R0);
I think this section could use a bit more detail in the comments as to what's going on. I know comments are missing from the original code too, but as it is, it's clever but a bit obtuse.
-------------
Changes requested by CoreyAshford at github.com (no known OpenJDK username).
PR: https://git.openjdk.java.net/jdk/pull/907
More information about the hotspot-dev
mailing list