RFR: 8267968: [PPC64] Use prefixed load and addi instructions for better performance in POWER10 [v2]

Mon Jun 7 23:39:23 UTC 2021

On Sun, 6 Jun 2021 20:28:27 GMT, Kazunori Ogata <ogatak at openjdk.org> wrote:

>> The POWER10 processor supports prefixed load and addi instructions that have larger displacement field of up to 34-bits. We can reduce instruction cycles to load constant from TOC and load an immediate value to a register.
>> 
>> Assembler::{load|add}_const_optimized() and LoadCon[LPFD]Nodes are modified to use prefixed instructions, with fixing other functions that are affected by this change.
>> 
>> I ran jtreg test on both POWER10 and POWER8 machines by using "make test-tier1" and verified no additional fails by this change. I also ran DaCapo, Renaissance, and SPECjbb2015 on both of them and verified they run successfully.
>
> Kazunori Ogata has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Improve comments in macroAssembler_ppc.cpp

I didn't review the commit's functionality in detail: there are hundreds of details to check, and to be honest there's a lot I don't understand about working with C2.

Do you have a set of tests that check different sizes of immediate loads to guarantee you hit every case and emit the correct code?
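
As a rough sketch of what such a test could enumerate (the helper names below are hypothetical and are not existing HotSpot code), the interesting inputs are the values on either side of each signed range limit. The widths assumed here are 16 bits for plain addi/li, the signed 32-bit range mentioned in the ppc.ad comments, and 34 bits for the prefixed forms:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not HotSpot code): true if v fits in a signed
// two's-complement field that is `bits` wide.
inline bool fits_signed(int64_t v, int bits) {
  assert(bits >= 1 && bits <= 63);
  const int64_t lo = -(int64_t(1) << (bits - 1));
  const int64_t hi = (int64_t(1) << (bits - 1)) - 1;
  return lo <= v && v <= hi;
}

// Hypothetical helper: values just inside and just outside each assumed
// range, so that a test feeding them to the assembler crosses every
// case boundary at least once.
inline std::vector<int64_t> boundary_immediates() {
  std::vector<int64_t> vals = {0, 1, -1};
  for (int bits : {16, 32, 34}) {
    const int64_t lo = -(int64_t(1) << (bits - 1));
    const int64_t hi = (int64_t(1) << (bits - 1)) - 1;
    for (int64_t v : {lo - 1, lo, lo + 1, hi - 1, hi, hi + 1}) {
      vals.push_back(v);
    }
  }
  return vals;
}
```

A test could then load each value with and without a tmp register and compare the register contents against the expected constant.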

src/hotspot/cpu/ppc/assembler_ppc.cpp line 359:

> 357:         code_section()->scratch_emit()) {
> 358:       // Always emit a nop if the target is a scratch buffer, otherwise fill_buffer() may raise
> 359:       // an assertion failure because the size of actually generated code can be larger than that

size of the* actual* generated code

src/hotspot/cpu/ppc/assembler_ppc.cpp line 360:

> 358:       // Always emit a nop if the target is a scratch buffer, otherwise fill_buffer() may raise
> 359:       // an assertion failure because the size of actually generated code can be larger than that
> 360:       // in scratch_emit phase. A difference of code buffer addresses for the two phases can result

in the* scratch_emit phase.

src/hotspot/cpu/ppc/assembler_ppc.cpp line 362:

> 360:       // in scratch_emit phase. A difference of code buffer addresses for the two phases can result
> 361:       // in different number of nops for alignment. By emitting a nop before every paddi, we avoid
> 362:       // buffer overrun in acrual code generation phase.

a* buffer overrun in the* acrual->actual* code generation phase.

src/hotspot/cpu/ppc/assembler_ppc.cpp line 396:

> 394: 
> 395:   // pli can require a nop for alignement depending on the code address, so we don't use pli
> 396:   // when the caller expects the number of generated code is always the same.

the amount* of generated code ...
or
the size* of the* generated code ...
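
For readers unfamiliar with why the size can vary: as far as I understand the Power ISA, an 8-byte prefixed instruction may not cross a 64-byte boundary, so the emitted size depends on the address at which the instruction happens to land. A minimal sketch under that assumption (the function names are hypothetical, not the ones used in this patch):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true if an 8-byte prefixed instruction starting at
// `addr` would straddle a 64-byte boundary and so needs a nop in front.
// Instructions are 4-byte aligned, so offset 60 is the only such case.
inline bool prefixed_needs_nop(uint64_t addr) {
  assert((addr & 3) == 0);
  return (addr & 63) == 60;
}

// Hypothetical helper: bytes emitted for one pli at `addr`,
// counting the optional 4-byte alignment nop.
inline int pli_size_in_bytes(uint64_t addr) {
  return prefixed_needs_nop(addr) ? 12 : 8;
}
```

The same instruction therefore occupies 8 or 12 bytes depending on where the buffer starts, which is why a caller that needs a fixed code size cannot use pli.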

src/hotspot/cpu/ppc/assembler_ppc.cpp line 454:

> 452:           if (xd) { ori( d, d, (unsigned short)xd); }
> 453:         } else {
> 454:           // Exploit instruction level parallelism if we have a tmp register.

instruction-level (hyphenated)

src/hotspot/cpu/ppc/assembler_ppc.cpp line 600:

> 598:   // Case 3: Can use paddi. (However, paddi can require a nop for alignement depending
> 599:   //                         on the code address, so we don't use paddi when the caller
> 600:   //                         expects the number of generated code is always the same.

Same comment as earlier about "number" vs. "amount" or "size".

src/hotspot/cpu/ppc/ppc.ad line 6042:

> 6040: // costs do not prevent matching in this case. For that reason the
> 6041: // operand immL_NM with predicate(false) is used.
> 6042: // On Power 10 and up, this instruction is also used for larger offset upto signed 32-bit.

larger offsets*

src/hotspot/cpu/ppc/ppc.ad line 6327:

> 6325: // costs do not prevent matching in this case. For that reason the
> 6326: // operand immP_NM with predicate(false) is used.
> 6327: // On Power 10 and up, this instruction is also used for larger offset upto signed 32-bit.

offsets*

src/hotspot/cpu/ppc/ppc.ad line 6397:

> 6395: // costs do not prevent matching in this case. For that reason the
> 6396: // operand immF_NM with predicate(false) is used.
> 6397: // On Power 10 and up, this instruction is also used for larger offset upto signed 32-bit.

offsets*

src/hotspot/cpu/ppc/ppc.ad line 6472:

> 6470: // costs do not prevent matching in this case. For that reason the
> 6471: // operand immD_NM with predicate(false) is used.
> 6472: // On Power 10 and up, this instruction is also used for larger offset upto signed 32-bit.

offsets*

-------------

Changes requested by cashford (Author).

PR: https://git.openjdk.java.net/jdk/pull/4267