RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" [v3]

Tue Jun 11 09:59:18 UTC 2024

On Thu, 6 Jun 2024 15:39:22 GMT, Andrew Haley <aph at openjdk.org> wrote:

> On 6/6/24 13:42, Fei Gao wrote:
> 
> > Sorry, did you mean loading from base plus offset, like `ldr x0, [x6, #8]` or `ldr x0, [x6, x7]`, takes one more cycle than loading from base register only, like `ldr x0, [x6]`? Does the address addition take one cycle?
> 
> We know that, on many Arm cores, Store μOPs are split into address and data μOPs which are executed separately. That doesn't usually cause any additional delay, because cores execute many operations in parallel, so an address generation μOP for base+offset very probably will execute in parallel with some previous instructions, meaning that the target address is ready before it is needed. This split of address generation must happen regardless of whether a store (or a load) is a single instruction
> 
> `str x0, [x1, #80]`
> 
> or a pair of instructions
> 
> `add x8, x1, #80; str x0, [x8]`.
> 
> Of course, a pair of instructions occupies twice as much icache space, and you can run out of instruction decode bandwidth. However, in the case of Unsafe operations, I don't believe that an occasional unnecessary two-instruction operation will result in a performance regression.

Thanks for your kind explanation, @theRealAph. That makes sense to me. I'll continue working on this pull request to implement it.
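
For anyone following the thread, whether the one-instruction or the two-instruction form is needed comes down to whether the offset fits an immediate encoding. Below is a minimal sketch of that check, assuming only the architectural ranges for the scaled `ldr`/`str` form (unsigned 12-bit, multiple of the access size) and the unscaled `ldur`/`stur` form (signed 9-bit). The name `fits_ldst_immediate` is hypothetical and is not HotSpot's `offset_ok_for_immed`.

```cpp
#include <cstdint>

// Hypothetical helper, not HotSpot code: reports whether a byte offset can be
// encoded directly in an AArch64 load/store that accesses (1 << shift) bytes.
bool fits_ldst_immediate(int64_t offset, unsigned shift) {
    // Unscaled form (ldur/stur): signed 9-bit byte offset, any alignment.
    if (offset >= -256 && offset <= 255) {
        return true;
    }
    // Scaled form (ldr/str, unsigned offset): non-negative, a multiple of the
    // access size, and the scaled value must fit in a 12-bit field.
    const int64_t mask = (int64_t{1} << shift) - 1;
    return offset >= 0 && (offset & mask) == 0 && (offset >> shift) <= 4095;
}
```

Under these assumptions, `fits_ldst_immediate(80, 3)` holds, so `str x0, [x1, #80]` needs no separate `add`, whereas an 8-byte access at offset 257 does not fit either form and would need the address computed into a scratch register first.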

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-2160321273