RFR: 8332689: RISC-V: Use load instead of trampolines [v7]

Fri Jun 7 07:20:14 UTC 2024

On Thu, 6 Jun 2024 20:31:26 GMT, Hamlin Li <mli at openjdk.org> wrote:

>> Robbin Ehn has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Remove tmp file
>
> src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 987:
> 
>> 985:   int64_t distance = source - pc();
>> 986:   assert(is_simm32(distance), "Must be");
>> 987:   Assembler::auipc(temp, (int32_t)distance + 0x800);
> 
> Is it possible to use `jal` instead of the instruction sequence when is_simm21 == true as in jump_link?

Long story, sorry.

As this is a patchable call site, we need full reach for addresses patched in later, so this site must also be able to load 'n jump; hence we need cmodx.
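
For reference, the `+ 0x800` in the quoted `auipc` line is the usual rounding needed when a pc-relative distance is split between `auipc` and a following instruction with a sign-extended 12-bit immediate. A minimal sketch of that arithmetic, assuming the following `ld` consumes the low 12 bits; the names `PcRelParts`, `split_pc_relative` and `recombine` are hypothetical and not HotSpot API, and the edges of the 32-bit range are deliberately not handled:

```cpp
#include <cstdint>

// Hypothetical helper types and names, for illustration only.
struct PcRelParts {
  int32_t hi20;  // upper part, as it would go into auipc's 20-bit immediate
  int32_t lo12;  // signed 12-bit offset for the following load
};

inline PcRelParts split_pc_relative(int32_t distance) {
  // The low 12 bits are sign-extended by the load, so they cover [-2048, 2047].
  // Adding 0x800 before dropping the low bits rounds the upper part to the
  // nearest multiple of 4096, which keeps the remainder inside that range.
  // (Relies on arithmetic right shift for negative values, as on gcc/clang.)
  int64_t rounded = static_cast<int64_t>(distance) + 0x800;
  int32_t hi20 = static_cast<int32_t>(rounded >> 12);
  int32_t lo12 = static_cast<int32_t>(distance - static_cast<int64_t>(hi20) * 4096);
  return {hi20, lo12};
}

inline int64_t recombine(PcRelParts p) {
  // What auipc + a 12-bit signed offset add up to, relative to pc.
  return static_cast<int64_t>(p.hi20) * 4096 + p.lo12;
}
```

For example, a distance of 0x12345FFF splits into an upper part of 0x12346 and a low offset of -1, which recombine to the original distance.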

Today we do cmodx in a sequentially consistent manner.
This is done by emitting an IPI shootdown after every store to a published instruction stream.
The cost of an IPI is significant, as all CPUs need to flush everything and start over.
As we have tiny CPUs with few cores and little state, we don't really care much right now.
I have measured this overhead on VF2 at around 0.5% on some workloads.
But it will scale much worse than linearly as core count and complexity go up.

With this technique (IPI), using `jal` as you suggest would be possible.

As we need to change this for the bigger cores coming, and Zjid is delayed,
we are getting some kernel features, like setting up fence.i on context switches.
This means we can use fence.i in userspace and trust that the kernel will emit fence.i if the CPU changes after we emitted it.
This allows the writer to skip the IPI, at least in many cases.
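
A minimal sketch of what the userspace side might look like, assuming the kernel feature meant here is the Linux `PR_RISCV_SET_ICACHE_FLUSH_CTX` prctl documented in the kernel's RISC-V cmodx notes (that mapping is an assumption, not something stated in this thread). The function names are hypothetical, the fallback `#define`s are only for older headers, and on anything other than Linux/RISC-V the sketch does nothing:

```cpp
#if defined(__linux__)
#include <sys/prctl.h>
#endif

// Fallbacks for headers that predate this prctl; values assumed to match
// Linux's include/uapi/linux/prctl.h.
#ifndef PR_RISCV_SET_ICACHE_FLUSH_CTX
#define PR_RISCV_SET_ICACHE_FLUSH_CTX 71
#endif
#ifndef PR_RISCV_CTX_SW_FENCEI_ON
#define PR_RISCV_CTX_SW_FENCEI_ON 1
#endif
#ifndef PR_RISCV_SCOPE_PER_PROCESS
#define PR_RISCV_SCOPE_PER_PROCESS 0
#endif

// Hypothetical helper: ask the kernel to emit fence.i when a thread of this
// process is moved to another hart. Returns true only if the kernel accepted.
inline bool enable_fencei_on_migration() {
#if defined(__linux__) && defined(__riscv)
  return prctl(PR_RISCV_SET_ICACHE_FLUSH_CTX,
               PR_RISCV_CTX_SW_FENCEI_ON,
               PR_RISCV_SCOPE_PER_PROCESS) == 0;
#else
  return false;  // not applicable on this platform
#endif
}

// Hypothetical helper: synchronize this hart's instruction fetch with earlier
// stores. Only affects the hart that executes it; a no-op elsewhere.
inline void local_fence_i() {
#if defined(__riscv)
  __asm__ volatile("fence.i" ::: "memory");
#endif
}
```

A writer would call `enable_fencei_on_migration()` once at startup and fall back to the IPI path if it returns false. Threads on other harts still have to execute their own `fence.i` before running the patched code, which is why this only removes the IPI in some cases.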

When changing a series of instructions we need to know whether instruction fetching happens in-order.
Otherwise, consider:

`<nop> + <nop> + <jal immX>`
`<auipc immA> + <ld immB> + <jal immX>`

Now we flip the jal:
`<auipc immA> + <ld immB> + <jalr>`
But if these are not read in-order the I-fetcher might see:
`<auipc immA> + <nop>  + <jalr>`
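
The hazard can be sketched by brute force: a toy model, assuming each of the three slots may independently be observed in its original (`nop`/`nop`/`jal`) or final (`auipc`/`ld`/`jalr`) state. The names `Site`, `is_hazardous` and `hazardous_views` are hypothetical, and real fetch behaviour is of course more constrained than this enumeration:

```cpp
#include <array>
#include <string>
#include <vector>

// One observed view of the three instruction slots at the call site.
using Site = std::array<std::string, 3>;

// A view is hazardous if the jalr is visible but the address it jumps
// through was not fully set up by both auipc and ld.
inline bool is_hazardous(const Site& view) {
  return view[2] == "jalr" && !(view[0] == "auipc" && view[1] == "ld");
}

// Enumerate all mixes of stale and fresh slots and keep the hazardous ones.
inline std::vector<Site> hazardous_views() {
  const Site before = {"nop", "nop", "jal"};
  const Site after = {"auipc", "ld", "jalr"};
  std::vector<Site> result;
  for (unsigned mask = 0; mask < 8; ++mask) {
    Site view;
    for (unsigned slot = 0; slot < 3; ++slot) {
      // Bit set: the fetcher sees the new instruction in this slot.
      view[slot] = (mask & (1u << slot)) ? after[slot] : before[slot];
    }
    if (is_hazardous(view)) {
      result.push_back(view);
    }
  }
  return result;
}
```

In this model every view that still shows `jal` is harmless, since `auipc` and `ld` only write the temp register; the problematic views are exactly those where `jalr` is seen together with a stale `nop`.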

We could do this with IPI, but then we are more locked into IPI.
So until we have overhauled cmodx (we may need 3-4 approaches depending on CPU, if we want the best performance), I prefer not to add code which depends on a certain cmodx approach (when it's slow).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/19453#discussion_r1630756423