RFR: 8332689: RISC-V: Use load instead of trampolines [v4]
Fei Yang
fyang at openjdk.org
Wed Jun 5 07:06:57 UTC 2024
On Tue, 4 Jun 2024 16:34:34 GMT, Robbin Ehn <rehn at openjdk.org> wrote:
>> Hi all, please consider!
>>
>> Today we do JAL to **dest** if **dest** is in reach (+/- 1 MB).
>> Using a very small application or running very short time we have fast patchable calls.
>> But any normal application running longer will increase the code size and code chrun/fragmentation.
>> So whatever or not you get hot fast calls rely on luck.
>>
>> To be patchable and get code cache reach we also emit a stub trampoline which we can point the JAL to.
>> This would be the common case for a patchable call.
>>
>> Code stream:
>> JAL <trampo>
>> Stubs:
>> AUIPC
>> LD
>> JALR
>> <DEST>
>>
>>
>> On some CPUs L1D and L1I can't contain the same cache line, which means the tramopline stub can bounce from L1I->L1D->L1I, which is expensive.
>> Even if you don't have that problem having a call to a jump is not the fastest way.
>> Loading the address avoids the pitsfalls of cmodx.
>>
>> This patch suggest to solve the problems with trampolines, we take small penalty in the naive case of JAL to **dest**,
>> and instead do by default:
>>
>> Code stream:
>> AUIPC
>> LD
>> JALR
>> Stubs:
>> <DEST>
>>
>> An experimental option for turning trampolines back on exists.
>>
>> It should be possible to enhanced this with the WIP [Zjid](https://github.com/riscv/riscv-j-extension) by changing the JALR to JAL and nop out the auipc+ld (as the current proposal of Zjid forces the I-fetcher to fetch instruction in order (meaning we will avoid a lot issues which arm has)) when in reach and vice-versa.
>>
>> Numbers from VF2 (I have done them a few times, they are always overall in favor of this patch):
>>
>> fop (msec) 2239 | 2128 = 0.950424
>> h2 (msec) 18660 | 16594 = 0.889282
>> jython (msec) 22022 | 21925 = 0.995595
>> luindex (msec) 2866 | 2842 = 0.991626
>> lusearch (msec) 4108 | 4311 = 1.04942
>> lusearch-fix (msec) 4406 | 4116 = 0.934181
>> pmd (msec) 5976 | 5897 = 0.98678
>> jython (msec) 22022 | 21925 = 0.995595
>> Avg: 0.974112
>> fop(xcomp) (msec) 2721 | 2714 = 0.997427
>> h2(xcomp) ...
>
> Robbin Ehn has updated the pull request incrementally with one additional commit since the last revision:
>
> Move shart/far code to cpp
Hi, I only have several minor comments for now. Still looking at the code.
src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 330:
> 328:
> 329: enum RISCV_specific_constants {
> 330: return_address_offset = 3 * NativeInstruction::instruction_size, // ld auipc jalr
Suggestion about the code comment: `// auipc + ld + jalr`
src/hotspot/cpu/riscv/nativeInst_riscv.cpp line 636:
> 634: call = (NativeCall*)(return_address - NativeFarCall::return_address_offset);
> 635: } else {
> 636: call = (NativeCall*)(return_address - NativeShortCall::instruction_size);
Maybe it's better to have a `return_address_offset` (which equals `NativeShortCall::instruction_size`) const for `NativeShortCall` too?
src/hotspot/cpu/riscv/nativeInst_riscv.hpp line 118:
> 116: public:
> 117: enum {
> 118: instruction_size = 3 * Assembler::instruction_size,
It looks odd for this `instruction_size` `NativeCall` to have a size of 12 here. It should depends on whether it is a `NativeShortCall` for `NativeFarCall` at the bottom, right?
-------------
PR Review: https://git.openjdk.org/jdk/pull/19453#pullrequestreview-2098062531
PR Review Comment: https://git.openjdk.org/jdk/pull/19453#discussion_r1626976092
PR Review Comment: https://git.openjdk.org/jdk/pull/19453#discussion_r1627098898
PR Review Comment: https://git.openjdk.org/jdk/pull/19453#discussion_r1627095666
More information about the hotspot-dev
mailing list