RFR: 8340241: RISC-V: Returns mispredicted [v15]
Fei Yang
fyang at openjdk.org
Wed Oct 16 11:45:18 UTC 2024
On Wed, 16 Oct 2024 09:18:53 GMT, Robbin Ehn <rehn at openjdk.org> wrote:
>> Hi, please consider.
>>
>> RISC-V don't have dedicated call/ret instructions.
>> Instead the registers used in the jal/jalr instructions determine if this is a JUMP or CALL/RET.
>> The cpu have a return-address stack where it stores return addresses for prediction.
>> There are two possible calling conventions: x1 and x5 (or using both for co-routines).
>> This stack is updated according this table (from unpriv manual, 2.5.1. Unconditional Jumps) for JALR:
>>
>> | rd is x1/x5 | rs1 is x1/x5 | rd=rs1 | RAS action
>> | ------------- | ------------- | ------------- |------------- |
>> |No | No | — | None|
>> |No | Yes | — | Pop|
>> |Yes | No | — | Push|
>> |Yes | Yes | No | Pop, then push|
>> |Yes | Yes | Yes | Push|
>>
>> And additionally:
>> "A JAL instruction should push the return address onto a return-address stack (RAS) only when rd is 'x1' or x5."
>>
>> As the JDK is using x5/(t0) as main scratch all plains jumps are actually calls and calls are co-routine calls (push and pop).
>> This causes performance issues as the predictions is often wrong.
>>
>> Average time for 10 best iterations (VF2):
>> | Benchmark | Baseline (ms) | RAS fixed (ms) | Diff |
>> |-- | -- | -- | -- |
>> |future-genetic | 22126.6 | 20461.8 | -7.52%|
>> |akka-uct | 97119.6 | 97498 | 0.39%|
>> |movie-lens | 82359.3 | 81009.2 | -1.64%|
>> |scala-doku | 29246.1 | 24518.6 | -16.16%|
>> |chi-square | 10207.3 | 10624.9 | 4.09%|
>> |fj-kmeans | 55127.9 | 56169.1 | 1.89%|
>> |finagle-http | 24845 | 24891.9 | 0.19%|
>> |reactors | 97473.9 | 96655.5 | -0.84%|
>> |dec-tree | 8322.99 | 8243.11 | -0.96%|
>> |naive-bayes | 79249.1 | 76851.9 | -3.02%|
>> |als | 52678 | 51245.9 | -2.72%|
>> |par-mnemonics | 52237.4 | 53149.8 | 1.75%|
>> |scala-kmeans | 2990.88 | 2992.14 | 0.04%|
>> |philosophers | 9156.9 | 7754.5 | -15.32%|
>> |log-regression | 7621.65 | 7540.85 | -1.06%|
>> |gauss-mix | 9835.7 | 9396.25 | -4.47%|
>> |mnemonics | 73087.3 | 69426.6 | -5.01%|
>> |dotty | 10970.9 | 10719.1 | -2.30%|
>> |finagle-chirper | 23386.1 | 23630.3 | 1.04%|
>> |recursive fibonacci | 7338.56 | 5369.83 | **-26.83%**|
>>
>> For some of workloads, e.g. call to small function in a loop, it really matters.
>>
>> This patch blacklist x5(/t0) for JAL/JALR as we only use x1 calling convention.
>> And changes all jumps to use x6(/t1) instead of x5(/t0).
>> This patch was incrementally done, i.e. the first change removed the default t0.
>> I visited all places makings jumps, to make sure t1 was available.
>> Then changed to default t1 and removed argument in many...
>
> Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 18 additional commits since the last revision:
>
> - Merge branch 'master' into remove_t0
> - Allow x1 as Rs
> - Comments fixes
> - Merge branch 'master' into remove_t0
> - ICData move to t1, removed dead code
> - Revert clinit_barrier
> - Fixed no explicit use of default t1
> - Revert clinit_barrier t1
> - Upstream comment
> - Fixed no explicit use of default t1
> - ... and 8 more: https://git.openjdk.org/jdk/compare/6268c04f...52b02cb8
Thanks for the update. Seems that we missed the jump in `MacroAssembler::trampoline_call` [1]? I also witnessed another place where we missed killing the rflags after this change. See comment for details.
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp#L4272
src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 4082:
> 4080: Register tmp2,
> 4081: Register tmp3) {
> 4082: assert_different_registers(r_sub_klass, r_super_klass, tmp1, tmp2, tmp3, result, t0, t1);
This `MacroAssembler::verify_secondary_supers_table` is used by `MacroAssembler::lookup_secondary_supers_table` which is called by `instruct partialSubtypeCheckConstSuper` in riscv.ad. We should list the rflags as being killed for this instruct as well. As this is not obvious, we need to double check if there are other similar occurrences like this.
-------------
PR Review: https://git.openjdk.org/jdk/pull/21406#pullrequestreview-2372155596
PR Review Comment: https://git.openjdk.org/jdk/pull/21406#discussion_r1802929358
More information about the hotspot-dev
mailing list