RFR: 8340241: RISC-V: Returns mispredicted
Robbin Ehn
rehn at openjdk.org
Wed Oct 9 08:10:24 UTC 2024
Hi, please consider.
RISC-V doesn't have dedicated call/ret instructions.
Instead the registers used in the jal/jalr instructions determine if this is a JUMP or CALL/RET.
The CPU has a return-address stack (RAS) where it stores return addresses for prediction.
There are two possible calling conventions: x1 and x5 (or using both for co-routines).
This stack is updated according to this table (from the unprivileged manual, 2.5.1. Unconditional Jumps) for JALR:
| rd is x1/x5 | rs1 is x1/x5 | rd=rs1 | RAS action |
| ------------- | ------------- | ------------- |------------- |
|No | No | — | None|
|No | Yes | — | Pop|
|Yes | No | — | Push|
|Yes | Yes | No | Pop, then push|
|Yes | Yes | Yes | Push|
And additionally:
"A JAL instruction should push the return address onto a return-address stack (RAS) only when rd is x1 or x5."
As the JDK uses x5 (t0) as its main scratch register, all plain jumps are actually treated as calls or returns, and calls are treated as co-routine calls (pop, then push).
This causes performance issues, as the predictions are often wrong.
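The hint rules above can be sketched as a small classifier. This is only an illustration of the table and the JAL rule, not code from the patch; the names `RasAction`, `is_link_reg`, `jalr_ras_action` and `jal_ras_action` are hypothetical.

```cpp
#include <cassert>

// RAS hint implied by a jump, following the table from the unprivileged
// manual (2.5.1, Unconditional Jumps). Hypothetical names, not HotSpot code.
enum class RasAction { None, Pop, Push, PopThenPush };

// x1 (ra) and x5 (t0) are the two link registers the hints recognise.
inline bool is_link_reg(unsigned reg) {
  assert(reg < 32 && "RISC-V has 32 integer registers");
  return reg == 1 || reg == 5;
}

// JALR: the action depends on rd, rs1 and whether they are the same register.
inline RasAction jalr_ras_action(unsigned rd, unsigned rs1) {
  const bool rd_link = is_link_reg(rd);
  const bool rs1_link = is_link_reg(rs1);
  if (!rd_link) return rs1_link ? RasAction::Pop : RasAction::None;
  if (!rs1_link) return RasAction::Push;
  return rd == rs1 ? RasAction::Push : RasAction::PopThenPush;
}

// JAL: push only when rd is a link register.
inline RasAction jal_ras_action(unsigned rd) {
  return is_link_reg(rd) ? RasAction::Push : RasAction::None;
}
```

With these rules, `jalr_ras_action(0, 5)` (a plain jump through t0) yields `Pop`, and `jalr_ras_action(1, 5)` (a call through t0) yields `PopThenPush`. The same jumps through x6 (t1) yield `None` and `Push`.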
Average time for 10 best iterations:
| Benchmark | Baseline (ms) | RAS fixed (ms) | Diff |
|-- | -- | -- | -- |
|future-genetic | 22126.6 | 20461.8 | -7.52%|
|akka-uct | 97119.6 | 97498 | 0.39%|
|movie-lens | 82359.3 | 81009.2 | -1.64%|
|scala-doku | 29246.1 | 24518.6 | -16.16%|
|chi-square | 10207.3 | 10624.9 | 4.09%|
|fj-kmeans | 55127.9 | 56169.1 | 1.89%|
|finagle-http | 24845 | 24891.9 | 0.19%|
|reactors | 97473.9 | 96655.5 | -0.84%|
|dec-tree | 8322.99 | 8243.11 | -0.96%|
|naive-bayes | 79249.1 | 76851.9 | -3.02%|
|als | 52678 | 51245.9 | -2.72%|
|par-mnemonics | 52237.4 | 53149.8 | 1.75%|
|scala-kmeans | 2990.88 | 2992.14 | 0.04%|
|philosophers | 9156.9 | 7754.5 | -15.32%|
|log-regression | 7621.65 | 7540.85 | -1.06%|
|gauss-mix | 9835.7 | 9396.25 | -4.47%|
|mnemonics | 73087.3 | 69426.6 | -5.01%|
|dotty | 10970.9 | 10719.1 | -2.30%|
|finagle-chirper | 23386.1 | 23630.3 | 1.04%|
|recursive fibonacci | 7338.56 | 5369.83 | **-26.83%**|
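For reference, the Diff column is consistent with the relative change of the fixed time against the baseline. A minimal sketch of that arithmetic, where the helper name `percent_diff` is an assumption made for illustration:

```cpp
#include <cassert>

// Relative change in percent; negative means the RAS-fixed build is faster.
// Hypothetical helper, used only to illustrate how the Diff column reads.
inline double percent_diff(double baseline_ms, double fixed_ms) {
  assert(baseline_ms > 0.0 && "baseline time must be positive");
  return (fixed_ms - baseline_ms) / baseline_ms * 100.0;
}
```

For example, `percent_diff(7338.56, 5369.83)` gives the recursive fibonacci figure of roughly -26.83, and `percent_diff(10207.3, 10624.9)` gives the chi-square regression of roughly 4.09.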
For some workloads, e.g. a call to a small function in a loop, it really matters.
This patch blacklists x5 (t0) for JAL/JALR, as we only use the x1 calling convention.
It also changes all jumps to use x6 (t1) instead of x5 (t0).
This patch was incrementally done, i.e. the first change removed the default t0.
I visited all places making jumps, to make sure t1 was available.
Then changed to default t1 and removed argument in many cases.
Other approaches were tested, e.g. completely switching t0 <-> t1.
This was much harder and more intrusive as you need to do the switch completely in one go.
The use of x6(/t1) as flag register in C2 was luckily not an issue as RFLAGS is always killed when making a jump.
But please inspect this.
**Note** jumping to a label was a bit more tricky. To solve that, this patch uses only JAL when no register is supplied, which is now the default. We never jump to a label so far away that we need a longer range.
But please consider this carefully.
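The range assumption can be made concrete. JAL encodes a signed 21-bit byte offset that is a multiple of 2, so it reaches about ±1 MiB from the jump. The sketch below shows one way to check that. `fits_in_jal` and `checked_jal_offset` are hypothetical names, not functions from the patch or from HotSpot.

```cpp
#include <cassert>
#include <cstdint>

// True if a PC-relative byte offset is encodable in a single JAL:
// an even value in the range [-2^20, 2^20 - 2].
inline bool fits_in_jal(std::int64_t offset) {
  constexpr std::int64_t kMin = -(std::int64_t{1} << 20);
  constexpr std::int64_t kMax = (std::int64_t{1} << 20) - 2;
  return offset % 2 == 0 && offset >= kMin && offset <= kMax;
}

// Computes the offset from pc to target and asserts the "never too far"
// assumption instead of silently emitting an out-of-range jump.
inline std::int64_t checked_jal_offset(std::uint64_t pc, std::uint64_t target) {
  const auto offset = static_cast<std::int64_t>(target - pc);
  assert(fits_in_jal(offset) && "label is out of JAL range");
  return offset;
}
```

A debug-build assertion of this kind would turn a violated range assumption into an immediate failure rather than a wrong jump target.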
Secondly, **note** that CompiledICData was moved to x5 (t0), as x1 + x6 (ra/t1) are used for the call.
Please inspect this also, as a mistake here can go silently unnoticed while causing UV to go into the runtime on IC misses.
Arguably this is a performance bug, not an enhancement.
No issues found running tier1 to tier3 with fastdebug; I am re-testing more to make sure.
-------------
Commit messages:
- Draft
Changes: https://git.openjdk.org/jdk/pull/21406/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21406&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8340241
Stats: 148 lines in 19 files changed: 18 ins; 3 del; 127 mod
Patch: https://git.openjdk.org/jdk/pull/21406.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/21406/head:pull/21406
PR: https://git.openjdk.org/jdk/pull/21406