RFR: 8340241: RISC-V: Returns mispredicted

Wed Oct 9 13:25:35 UTC 2024

On Tue, 8 Oct 2024 12:54:35 GMT, Robbin Ehn <rehn at openjdk.org> wrote:

> Hi, please consider.
> 
> RISC-V don't have dedicated call/ret instructions.
> Instead the registers used in the jal/jalr instructions determine if this is a JUMP or CALL/RET.
> The cpu have a return-address stack where it stores return addresses for prediction.
> There are two possible calling conventions: x1 and x5 (or using both for co-routines).
> This stack is updated according this table (from unpriv manual, 2.5.1. Unconditional Jumps) for JALR:
> 
> | rd is x1/x5 | rs1 is x1/x5 | rd=rs1 | RAS action
> | ------------- | ------------- | ------------- |------------- |
> |No | No | — | None|
> |No | Yes | — | Pop|
> |Yes | No | — | Push|
> |Yes | Yes | No | Pop, then push|
> |Yes | Yes | Yes | Push|
> 
> And additionally:
> "A JAL instruction should push the return address onto a return-address stack (RAS) only when rd is 'x1' or x5."
> 
> As the JDK is using x5/(t0) as main scratch all plains jumps are actually calls and calls are co-routine calls (push and pop).
> This causes performance issues as the predictions is often wrong.
> 
> Average time for 10 best iterations (VF2):
> | Benchmark | Baseline (ms) | RAS fixed (ms) | Diff |
> |-- | -- | -- | -- |
> |future-genetic | 22126.6 | 20461.8 | -7.52%|
> |akka-uct | 97119.6 | 97498 | 0.39%|
> |movie-lens | 82359.3 | 81009.2 | -1.64%|
> |scala-doku | 29246.1 | 24518.6 | -16.16%|
> |chi-square | 10207.3 | 10624.9 | 4.09%|
> |fj-kmeans | 55127.9 | 56169.1 | 1.89%|
> |finagle-http | 24845 | 24891.9 | 0.19%|
> |reactors | 97473.9 | 96655.5 | -0.84%|
> |dec-tree | 8322.99 | 8243.11 | -0.96%|
> |naive-bayes | 79249.1 | 76851.9 | -3.02%|
> |als | 52678 | 51245.9 | -2.72%|
> |par-mnemonics | 52237.4 | 53149.8 | 1.75%|
> |scala-kmeans | 2990.88 | 2992.14 | 0.04%|
> |philosophers | 9156.9 | 7754.5 | -15.32%|
> |log-regression | 7621.65 | 7540.85 | -1.06%|
> |gauss-mix | 9835.7 | 9396.25 | -4.47%|
> |mnemonics | 73087.3 | 69426.6 | -5.01%|
> |dotty | 10970.9 | 10719.1 | -2.30%|
> |finagle-chirper | 23386.1 | 23630.3 | 1.04%|
> |recursive fibonacci | 7338.56 | 5369.83 | **-26.83%**|
> 
> For some of workloads, e.g. call to small function in a loop, it really matters.
> 
> This patch blacklist x5(/t0) for JAL/JALR as we only use x1 calling convention.
> And changes all jumps to use x6(/t1) instead of x5(/t0).
> This patch was incrementally done, i.e. the first change removed the default t0.
> I visited all places makings jumps, to make sure t1 was available.
> Then changed to default t1 and removed argument in many cases.
> 
> Other approaches was tested, e.g. completely switch t0 <-> t1.
> This was much harder and more intr...

Great finding. Apparently, we didn't realize such an impact of this prediction hints before. Let me try this on hardwares from other vendors to see.
BTW: Does this have anything to do with the use of x6(/t1) as flag register in C2? Seems to me it's all about encoding of x1 and x5 in jalr.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21406#issuecomment-2402314672