RFR: 8332689: RISC-V: Use load instead of trampolines

Robbin Ehn rehn at openjdk.org
Wed May 29 14:28:07 UTC 2024


Hi all, please consider!

Today we emit a JAL directly to **dest** if **dest** is in reach (+/- 1 MB).
With a very small application, or one running for a very short time, this gives fast patchable calls.
But any normal application running longer will grow the code cache and increase code churn/fragmentation.
So whether or not you get hot fast calls relies on luck.

To be patchable and still get full code cache reach we also emit a trampoline stub which the JAL can point to.
This is the common case for a patchable call.

Code stream:
JAL <trampo>
Stubs:
AUIPC
LD
JALR
<DEST>
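
Spelled out, the trampoline form above is roughly the following (a sketch only; the labels, register choice, and relocation spelling are illustrative, not the exact code HotSpot emits):

```asm
        # code stream
        jal   ra, trampoline            # near call, +/- 1 MB reach, patchable

        # stub area
trampoline:
1:      auipc t0, %pcrel_hi(2f)         # pc-relative address of the destination word below
        ld    t0, %pcrel_lo(1b)(t0)     # data load from a line that also holds instructions
        jalr  zero, t0                  # tail-jump to the loaded destination
2:      .8byte target                   # patchable 8-byte destination
```

Note that the LD reads a word sitting next to the trampoline instructions, which is exactly what makes the same cache line ping-pong between L1I and L1D on cores that cannot keep it in both.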


On some CPUs the L1D and L1I caches can't contain the same cache line, which means the trampoline stub can bounce L1I->L1D->L1I, which is expensive.
Even if you don't have that problem, a call to a jump is not the fastest way.
Loading the address instead avoids the pitfalls of cmodx (concurrent modification and execution of code).

This patch solves the problems with trampolines by removing them: we take a small penalty in the naive case of a JAL directly to **dest**,
and instead emit by default:

Code stream:
AUIPC
LD
JALR
Stubs:
<DEST>
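
For illustration, the load-based far call might look like this (again a sketch; register choice and relocation spelling are assumptions):

```asm
        # code stream
1:      auipc t0, %pcrel_hi(dest_slot)  # upper 20 bits of pc-relative offset to the slot
        ld    t0, %pcrel_lo(1b)(t0)     # load the 8-byte target address (plain data load)
        jalr  ra, t0                    # indirect call to the loaded destination

        # stub area
dest_slot:
        .8byte target                   # patchable 8-byte destination
```

With this shape, re-targeting the call is a single aligned 64-bit data store into the slot; no instructions are rewritten, so no instruction-cache invalidation or cmodx handshake is needed on the patching path.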

An experimental option for turning trampolines back on exists.

It should be possible to enhance this with the WIP [Zjid](https://github.com/riscv/riscv-j-extension) extension by changing the JALR to a JAL and nop'ing out the AUIPC+LD when the destination is in reach, and vice versa. (The current Zjid proposal forces the I-fetcher to fetch instructions in order, meaning we will avoid a lot of the issues Arm has.)

Numbers from a VF2 (I have run them a few times; they are always overall in favor of this patch):

fop                                        (msec)    2239       |  2128       =  0.950424
h2                                         (msec)    18660      |  16594      =  0.889282
jython                                     (msec)    22022      |  21925      =  0.995595
luindex                                    (msec)    2866       |  2842       =  0.991626
lusearch                                   (msec)    4108       |  4311       =  1.04942
lusearch-fix                               (msec)    4406       |  4116       =  0.934181
pmd                                        (msec)    5976       |  5897       =  0.98678
jython                                     (msec)    22022      |  21925      =  0.995595
Avg:                                       0.974112                              
fop(xcomp)                                 (msec)    2721       |  2714       =  0.997427
h2(xcomp)                                  (msec)    37719      |  38004      =  1.00756
jython(xcomp)                              (msec)    28563      |  29470      =  1.03175
luindex(xcomp)                             (msec)    5303       |  5512       =  1.03941
lusearch(xcomp)                            (msec)    6702       |  6271       =  0.935691
lusearch-fix(xcomp)                        (msec)    6721       |  6217       =  0.925011
pmd(xcomp)                                 (msec)    6835       |  6587       =  0.963716
jython(xcomp)                              (msec)    28563      |  29470      =  1.03175
Avg:                                       0.99154                               
o.r.actors.JmhAkkaUct.run                  (ms/op)   8585.440   |  7548.347   =  0.879203
o.r.actors.JmhReactors.run                 (ms/op)   65004.694  |  64448.824  =  0.991449
o.r.jdk.concurrent.JmhFjKmeans.run         (ms/op)   47751.653  |  45747.490  =  0.958029
o.r.jdk.concurrent.JmhFutureGenetic.run    (ms/op)   12083.628  |  11427.650  =  0.945713
o.r.jdk.streams.JmhMnemonics.run           (ms/op)   32691.025  |  31002.088  =  0.948336
o.r.jdk.streams.JmhParMnemonics.run        (ms/op)   27500.431  |  23747.117  =  0.863518
o.r.jdk.streams.JmhScrabble.run            (ms/op)   3688.182   |  3528.943   =  0.956825
o.r.neo4j.JmhNeo4jAnalytics.run            (ms/op)   20153.371  |  21704.731  =  1.07698
o.r.rx.JmhRxScrabble.run                   (ms/op)   1197.749   |  1160.465   =  0.968872
o.r.scala.dotty.JmhDotty.run               (ms/op)   18385.552  |  18561.341  =  1.00956
o.r.scala.sat.JmhScalaDoku.run             (ms/op)   25243.887  |  22112.289  =  0.875946
o.r.scala.stdlib.JmhScalaKmeans.run        (ms/op)   2610.509   |  2498.539   =  0.957108
o.r.scala.stm.JmhPhilosophers.run          (ms/op)   5875.997   |  6101.689   =  1.03841
o.r.scala.stm.JmhScalaStmBench7.run        (ms/op)   8723.122   |  8760.115   =  1.00424
o.r.twitter.finagle.JmhFinagleChirper.run  (ms/op)   21209.541  |  21732.213  =  1.02464
o.r.twitter.finagle.JmhFinagleHttp.run     (ms/op)   20782.221  |  20390.960  =  0.981173
Avg:                                       0.9675            


It's been through a couple of t1-t3 runs, but I need to re-run testing after the latest merge.

-------------

Commit messages:
 - Remove accidental files
 - Remove accidental files
 - Baseline

Changes: https://git.openjdk.org/jdk/pull/19453/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=19453&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8332689
  Stats: 802 lines in 15 files changed: 595 ins; 103 del; 104 mod
  Patch: https://git.openjdk.org/jdk/pull/19453.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/19453/head:pull/19453

PR: https://git.openjdk.org/jdk/pull/19453

