RFR: 8373696: AArch64: Refine post-call NOPs

Thu Dec 18 13:55:41 UTC 2025

On Tue, 16 Dec 2025 22:45:08 GMT, Ruben <duke at openjdk.org> wrote:

> Extend the MOVK-based scheme to MOVK/MOVZ, allowing 19 bits of metadata to be stored.
> 
> Choose the number of metadata slots in the post-call NOP sequence (1 or 2) depending on the offset from the CodeBlob header.
> 
> Additionally, implement ADR/ADRP-based metadata storage, which provides 22 bits instead of 16 bits for metadata. This can be enabled via the UsePostCallSequenceWithADRP option.
> 
> 
>  Renaissance 0.15.0 benchmark results (MOVK-based scheme)
>  Neoverse V1.
>  The runs were limited to 16 cores.
> 
>  Number of runs:
>    6 for baseline, 6 for the changes - interleaved pairs.
> 
>  Command line:
>   java -jar renaissance-jmh-0.15.0.jar \
>     -bm avgt -gc true -v extra \
>     -jvmArgsAppend '-Xbatch -XX:-UseDynamicNumberOfCompilerThreads \
>       -XX:-CICompilerCountPerCPU -XX:ActiveProcessorCount=16 \
>       -XX:CICompilerCount=2 -Xms8g -Xmx8g -XX:+AlwaysPreTouch \
>       -XX:+UseG1GC'
> 
>  The change is the geometric mean of ratios across the 6 pairs of runs.
> 
>   |  Benchmark                                            |  Change  | 90% CI for the change |
>   | ----------------------------------------------------- | -------- | --------------------- |
>   | org.renaissance.actors.JmhAkkaUct.run                 |  -0.215% |    -2.652% to  1.357% |
>   | org.renaissance.actors.JmhReactors.run                |  -0.166% |    -1.974% to  1.775% |
>   | org.renaissance.jdk.concurrent.JmhFjKmeans.run        |   0.222% |    -0.492% to  0.933% |
>   | org.renaissance.jdk.concurrent.JmhFutureGenetic.run   |  -1.880% |    -2.438% to -1.343% |
>   | org.renaissance.jdk.streams.JmhMnemonics.run          |  -0.500% |    -1.032% to  0.089% |
>   | org.renaissance.jdk.streams.JmhParMnemonics.run       |  -0.740% |    -2.092% to  0.639% |
>   | org.renaissance.jdk.streams.JmhScrabble.run           |  -0.031% |    -0.353% to  0.310% |
>   | org.renaissance.neo4j.JmhNeo4jAnalytics.run           |  -0.873% |    -2.323% to  0.427% |
>   | org.renaissance.rx.JmhRxScrabble.run                  |  -0.512% |    -1.121% to  0.049% |
>   | org.renaissance.scala.dotty.JmhDotty.run              |  -0.219% |    -1.108% to  0.708% |
>   | org.renaissance.scala.sat.JmhScalaDoku.run            |  -2.750% |    -6.426% to -0.827% |
>   | org.renaissance.scala.stdlib.JmhScalaKmeans.run       |   0.046% |    -0.383% to  0.408% |
>   | org.renaissance.scala.stm.JmhPhilosophers.run         |   1.497% |    -0.955% to  3.923% |
>   | org.renaissance.scala.stm.JmhScalaStmBench7.run       |  -0.096% |    -0.773% to  0.586% |
>   | org.renaissance.twitter.finagle.J...

On 18/12/2025 13:28, Ruben wrote:
> The |B.nv| might not be suitable in this case - I believe the branch 
> will be "always taken".

Ah, so it is. That's a shame.

> However, I had considered using |CBNZ XZR, <#imm>|. So far, I avoided 
> implementing it because it is unclear what performance effects this 
> might have.

I've tried it, and it's definitely a lot slower than a NOP or a MOVZ.
It would be nice if we could persuade the architecture people to give us
a NOP with payload, perhaps because "Intel has it."

But if we choose the split between CB offset and oopmap index wisely, we 
can avoid using a second instruction most of the time.
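
As a rough illustration of that trade-off, here is a minimal sketch, not
the code in the PR. It assumes the 19 bits of one MOVK/MOVZ slot come from
the 16-bit immediate, the 2-bit hw field, and one bit taken from the
MOVK-versus-MOVZ choice; the actual layout in the PR may differ. The names
pack_one_slot and encode_slot and the offset_bits parameter are
hypothetical. When a value does not fit its field, the caller would have
to fall back to a second slot.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Assumption: one slot carries 19 payload bits (imm16 + hw + MOVK/MOVZ bit).
constexpr unsigned kSlotBits = 19;

// Hypothetical helper: pack (cb_offset, oopmap_index) into a single slot,
// giving offset_bits to the CodeBlob offset and the remaining bits to the
// oopmap index. Returns std::nullopt when either value overflows its field,
// i.e. when a second slot would be needed.
std::optional<uint32_t> pack_one_slot(uint32_t cb_offset,
                                      uint32_t oopmap_index,
                                      unsigned offset_bits) {
  assert(offset_bits > 0 && offset_bits < kSlotBits);
  const unsigned index_bits = kSlotBits - offset_bits;
  if ((cb_offset >> offset_bits) != 0 || (oopmap_index >> index_bits) != 0) {
    return std::nullopt;
  }
  return (oopmap_index << offset_bits) | cb_offset;
}

// Hypothetical helper: encode a 19-bit payload as a 64-bit MOVZ or MOVK
// whose destination is XZR, so the instruction has no architectural effect.
uint32_t encode_slot(uint32_t payload) {
  assert((payload >> kSlotBits) == 0);
  const uint32_t imm16 = payload & 0xffff;
  const uint32_t hw = (payload >> 16) & 0x3;
  const bool use_movk = ((payload >> 18) & 0x1) != 0;
  const uint32_t base = use_movk ? 0xF2800000u   // MOVK Xd, #imm16, LSL #(16*hw)
                                 : 0xD2800000u;  // MOVZ Xd, #imm16, LSL #(16*hw)
  return base | (hw << 21) | (imm16 << 5) | 31u;  // Rd = 31 selects XZR
}
```

With an arbitrary 14/5 split, pack_one_slot(0x1234, 3, 14) fits in one
slot, while a CB offset of 0x4000 or an oopmap index of 32 does not.
Moving the boundary changes which call sites need the second instruction,
which is the tuning question raised above.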

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28855#issuecomment-3670392141