RFR: 8285487: Do not generate trampolines for runtime calls if they are not needed [v3]

Fri Apr 29 21:24:46 UTC 2022

On Fri, 29 Apr 2022 19:32:36 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:

>> I'll update `MacroAssembler::trampoline_call` doc when it should be used. All runtime calls into `CodeCache` must use `far_call`. `far_call` is more compact and faster vs `trampoline_call`:
>> far call:
>> 
>> adrp
>> add
>> blr
>> 
>> trampoline call:
>> 
>> bl trampoline
>> trampoline:
>> ldr [embedded]
>> br
>> embedded:
>> 8 bytes of target address
>> 
>> 
>> What do you think?
>
> To be clear. `relocInfo::runtime_call_type` is used for stubs calls which are inside CodeCache and for VM runtime calls which are outside CodeCache. Depending on call's distance you can use all 3 types of instruction for all cases - even VM runtime method could be near CodeCache so you can use "near" call. So when you are asking what type of call should be generated you can get all 3 answers.
> 
> The optimization is that for other calls (inside CodeCache) you need only 2 types of call: far and near. And `MacroAssembler::far_call()` handles it correctly.
> 
> Back to `trampoline_call()` function.
> 1. The header comment is incorrect: "If the code cache is small trampolines won't be emitted."
> 2. New `assert(CodeCache::contains(entry.target())` is wrong.
> 3. I would like to know why `trampoline_call()` could be used for `relocInfo::virtual_call_type` and `relocInfo::opt_virtual_call_type` (based on assert). They should be inside CodeCache and you don't need trampoline for it. Right? Unless you generate code outside CodeCache.
> 4. I don't understand how call to trampoline is handled (lines 603-607) they do not reference `stub` value returned by `emit_trampoline_stub()`. Why not use `far_call()` to call trampoline code?
> 5. `target_needs_far_branch()` works only inside CodeCache. You need separate `target_needs_trampoline()` for `relocInfo::runtime_call_type`. New method should check distance to target vs CodeCache boundaries like we do for x86: https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/assembler_x86.cpp#L11701
> 
> I think the code could be like this:
> 
> 
>     if (entry.rspec().type() != relocInfo::runtime_call_type ||
>          CodeCache::find_blob(entry.target()) != NULL) {
>       // Call inside CodeCache
>       far_call(entry, cbuf, rscratch1);
>       return pc();
>     }
>     if (target_needs_trampoline(entry)) { // Check distance from CodeCache
>       // Generate trampoline to call outside CodeCache
>       stub = emit_trampoline_stub();
>       // Call to trampoline
>       far_call(stub, cbuf, rscratch1);
>     } else {
>       // Far call outside CodeCache.
>       // Near call can't be used because generated code could be posted in any place in CodeCache when installed.
>       adrp
>       add
>       blr
>     }
>     return pc();

Hi Vladimir,
Thank you for the detailed comment. 

> The optimization is that for other calls (inside CodeCache) you need only 2 types of call: far and near. And  MacroAssembler::far_call() handles it correctly.

Yes, I've checked the uses of `trampoline_call`. There are no calls outside CodeCache. They must be replaced with `far_call`. I did this and the debug build crashed. Debug builds reduce the branch range from 128M to 2M so that trampolines can be tested. A debug build with the 128M branch range works. With the 2M branch range, the debug build fails at:

ScopeDesc* CompiledMethod::scope_desc_at(address pc) {
   PcDesc* pd = pc_desc_at(pc);
   guarantee(pd != NULL, "scope must be present");
   return new ScopeDesc(this, pd);
}

with the stack:

V  [libjvm.so+0xa8f138]  CompiledMethod::scope_desc_at(unsigned char*)+0x40
V  [libjvm.so+0x1567290]  compiledVFrame::compiledVFrame(frame const*, RegisterMap const*, JavaThread*, CompiledMethod*)+0xc8
V  [libjvm.so+0x155f1f0]  vframe::new_vframe(frame const*, RegisterMap const*, JavaThread*)+0xd8
V  [libjvm.so+0xaeb574]  Deoptimization::uncommon_trap_inner(JavaThread*, int)+0x1b4
V  [libjvm.so+0xaeca44]  Deoptimization::uncommon_trap(JavaThread*, int, int)+0x24

> I would like to know why trampoline_call() could be used for relocInfo::virtual_call_type and relocInfo::opt_virtual_call_type (based on assert). They should be inside CodeCache and you don't need trampoline for it. Right? 

If CodeCache is bigger than 128M, we need a trampoline for them.
We generate:

bl trampoline
trampoline:

Initially, we have a call of `resolve_virtual_call_stub`. If the stub finds a compiled method, it patches:
- `bl` if the found method is near.
- `trampoline` if the found method is far.
For the 240M CodeCache, we patch `trampoline` when a C2 compiled method calls a C1 compiled method or vice versa. When a C2 compiled method calls a C2 compiled method, or a C1 compiled method calls a C1 compiled method, we have a normal direct call.
If we used a `far_call` there, we would always have an indirect call because a call site would be:

adrp
add
blr

So the use of `trampoline_call` here is an optimization.
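
To make the near/far patching decision above concrete, here is a minimal standalone sketch. It is not HotSpot code: `reachable_by_bl`, `choose_patch`, `PatchKind` and the constants are hypothetical names for illustration. The only facts it relies on are the `bl` reach of +/-128M and the reduced 2M range used by debug builds.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the patching decision; none of these names exist in HotSpot.
constexpr int64_t kBranchRange      = 128LL * 1024 * 1024; // reach of an AArch64 `bl`
constexpr int64_t kDebugBranchRange = 2LL * 1024 * 1024;   // reduced range used to test trampolines

enum class PatchKind { DirectBl, Trampoline };

// True if `target` is within the signed reach of a `bl` located at `call_site`.
inline bool reachable_by_bl(uint64_t call_site, uint64_t target,
                            int64_t range = kBranchRange) {
  assert(range > 0);
  const int64_t offset = static_cast<int64_t>(target - call_site);
  return offset >= -range && offset < range;
}

// Near target: patch the `bl` itself. Far target: leave `bl trampoline` in place
// and patch the address embedded in the trampoline instead.
inline PatchKind choose_patch(uint64_t call_site, uint64_t target,
                              int64_t range = kBranchRange) {
  return reachable_by_bl(call_site, target, range) ? PatchKind::DirectBl
                                                   : PatchKind::Trampoline;
}
```

In this model, a callee 64M away resolves to a direct `bl`, while a callee 200M away (possible in a 240M CodeCache) goes through the trampoline. A `far_call` site has no such choice: it is always `adrp`/`add`/`blr`.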

> I don't understand how call to trampoline is handled (lines 603-607) they do not reference stub value returned by emit_trampoline_stub(). Why not use far_call() to call trampoline code?

And they should not. At this point the code is still in a `CodeBuffer`, and calls are not linked yet. The trampoline code is put into the stub code section of the `CodeBuffer`. We insert a self-calling instruction. When we move the generated code into `CodeCache`, we link the calls by patching them based on `relocInfo` records. I think during linking we try to bypass trampolines if possible.
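
A toy model of this emit-then-link flow, as a rough sketch only. `ToyCodeBuffer`, `CallReloc`, `emit_call` and `install` are hypothetical names, not HotSpot classes, and keeping the stub slot in the same instruction array is a simplification. The `bl` encoding (`0x94000000` plus a 26-bit word offset) is the standard A64 one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int64_t kBlRange = 128LL * 1024 * 1024; // reach of an AArch64 `bl`

// Encode `bl` with a byte offset relative to the instruction itself.
inline uint32_t encode_bl(int64_t byte_offset) {
  assert(byte_offset % 4 == 0);
  return 0x94000000u | (static_cast<uint32_t>(byte_offset / 4) & 0x03FFFFFFu);
}

// Hypothetical stand-in for a relocInfo record: which instruction is the call,
// where its trampoline slot lives, and the final target address.
struct CallReloc {
  std::size_t call_index;
  std::size_t stub_index;
  uint64_t target;
};

struct ToyCodeBuffer {
  std::vector<uint32_t> insts;
  std::vector<CallReloc> relocs;

  // Emit an unlinked call: a `bl` to itself plus a record for later patching.
  void emit_call(uint64_t target, std::size_t stub_index) {
    relocs.push_back({insts.size(), stub_index, target});
    insts.push_back(encode_bl(0));
  }
};

// Copy the code to its final address `base` and link every recorded call.
// A target in range gets a direct `bl`; otherwise the `bl` goes to the stub slot.
inline std::vector<uint32_t> install(const ToyCodeBuffer& buf, uint64_t base) {
  std::vector<uint32_t> code = buf.insts;
  for (const CallReloc& r : buf.relocs) {
    const uint64_t call_pc = base + 4 * r.call_index;
    const int64_t direct = static_cast<int64_t>(r.target - call_pc);
    const bool is_near = direct >= -kBlRange && direct < kBlRange;
    const int64_t to_stub = 4 * (static_cast<int64_t>(r.stub_index) -
                                 static_cast<int64_t>(r.call_index));
    code[r.call_index] = encode_bl(is_near ? direct : to_stub);
  }
  return code;
}
```

The point of the model is that the call emitted into the buffer does not need the stub address: the final branch offset is only computed once the install address is known.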

-------------

PR: https://git.openjdk.java.net/jdk/pull/8403