RFR: 8299162: Refactor shared trampoline emission logic [v3]

Andrew Dinn adinn at openjdk.org
Fri Feb 3 10:46:53 UTC 2023


On Fri, 3 Feb 2023 08:16:24 GMT, Xiaolin Zheng <xlinzheng at openjdk.org> wrote:

>> After the quick fix [JDK-8297763](https://bugs.openjdk.org/browse/JDK-8297763), the shared trampoline logic has become a bit verbose. If we switch to batch emission of trampoline stubs, pre-calculating the total size and pre-allocating the space, then we can remove the per-stub CodeBuffer expansion checks and clean up the surrounding code.
>> 
>> 
>> [Stub Code]
>> ...
>> <shared trampoline stub1, (A):>
>>   __ align()  // emit nothing or a 4-byte padding
>>               <-- (B) emit multiple relocations at the pc: __ relocate(<the pc here>, trampoline_stub_Relocation::spec())
>>   __ ldr()
>>   __ br()
>>   __ emit_int64()
>> <shared trampoline stub2, (C):>
>>   __ align()  // emit nothing or a 4-byte padding
>>               <-- emit multiple relocations at the pc: __ relocate(<the pc here>, trampoline_stub_Relocation::spec())
>>   __ ldr()    
>>   __ br()
>>   __ emit_int64()
>> <shared trampoline stub3:>
>>   __ align()  // emit nothing or a 4-byte padding
>>               <-- emit multiple relocations at the pc: __ relocate(<the pc here>, trampoline_stub_Relocation::spec())
>>   __ ldr()
>>   __ br()
>>   __ emit_int64()
>> 
>> 
>> Here, `pc_at_(C) - pc_at_(B)` is the fixed length `NativeCallTrampolineStub::instruction_size`; but `pc_at_(B) - pc_at_(A)` may be either 0 or 4, so it is not a fixed value.
>> 
>> So originally, the logic of the `emit()` lambda inside `emit_shared_trampolines()` when emitting a shared trampoline was:
>> 
>> We are at (A) -> 
>> do an align() -> 
>> We are at (B) -> 
>> emit lots of relocations bound to this shared trampoline at (B) -> 
>> do an emit_trampoline_stub() -> 
>> We are at (C)
>> 
>> 
>> After this patch:
>> 
>> We are at (A) -> 
>> do an emit_trampoline_stub(), which contains an align() already ->
>> We are at (C) directly ->
>> calculate the (B) address backwards from (C), since `pc_at_(C) - pc_at_(B)` is a fixed value -> 
>> emit lots of relocations bound to this shared trampoline at (B)
>> 
>> 
>> The result is theoretically the same; this is just a refactoring that lets us remove some internal checks and keep the code clean.
>> 
>> Tested AArch64 hotspot tier1~4 with a fastdebug build twice; tested RISC-V hotspot tier1~4 with a fastdebug build on hardware once.
>> 
>> 
>> Thanks,
>> Xiaolin
>
> Xiaolin Zheng has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
> 
>  - license
>  - Merge branch 'master' into shared-trampoline
>  - __
>  - Refactor and cleanup for shared trampolines

AArch64 changes look ok to me.

src/hotspot/cpu/aarch64/codeBuffer_aarch64.cpp line 71:

> 69: 
> 70:   assert(requests->number_of_entries() >= 1, "at least one");
> 71:   const int total_requested_size = MacroAssembler::trampoline_stub_size() * requests->number_of_entries();

There's a detail here that is worth mentioning. At first I was concerned that this calculation might overestimate the required size by one 32-bit word -- that happens when we are able to generate the first stub without the need for any alignment padding. However, that won't cause any unnecessary failure, because the rounding up will be from an odd to an even multiple of 32-bit words, i.e. adding this extra half-word will not push the request over into consuming an extra 64-bit word.

-------------

Marked as reviewed by adinn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/11749


More information about the hotspot-dev mailing list