RFR(L): 8206139: [lworld] Re-enable calling convention optimizations

Mon Dec 17 14:33:41 UTC 2018

Hi,

I've ported the code to LW2:
http://cr.openjdk.java.net/~thartmann/8206139/webrev.01/

The new webrev also includes these changes:
- Handle nullable value types (no scalarization)
- Disable the calling convention with EnableValhallaC1 and issue a warning
- Frederic noticed that the code in InterpreterMacroAssembler::remove_activation() always unpacks
the returned value type, even if the method returns an Object (in which case the caller does not
require scalarization). I've optimized this by checking the result type of the ConstMethod.
- Added a test that calls a method that has circular register/stack dependencies when unpacking
value type arguments.

I've noticed that with Frederic's fix for 8214138 which removed the value type attribute consistency
checks, we also stopped throwing a NoClassDefFoundError when a value type argument cannot be
resolved at link time. The calling convention code expects all value type klasses of the signature
to be loaded at that time (Method::link_method -> ... -> AdapterHandlerLibrary::get_adapter0)
assuming that we would have failed already if the class could not be found. I've temporarily fixed
that by some code in method.cpp:1096.

If there are no objections, I would like to push this code and fix the remaining issues with
follow-up enhancements:
- Value type receivers are currently not scalarized because of interface calls. We need another
entry point that is used by interface call sites and only unpacks the receiver.
- Many of the TestMethodHandles tests don't fully optimize anymore because lambda forms are erased
to Object instead of __Value. I temporarily disabled IR verification for this test but we might be
able to do better (needs more investigation).

Thanks,
Tobias

On 30.11.18 16:59, Tobias Hartmann wrote:
> Hi,
> 
> please pre-review the following change that re-enables the calling convention optimizations:
> https://bugs.openjdk.java.net/browse/JDK-8206139
> http://cr.openjdk.java.net/~thartmann/8206139/webrev.00/
> 
> This change fixes, cleans and re-enables all the rotting code from MVT times.
> ValueTypePassFieldsAsArgs and ValueTypeReturnedAsFields are now re-enabled by default on x86_64.
> 
> The major difference to MVT is a new nmethod entry point that is used by method handle linkTo* calls
> to unpack value types passed as oops (see changes in MethodHandles::jump_from_method_handle()). This
> solves the problem where a compiled method with scalarized value type arguments could be called from
> compiled code through a linkTo* with value type arguments passed as oops.
> 
> If the option StressValueTypePassFieldsAsArgs is enabled, this entry is also used for normal i2c
> calls (instead of doing the unpacking in the i2c adapter). A similar entry point will be used for
> interface calls (see "Limitations" section below).
> 
> The difficulty here is that unpacking of value type arguments in the new entry point might require
> additional stack space. For example, let's assume we have a method that takes two value type
> arguments v1 and v2 of a type MyValue1 that has 4 fields f1 - f4.
> 
> The register/stack layout after method entry for an un-scalarized call would look like this:
> 
> (1)
>   rsi: v1
>   rdx: v2
>   rcx:
>    r8:
>    r9:
>   rdi:
>  0x18: ...    <- sp of caller
>  0x10: return_address
>  0x08: rbp
>  0x00: locals
> 
> And for the scalarized call it would look like this:
> 
> (2)
>   rsi: v1.f1
>   rdx: v1.f2
>   rcx: v1.f3
>    r8: v1.f4
>    r9: v2.f1
>   rdi: v2.f2
>  0x28: ...    <- sp of caller
>  0x20: v2.f4
>  0x18: v2.f3
>  0x10: return_address
>  0x08: rbp
>  0x00: locals
> 
> In this case, the scalarized calling convention requires two additional stack slots to pass v2.f3
> and v2.f4. Because we cannot overwrite stack slots in the caller frame, we need to extend the callee
> frame before unpacking.
> 
> I've investigated several different options like frameful adapters or internal wrapper methods that
> would call the entry point but came to the conclusion that the following approach is least invasive
> and least complex.
> 
> The scalarized stack layout now has a "reserved" entry that holds the return address of the
> un-scalarized calling convention and a 'sp_inc' value above the locals that contains the increment
> to repair the stack on exit:
> 
> (2*)
>   rsi: v1.f1
>   rdx: v1.f2
>   rcx: v1.f3
>    r8: v1.f4
>    r9: v2.f1
>   rdi: v2.f2
>  0x38: ...    <- sp of caller
>  0x30: [RESERVED] <- might contain return_address
>  0x28: v2.f4
>  0x20: v2.f3
>  0x18: return_address
>  0x10: rbp
>  0x08: sp_inc = 0x38
>  0x00: locals
> 
> To convert between (1) and (2*), the new entry point does the following:
> 
> [Verified Value Entry Point]
>  pop    %r13             ; save return address
>  sub    $0x20,%rsp       ; extend stack (make sure stack remains aligned)
>  push   %r13             ; copy return address to top of stack
>  ...                     ; unpack value type arguments
>  push   %rbp             ; save rbp
>  sub    $0x10,%rsp       ; create frame (constant size)
>  movq   $0x38,0x8(%rsp)  ; save stack increment 0x10 + size(rbp) + 0x20
>  jmpq   [Entry]
> 
> [Verified Entry Point]
>  push   %rbp             ; save rbp
>  sub    $0x10,%rsp       ; create frame (constant size)
>  movq   $0x18,0x8(%rsp)  ; save stack increment 0x10 + size(rbp)
> 
> [Entry]
>  ...                     ; method body
>  mov    0x10(%rsp),%rbp  ; restore rbp
>  add    0x8(%rsp),%rsp   ; repair stack
>  retq
> 
> This allows an efficient conversion between the calling conventions with minimal impact to
> scalarized code. The only difference is the epilog that used to have a constant stack increment:
> 
>  ..                     ; method body
>  add    $0x10,%rsp      ; repair stack
>  pop    %rbp            ; restore rbp
>  retq
> 
> Stack walking in frame::safe_for_sender and frame::sender_for_compiled_frame now needs to repair the
> stack to get a correct sender_sp if the current frame extended the stack. This is done in
> frame::repair_sender_sp() by adding the sp_inc if necessary. Deoptimization uses the same stack
> walking code and is therefore not affected.
> 
> Unfortunately, unpacking of value type arguments is not trivial because we have to be careful not to
> overwrite registers/stack slots that we still need to read from (and there might be circular
> dependencies that require spilling). I've solved this by keeping track of register/stack slot states
> and iteratively trying to unpack. If we don't succeed, we will start spilling (see
> MacroAssembler::unpack_value_args()). A better way would be to generate this code by C2 and let the
> register allocator find the best way to do it.
> 
> Significant changes were also required to get the location of the reserved slot in the argument
> list, re-compute offsets of the other stack arguments and make sure it's not used by compiled code.
> 
> Besides lots of refactoring, this fix also includes the following changes:
> - Implemented additional indirection for accessing (un-)pack_handler_offset in ValueKlassFixedBlock
> - Don't use store_heap_oop() in the adapters/entry because it trashes argument registers
> - Completely removed T_VALUETYPEPTR which was only used by JIT code
> - Re-enabled compilation of lambda forms as root of compilation
> - Set 'caller_must_gc_arguments' for the c2i adapter if it allocates value types because it might
> safepoint and trigger a GC (in this case the GC needs to know about the location of oop arguments
> passed to the c2i adapter)
> - InterpreterMacroAssembler::remove_activation() and
> TemplateInterpreterGenerator::generate_return_entry_for() need to check the return value for being a
> value type because we cannot statically tell anymore without a 'qtos' state.
> 
> Limitations:
> - The code was not yet ported to LW2.
> - Many of the TestMethodHandles tests don't fully optimize anymore because lambda forms are erased
> to Object instead of __Value. I temporarily disabled IR verification for this test but we might be
> able to do better (needs more investigation).
> - Value type receivers are currently not scalarized because of interface calls. We need another
> entry point that is used by interface call sites and only unpacks the receiver.
> - If NullableValueTypes are enabled, the calling convention is disabled. This will be fixed with the
> port to LW2.
> - *All* tests pass on Linux and Mac but I see some weird compilation failures on Windows.
> 
> I'll work on these limitations next.
> 
> Thanks,
> Tobias
>