Pre-RFR(L): 8206139: [lworld] Re-enable calling convention optimizations

Fri Nov 30 15:59:32 UTC 2018

Hi,

please pre-review the following change that re-enables the calling convention optimizations:
https://bugs.openjdk.java.net/browse/JDK-8206139
http://cr.openjdk.java.net/~thartmann/8206139/webrev.00/

This change fixes, cleans and re-enables all the rotting code from MVT times.
ValueTypePassFieldsAsArgs and ValueTypeReturnedAsFields are now re-enabled by default on x86_64.

The major difference to MVT is a new nmethod entry point that is used by method handle linkTo* calls
to unpack value types passed as oops (see changes in MethodHandles::jump_from_method_handle()). This
solves the problem where a compiled method with scalarized value type arguments could be called from
compiled code through a linkTo* with value type arguments passed as oops.

If the option StressValueTypePassFieldsAsArgs is enabled, this entry is also used for normal i2c
calls (instead of doing the unpacking in the i2c adapter). A similar entry point will be used for
interface calls (see "Limitations" section below).

The difficulty here is that unpacking of value type arguments in the new entry point might require
additional stack space. For example, let's assume we have a method that takes two value type
arguments v1 and v2 of a type MyValue1 that has 4 fields f1 - f4.

The register/stack layout after method entry for an un-scalarized call would look like this:

(1)
  rsi: v1
  rdx: v2
  rcx:
   r8:
   r9:
  rdi:
 0x18: ...    <- sp of caller
 0x10: return_address
 0x08: rbp
 0x00: locals

And for the scalarized call it would look like this:

(2)
  rsi: v1.f1
  rdx: v1.f2
  rcx: v1.f3
   r8: v1.f4
   r9: v2.f1
  rdi: v2.f2
 0x28: ...    <- sp of caller
 0x20: v2.f4
 0x18: v2.f3
 0x10: return_address
 0x08: rbp
 0x00: locals

In this case, the scalarized calling convention requires two additional stack slots to pass v2.f3
and v2.f4. Because we cannot overwrite stack slots in the caller frame, we need to extend the callee
frame before unpacking.

I've investigated several different options like frameful adapters or internal wrapper methods that
would call the entry point but came to the conclusion that the following approach is least invasive
and least complex.

The scalarized stack layout now has a "reserved" entry that holds the return address of the
un-scalarized calling convention and a 'sp_inc' value above the locals that contains the increment
to repair the stack on exit:

(2*)
  rsi: v1.f1
  rdx: v1.f2
  rcx: v1.f3
   r8: v1.f4
   r9: v2.f1
  rdi: v2.f2
 0x38: ...    <- sp of caller
 0x30: [RESERVED] <- might contain return_address
 0x28: v2.f4
 0x20: v2.f3
 0x18: return_address
 0x10: rbp
 0x08: sp_inc = 0x38
 0x00: locals

To convert between (1) and (2*), the new entry point does the following:

[Verified Value Entry Point]
 pop    %r13             ; save return address
 sub    $0x20,%rsp       ; extend stack (make sure stack remains aligned)
 push   %r13             ; copy return address to top of stack
 ...                     ; unpack value type arguments
 push   %rbp             ; save rbp
 sub    $0x10,%rsp       ; create frame (constant size)
 movq   $0x38,0x8(%rsp)  ; save stack increment 0x10 + size(rbp) + 0x20
 jmpq   [Entry]

[Verified Entry Point]
 push   %rbp             ; save rbp
 sub    $0x10,%rsp       ; create frame (constant size)
 movq   $0x18,0x8(%rsp)  ; save stack increment 0x10 + size(rbp)

[Entry]
 ...                     ; method body
 mov    0x10(%rsp),%rbp  ; restore rbp
 add    0x8(%rsp),%rsp   ; repair stack
 retq

This allows an efficient conversion between the calling conventions with minimal impact to
scalarized code. The only difference is the epilog that used to have a constant stack increment:

 ..                     ; method body
 add    $0x10,%rsp      ; repair stack
 pop    %rbp            ; restore rbp
 retq

Stack walking in frame::safe_for_sender and frame::sender_for_compiled_frame now needs to repair the
stack to get a correct sender_sp if the current frame extended the stack. This is done in
frame::repair_sender_sp() by adding the sp_inc if necessary. Deoptimization uses the same stack
walking code and is therefore not affected.

Unfortunately, unpacking of value type arguments is not trivial because we have to be careful not to
overwrite registers/stack slots that we still need to read from (and there might be circular
dependencies that require spilling). I've solved this by keeping track of register/stack slot states
and iteratively trying to unpack. If we don't succeed, we will start spilling (see
MacroAssembler::unpack_value_args()). A better way would be to generate this code by C2 and let the
register allocator find the best way to do it.

Significant changes were also required to get the location of the reserved slot in the argument
list, re-compute offsets of the other stack arguments and make sure it's not used by compiled code.

Besides lots of refactoring, this fix also includes the following changes:
- Implemented additional indirection for accessing (un-)pack_handler_offset in ValueKlassFixedBlock
- Don't use store_heap_oop() in the adapters/entry because it trashes argument registers
- Completely removed T_VALUETYPEPTR which was only used by JIT code
- Re-enabled compilation of lambda forms as root of compilation
- Set 'caller_must_gc_arguments' for the c2i adapter if it allocates value types because it might
safepoint and trigger a GC (in this case the GC needs to know about the location of oop arguments
passed to the c2i adapter)
- InterpreterMacroAssembler::remove_activation() and
TemplateInterpreterGenerator::generate_return_entry_for() need to check the return value for being a
value type because we cannot statically tell anymore without a 'qtos' state.

Limitations:
- The code was not yet ported to LW2.
- Many of the TestMethodHandles tests don't fully optimize anymore because lambda forms are erased
to Object instead of __Value. I temporarily disabled IR verification for this test but we might be
able to do better (needs more investigation).
- Value type receivers are currently not scalarized because of interface calls. We need another
entry point that is used by interface call sites and only unpacks the receiver.
- If NullableValueTypes are enabled, the calling convention is disabled. This will be fixed with the
port to LW2.
- *All* tests pass on Linux and Mac but I see some weird compilation failures on Windows.

I'll work on these limitations next.

Thanks,
Tobias