RFR(L): 8185265 [MVT] improve performance of return of value types with new calling convention

Wed Aug 9 12:15:28 UTC 2017

Hi Roland,

nice work but lots of changes :) I tried to have a detailed look.

sharedRuntime_x86_64.cpp
- What is MacroAssembler* masm used for?
- Maybe call the buffer "value type transform" or "value type packing/unpacking" instead of "value types"

macro.cpp
- I think the comment in line 2643 should be adjusted because we may now either call into the runtime or into the 
pack_handler.
- There is a UseTLAB assert but I don't see where this is checked when adding the macro node. Also, we could fall back 
to the runtime call if UseTLAB is disabled (like we do in the interpreter), right?
- Your usage of "Node* n" vs. "Node *n" is a bit inconsistent (I prefer "Node* n")
- The "needgc" naming is a bit confusing, this isn't really about the GC but about if we have enough space in the TLAB.

castnode.cpp
- Please add a comment to the mask check explaining the two cases (returns as fields vs. returned as oop)

templateInterpreterGenerator_x86.cpp
- As I understand, a pack/unpack handler is always set if the value type can be returned in registers. Why do we need 
the pack_handler != 0 check? Isn't the __ testptr(rax, 1) sufficient?

ValueTypeTestBench.java:
- Why did you change COMP_LEVEL_ANY to -2?
- Should we enable StressValueTypeReturnedAsFields in our tests?

I've run this through JPRT and found the following crash during class unloading with some internal test on Solaris and 
Windows:

#  SIGSEGV (0xb) at pc=0xffff80fb73dff109, pid=28909, tid=19
#
# JRE version: OpenJDK Runtime Environment (10.0) (build 10-internal+0-2017-08-08-063310.tobias.valhalla)
# Java VM: OpenJDK 64-Bit Server VM (10-internal+0-2017-08-08-063310.tobias.valhalla, compiled mode, compressed oops, g1 
gc, solaris-amd64)
# Problematic frame:
# V  [libjvm.so+0xfff109]  void BufferBlob::free(BufferBlob*)+0xd9

Stack: [0xffff80ffb2efd000,0xffff80ffb2ffd000],  sp=0xffff80ffb2ffbe50,  free space=1019k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xfff109]  void BufferBlob::free(BufferBlob*)+0xd9
V  [libjvm.so+0xfabf43]  _$c1A.__1cPClassLoaderDataQvalue_classes_do6MpFpnKValueKlass__v_v_+0x33
V  [libjvm.so+0xfab033]  bool ClassLoaderDataGraph::do_unloading(BoolObjectClosure*,bool)+0x233
V  [libjvm.so+0x1ae61e6]  bool SystemDictionary::do_unloading(BoolObjectClosure*,bool)+0x16
V  [libjvm.so+0x11b91d7]  void G1MarkSweep::mark_sweep_phase1(bool&,bool)+0x417
V  [libjvm.so+0x11b8cfb]  void G1MarkSweep::invoke_at_safepoint(ReferenceProcessor*,bool)+0x7b
V  [libjvm.so+0x118a045]  bool G1CollectedHeap::do_full_collection(bool,bool)+0x465
V  [libjvm.so+0x1bd7d6c]  void VM_G1CollectFull::doit()+0x6c
V  [libjvm.so+0x1bd5d48]  void VM_Operation::evaluate()+0xd8
V  [libjvm.so+0x1bd3d61]  void VMThread::evaluate_operation(VM_Operation*)+0x161
V  [libjvm.so+0x1bd4589]  void VMThread::loop()+0x2d9
V  [libjvm.so+0x1bd3a3e]  void VMThread::run()+0x7e
V  [libjvm.so+0x1905060]  thread_native_entry+0x240
C  [libc.so.1+0x125221]  _thrp_setup+0xa5
C  [libc.so.1+0x1254c0]  _lwp_start+0x0

Best regards and greetings from Bangalore,
Tobias

On 05.08.2017 01:12, Roland Westrelin wrote:
> 
> (that patch includes some runtime/interpreter changes)
> 
> http://cr.openjdk.java.net/~roland/8185265/webrev.00/
> 
> JDK-8184795 disabled compilation of LFs as compilation root if they
> return a value type. The logic in CheckCastPPNode::Ideal() expects
> values to be returned in registers but if the called method is a lambda
> form, a value can be returned as a buffered value. I fixed the logic in
> CheckCastPPNode::Ideal() to properly handle return of a buffered value.
> 
> One of Sergey's micro benchmarks takes forever to warmup. I found it's
> related to the return of value types and, in particular, calling the
> runtime to pack/unpack values (i.e. load fields in registers on return,
> allocate a buffered value and initialize it once returned to the
> caller).  It causes a ~10x slow down during warmup. To address this, the
> VM now generates little pieces of assembly code to perform the packing
> or unpacking. On return, unpacking a buffered value is performed by
> calling the value klass's unpack unhandler. When returned in the caller,
> a fast path allocation is attempted for a new value from the TLAB (I
> left off heap buffers out for now). If that allocation succeeds the
> value klass pack handler is called an initializes the value. If the
> allocation fails, we fall back to the same runtime call that is used
> currently (that calls knows there can be live oops, where to look for
> them and how to preserve them across a GC).
> 
> C2 doesn't support calling to an address that is loaded in a register. I
> hacked support for it from the existing CallLeafNoFP node. It's somewhat
> ugly but I would say it's good enough for now.
> 
> With this change, AFAICT, the warmup issue caused by return of value
> types in Sergey's test is gone. It doesn't mean it warmups fast (we
> still have some disabled compilation of LF for another reason) but it's
> now decent.
> 
> Roland.
>