[foreign-abi] Intrinsify down calls

Mon Apr 6 11:28:44 UTC 2020

Impressive progress, Jorn!

I like how GraphKit::make_native_call() et al. shape and encapsulate 
the NativeEntryPoint-related logic.

I have some comments on native call representation in C2.

src/hotspot/cpu/x86/x86_64.ad:

+int MachCallNativeNode::ret_addr_offset() {
+  // FIXME return size of emitted code? What to do here?
+ return 0;
+}

MachCallNode::ret_addr_offset() is used to determine the proper instruction 
address for a safepoint, and it usually points to the instruction right 
after the call instruction.

It's not your case though: JNI calls go through a native stub which 
performs both the thread state transition and the call into the native 
entry point. The stub itself is treated specially, so there's no need to 
record a proper address for it, but in your case you have to. Otherwise, 
the JVM will have problems finding the relevant safepoint information.

It leads to the following suggestion: what do you think about inlining 
the stub only when the state transition is omitted? When it's not, a 
special stub is generated and used (which obeys the calling convention as 
much as possible). In both cases, you'll end up with a single call 
instruction in the generated code, but in the former case it calls 
directly into the native code while in the latter case it goes through 
the relevant stub.
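
To make the suggestion concrete, here is a rough sketch of how the target of that single call instruction could be chosen. NativeCallSite, its fields and call_target() are hypothetical names used only for illustration; they are not part of the patch or of C2.

```cpp
#include <cstdint>

// Hypothetical model of a native call site; not the actual C2 node layout.
struct NativeCallSite {
  std::uintptr_t native_entry;     // address of the target native function
  std::uintptr_t transition_stub;  // address of a stub that performs the thread state transition
  bool need_transition;            // false only when the transition is omitted
};

// Pick the address the single emitted call instruction should target.
inline std::uintptr_t call_target(const NativeCallSite& site) {
  // Without a transition the call goes straight into native code.
  // Otherwise it goes through the stub, which performs the transition
  // and then calls the native entry point itself.
  return site.need_transition ? site.transition_stub : site.native_entry;
}
```

Either way the generated code contains exactly one call instruction, so the return address is simply the address of the instruction following it.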

Another important question is how register conflicts between the VM and 
native code are handled, for example, when an argument/return value 
occupies a register which the VM uses for its own purposes (r12/r15 on 
x86_64).
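
As an illustration of the kind of check I have in mind, the sketch below reports which registers of a custom ABI description collide with the VM-reserved ones. The Reg enum, kVmReserved and conflicting_registers() are hypothetical names, not existing HotSpot code, and what to do with a reported conflict (reject the ABI, or save/restore around the call) is left open.

```cpp
#include <algorithm>
#include <array>
#include <vector>

// x86_64 general purpose registers (illustrative enum, not HotSpot's Register type).
enum class Reg { rax, rcx, rdx, rbx, rsp, rbp, rsi, rdi,
                 r8, r9, r10, r11, r12, r13, r14, r15 };

// Registers the VM keeps for its own purposes on x86_64:
// r12 (heap base when compressed oops are in use) and r15 (current thread).
constexpr std::array<Reg, 2> kVmReserved = {Reg::r12, Reg::r15};

// Return the argument/return registers of an ABI description that
// collide with VM-reserved registers, in the order they were given.
inline std::vector<Reg> conflicting_registers(const std::vector<Reg>& abi_regs) {
  std::vector<Reg> conflicts;
  for (Reg r : abi_regs) {
    if (std::find(kVmReserved.begin(), kVmReserved.end(), r) != kVmReserved.end()) {
      conflicts.push_back(r);
    }
  }
  return conflicts;
}
```

A standard System V style register set (rdi, rsi, rdx, rcx, r8, r9, rax) yields no conflicts here, whereas a customized ABI that passes a value in r12 or r15 would be flagged.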

Some minor comments:

src/hotspot/cpu/x86/x86_64.ad:

+// Unpack native results
+switch (_return_type) {

It's better to reflect "unpacking" in the IR (as a separate pure node) 
than to hard-code it into the call logic.

src/hotspot/share/opto/output.cpp:

-          if (mcall->is_MachCallLeaf()) {
+          if (mcall->is_MachCallLeaf() || (mcall->is_MachCallNative()
+              && !(mcall->as_MachCallNative()->_need_transition))) {
+              // skip observing safepoint below (needs JVMS)

Probably, it makes sense to capture CallNode::guaranteed_safepoint() 
during matching and use it instead here.
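
Roughly, the idea could look like the sketch below. The types are toy stand-ins (IdealCall, IdealLeafCall, IdealNativeCall, MachCall, match_call() and needs_safepoint_info() are hypothetical names, not the HotSpot classes), and it assumes a native call is a guaranteed safepoint only when it performs the thread state transition.

```cpp
// Toy stand-in for an ideal call node with a virtual safepoint query.
struct IdealCall {
  virtual ~IdealCall() = default;
  virtual bool guaranteed_safepoint() const { return true; }
};

// Leaf calls never observe a safepoint.
struct IdealLeafCall : IdealCall {
  bool guaranteed_safepoint() const override { return false; }
};

// Assumption: a native call is a safepoint only if it does the transition.
struct IdealNativeCall : IdealCall {
  bool need_transition;
  explicit IdealNativeCall(bool t) : need_transition(t) {}
  bool guaranteed_safepoint() const override { return need_transition; }
};

// Toy stand-in for the machine node: the answer is captured once during matching.
struct MachCall {
  bool guaranteed_safepoint;
};

inline MachCall match_call(const IdealCall& call) {
  return MachCall{call.guaranteed_safepoint()};
}

// The output phase then needs a single check instead of a per-node-type condition.
inline bool needs_safepoint_info(const MachCall& mcall) {
  return mcall.guaranteed_safepoint;
}
```

With something like this, the condition in output.cpp would not need to know about MachCallNative or _need_transition at all.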

Best regards,
Vladimir Ivanov


On 25.03.2020 19:08, Jorn Vernee wrote:
> Hi,
> 
> I have done some work looking at intrinsification to speed up calls.
> 
> For down calls there are 2 areas that can be improved:
> - Instead of interpreting a binding recipe for a call, we can use 
> MethodHandle combinators to create a specialized MethodHandle for 
> executing the steps of a binding recipe.
> - When inlining a native MethodHandle, C2 can emit a direct call to 
> the target function instead of using an intermediate buffer to store 
> the arguments (but borrowing some of the information it has on input and 
> output registers).
> 
> I have an experimental implementation of this uploaded here: 
> https://github.com/openjdk/panama-foreign/compare/foreign-abi...JornVernee:Call_Intrinsics 
> 
> 
> This is based on the ideas of linkToNative, but is mostly a re-write, 
> since linkToNative did not support ABI customization. The new 
> implementation also uses a MethodHandle as the fallback 
> implementation until C2 kicks in, rather than generating a specialized 
> stub eagerly.
> 
> Some numbers from the newly added CallOverhead benchmark [1]:
> 
> Benchmark                             Mode  Cnt   Score   Error  Units
> CallOverhead.jni_blank                avgt   30   8.062 ± 0.153  ns/op
> CallOverhead.jni_identity             avgt   30  12.360 ± 0.050  ns/op
> CallOverhead.panama_blank             avgt   30   7.557 ± 0.025  ns/op
> CallOverhead.panama_blank_trivial     avgt   30   1.619 ± 0.003  ns/op
> CallOverhead.panama_identity          avgt   30  11.412 ± 0.023  ns/op
> CallOverhead.panama_identity_trivial  avgt   30   4.298 ± 0.008  ns/op
> 
> NO_INTRINSICS:
> Benchmark                             Mode  Cnt    Score   Error  Units
> CallOverhead.jni_blank                avgt   30    7.963 ± 0.079  ns/op
> CallOverhead.jni_identity             avgt   30   12.227 ± 0.027  ns/op
> CallOverhead.panama_blank             avgt   30  193.799 ± 3.224  ns/op
> CallOverhead.panama_identity          avgt   30  237.137 ± 1.150  ns/op
> 
> NO_SPEC:
> Benchmark                             Mode  Cnt    Score   Error  Units
> CallOverhead.jni_blank                avgt   30    8.064 ± 0.117  ns/op
> CallOverhead.jni_identity             avgt   30   12.381 ± 0.072  ns/op
> CallOverhead.panama_blank             avgt   30  193.705 ± 2.275  ns/op
> CallOverhead.panama_identity          avgt   30  292.271 ± 3.344  ns/op
> 
> The NO_SPEC benchmarks at the bottom are the status quo, the 
> NO_INTRINSICS benchmarks only do the Java side specialization, but not 
> the C2 specialization, and the benchmarks at the top are with everything 
> enabled. I've also experimented with an attribute that can be added to 
> FunctionDescriptor in case the function is small/trivial, which removes 
> the thread state transition; these are the *_trivial results. Note that 
> most native functions do not qualify for turning off thread state 
> transitions, so this is mostly to show the very minor difference (only 
> 6-7ns) in call overhead in case our target function is trivial.
> 
> For integrating this, I will probably split this work into 3 patches to 
> make reviewing easier:
> 1. the CallOverhead benchmark
> 2. the Java side specialization
> 3. the C2 support
> 
> Cheers,
> Jorn
> 
> [1] : 
> https://github.com/openjdk/panama-foreign/compare/foreign-abi...JornVernee:Call_Intrinsics#diff-5234454e5c0aa31251dd12fbd3a10319 
> 
>