[foreign-abi] Intrinsify down calls

Wed Mar 25 16:08:35 UTC 2020

Hi,

I have done some work looking at intrinsification to speed up calls.

For down calls there are 2 areas that can be improved:
- Rather than interpreting a binding recipe on every call, we can use 
MethodHandle combinators to create a specialized MethodHandle that 
executes the steps of the binding recipe.
- When inlining a native MethodHandle, C2 can emit a direct call to the 
target function, rather than using an intermediate buffer to store the 
arguments (while still borrowing some of the information it has on input 
and output registers).
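
To illustrate the first point, here is a minimal sketch that uses only 
java.lang.invoke. The class name BindingSketch, the identity/widen/narrow 
methods and the ARG_STEPS list are hypothetical stand-ins made up for this 
example; they are not the binding recipe types used in the branch, and 
the "native" function is just a plain Java method.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.List;
import java.util.function.UnaryOperator;

class BindingSketch {
    // Hypothetical stand-in for the target function: (long) -> long.
    static long identity(long x) { return x; }

    // Hypothetical binding steps: widen the argument, narrow the result.
    static long widen(int x) { return x; }
    static int narrow(long x) { return (int) x; }

    // "Interpreted" variant: a list of steps is walked on every call,
    // with boxing, which is hard for the JIT to see through.
    static final List<UnaryOperator<Object>> ARG_STEPS =
        List.of(v -> Long.valueOf(((Integer) v).longValue()));

    static int callInterpreted(int x) {
        Object v = x;
        for (UnaryOperator<Object> step : ARG_STEPS) {
            v = step.apply(v);
        }
        return narrow(identity((Long) v));
    }

    // "Specialized" variant: the same steps are composed once into a
    // single MethodHandle of type (int) -> int using combinators.
    static MethodHandle specialize() throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodHandle leaf = lookup.findStatic(BindingSketch.class, "identity",
            MethodType.methodType(long.class, long.class));
        MethodHandle widen = lookup.findStatic(BindingSketch.class, "widen",
            MethodType.methodType(long.class, int.class));
        MethodHandle narrow = lookup.findStatic(BindingSketch.class, "narrow",
            MethodType.methodType(int.class, long.class));
        MethodHandle h = MethodHandles.filterArguments(leaf, 0, widen); // (int) -> long
        return MethodHandles.filterReturnValue(h, narrow);             // (int) -> int
    }

    // A static final handle is treated as a constant by the JIT.
    static final MethodHandle SPECIALIZED;
    static {
        try {
            SPECIALIZED = specialize();
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static int callSpecialized(int x) {
        try {
            return (int) SPECIALIZED.invokeExact(x);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(callInterpreted(42));
        System.out.println(callSpecialized(42));
    }
}
```

Both variants compute the same result; the difference is that the 
composed handle has no per-call loop over the steps and no boxing, so 
the JIT has a much better chance of inlining the whole chain.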

I have an experimental implementation of this uploaded here: 
https://github.com/openjdk/panama-foreign/compare/foreign-abi...JornVernee:Call_Intrinsics

This is based on the ideas of linkToNative, but is mostly a re-write, 
since linkToNative did not support ABI customization. The new 
implementation also uses a MethodHandle as the fallback implementation 
until C2 kicks in, rather than generating a specialized stub eagerly.

Some numbers from the newly added CallOverhead benchmark [1]:

Benchmark                             Mode  Cnt   Score   Error  Units
CallOverhead.jni_blank                avgt   30   8.062 ± 0.153  ns/op
CallOverhead.jni_identity             avgt   30  12.360 ± 0.050  ns/op
CallOverhead.panama_blank             avgt   30   7.557 ± 0.025  ns/op
CallOverhead.panama_blank_trivial     avgt   30   1.619 ± 0.003  ns/op
CallOverhead.panama_identity          avgt   30  11.412 ± 0.023  ns/op
CallOverhead.panama_identity_trivial  avgt   30   4.298 ± 0.008  ns/op

NO_INTRINSICS:
Benchmark                             Mode  Cnt    Score   Error  Units
CallOverhead.jni_blank                avgt   30    7.963 ± 0.079  ns/op
CallOverhead.jni_identity             avgt   30   12.227 ± 0.027  ns/op
CallOverhead.panama_blank             avgt   30  193.799 ± 3.224  ns/op
CallOverhead.panama_identity          avgt   30  237.137 ± 1.150  ns/op

NO_SPEC:
Benchmark                             Mode  Cnt    Score   Error  Units
CallOverhead.jni_blank                avgt   30    8.064 ± 0.117  ns/op
CallOverhead.jni_identity             avgt   30   12.381 ± 0.072  ns/op
CallOverhead.panama_blank             avgt   30  193.705 ± 2.275  ns/op
CallOverhead.panama_identity          avgt   30  292.271 ± 3.344  ns/op

The NO_SPEC results at the bottom are the status quo. The NO_INTRINSICS 
results use only the Java side specialization, without the C2 
specialization, and the results at the top have everything enabled. 
I've also experimented with an attribute that can be added to a 
FunctionDescriptor in case the function is small/trivial, which removes 
the thread state transition; those are the *_trivial results. Note that 
most native functions do not qualify for turning off thread state 
transitions, so this is mostly to show the very minor difference (only 
6-7ns) in call overhead in case our target function is trivial.

For integrating this, I will probably split this work into 3 patches to 
make reviewing easier:
1. the CallOverhead benchmark
2. the Java side specialization
3. the C2 support

Cheers,
Jorn

[1] : 
https://github.com/openjdk/panama-foreign/compare/foreign-abi...JornVernee:Call_Intrinsics#diff-5234454e5c0aa31251dd12fbd3a10319