[foreign-abi] Intrinsify down calls
Jorn Vernee
jorn.vernee at oracle.com
Wed Mar 25 16:08:35 UTC 2020
Hi,
I have done some work looking at intrinsification to speed up calls.
For down calls there are 2 areas that can be improved:
- Instead of interpreting a binding recipe for a call, we can use
MethodHandle combinators to create a specialized MethodHandle for
executing the steps of a binding recipe.
- When inlining a native MethodHandle, C2 can emit a direct call to the
target function, rather than using an intermediate buffer to store the
arguments (but borrowing some of the information it has on input and
output registers).
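To illustrate the first point, here is a minimal sketch using only java.lang.invoke. It is not the code from the branch: the class BindingRecipeSketch, the target and widen methods, and the RECIPE list are hypothetical stand-ins, with a single int-to-long widening playing the role of a binding step. The interpreted path walks the recipe on every call, while the specialized path folds the same steps into one MethodHandle up front with MethodHandles.filterArguments.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.List;
import java.util.function.Function;

class BindingRecipeSketch {
    static final MethodHandle TARGET; // (long, long) -> long
    static final MethodHandle WIDEN;  // (int) -> long

    static {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            TARGET = lookup.findStatic(BindingRecipeSketch.class, "target",
                    MethodType.methodType(long.class, long.class, long.class));
            WIDEN = lookup.findStatic(BindingRecipeSketch.class, "widen",
                    MethodType.methodType(long.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Hypothetical stand-in for the low-level call target (assumption, not a real native call).
    static long target(long a, long b) { return a * 10 + b; }

    // Hypothetical binding step: widen an int argument to long.
    static long widen(int x) { return x; }

    // Interpreted form of the recipe: one boxed conversion per argument.
    static final List<Function<Object, Object>> RECIPE = List.of(
            o -> ((Integer) o).longValue(),
            o -> ((Integer) o).longValue());

    // Interpreted approach: walk the recipe on every call, boxing each value.
    static long callInterpreted(int a, int b) {
        Object[] args = {a, b};
        Object[] converted = new Object[args.length];
        for (int i = 0; i < args.length; i++) {
            converted[i] = RECIPE.get(i).apply(args[i]);
        }
        try {
            return (Long) TARGET.invokeWithArguments(converted);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    // Specialized approach: the recipe steps are composed into one handle, once.
    static final MethodHandle SPECIALIZED =
            MethodHandles.filterArguments(TARGET, 0, WIDEN, WIDEN);

    static long callSpecialized(int a, int b) {
        try {
            // The composed handle has the exact type (int, int) -> long.
            return (long) SPECIALIZED.invokeExact(a, b);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(callInterpreted(3, 4));
        System.out.println(callSpecialized(3, 4));
    }
}
```

Both paths compute the same result; the difference is that the specialized handle has a fixed, unboxed shape that the JIT can inline through, whereas the interpreted loop boxes and dispatches on every call.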
I have an experimental implementation of this uploaded here:
https://github.com/openjdk/panama-foreign/compare/foreign-abi...JornVernee:Call_Intrinsics
This is based on the ideas of linkToNative, but is mostly a re-write,
since linkToNative did not support ABI customization. The new
implementation also uses a MethodHandle as the fallback implementation
until C2 kicks in, rather than eagerly generating a specialized stub.
Some numbers from the newly added CallOverhead benchmark [1]:
Benchmark                             Mode  Cnt   Score   Error  Units
CallOverhead.jni_blank                avgt   30   8.062 ± 0.153  ns/op
CallOverhead.jni_identity             avgt   30  12.360 ± 0.050  ns/op
CallOverhead.panama_blank             avgt   30   7.557 ± 0.025  ns/op
CallOverhead.panama_blank_trivial     avgt   30   1.619 ± 0.003  ns/op
CallOverhead.panama_identity          avgt   30  11.412 ± 0.023  ns/op
CallOverhead.panama_identity_trivial  avgt   30   4.298 ± 0.008  ns/op
NO_INTRINSICS:
Benchmark                     Mode  Cnt    Score   Error  Units
CallOverhead.jni_blank        avgt   30    7.963 ± 0.079  ns/op
CallOverhead.jni_identity     avgt   30   12.227 ± 0.027  ns/op
CallOverhead.panama_blank     avgt   30  193.799 ± 3.224  ns/op
CallOverhead.panama_identity  avgt   30  237.137 ± 1.150  ns/op
NO_SPEC:
Benchmark                     Mode  Cnt    Score   Error  Units
CallOverhead.jni_blank        avgt   30    8.064 ± 0.117  ns/op
CallOverhead.jni_identity     avgt   30   12.381 ± 0.072  ns/op
CallOverhead.panama_blank     avgt   30  193.705 ± 2.275  ns/op
CallOverhead.panama_identity  avgt   30  292.271 ± 3.344  ns/op
The NO_SPEC benchmarks at the bottom are the status quo. The
NO_INTRINSICS benchmarks do only the Java side specialization, not the
C2 specialization, and the benchmarks at the top are with everything
enabled. I've also experimented with an attribute that can be added to
FunctionDescriptor in case the function is small/trivial, which removes
the thread state transition; those are the *_trivial results. Note that
most native functions do not qualify for turning off thread state
transitions, so this is mostly to show the very minor difference (only
6-7ns) in call overhead in case our target function is trivial.
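As a quick check of the arithmetic behind these comparisons, the sketch below recomputes the thread state transition cost and the overall speedup from the scores in the tables. The class and method names are made up for illustration; only the numbers come from the results above.

```java
class CallOverheadArithmetic {
    // Scores in ns/op, copied from the benchmark tables in this mail.
    static final double PANAMA_BLANK = 7.557;
    static final double PANAMA_BLANK_TRIVIAL = 1.619;
    static final double PANAMA_IDENTITY = 11.412;
    static final double PANAMA_IDENTITY_TRIVIAL = 4.298;
    static final double NO_SPEC_PANAMA_BLANK = 193.705;
    static final double NO_SPEC_PANAMA_IDENTITY = 292.271;

    // Cost attributed to the thread state transition: regular minus trivial score.
    static double transitionCost(double withTransition, double trivial) {
        return withTransition - trivial;
    }

    // How many times faster the new score is compared to the old one.
    static double speedup(double before, double after) {
        return before / after;
    }

    public static void main(String[] args) {
        System.out.println("blank transition cost (ns):    "
                + transitionCost(PANAMA_BLANK, PANAMA_BLANK_TRIVIAL));
        System.out.println("identity transition cost (ns): "
                + transitionCost(PANAMA_IDENTITY, PANAMA_IDENTITY_TRIVIAL));
        System.out.println("blank speedup vs NO_SPEC:      "
                + speedup(NO_SPEC_PANAMA_BLANK, PANAMA_BLANK));
        System.out.println("identity speedup vs NO_SPEC:   "
                + speedup(NO_SPEC_PANAMA_IDENTITY, PANAMA_IDENTITY));
    }
}
```

The two transition costs come out at roughly 5.9 ns and 7.1 ns, which is where the 6-7ns figure comes from, and both panama calls end up about 25x faster than the status quo.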
For integrating this, I will probably split this work into 3 patches to
make reviewing easier:
1. the CallOverhead benchmark
2. the Java side specialization
3. the C2 support
Cheers,
Jorn
[1] :
https://github.com/openjdk/panama-foreign/compare/foreign-abi...JornVernee:Call_Intrinsics#diff-5234454e5c0aa31251dd12fbd3a10319