[foreign-abi] Intrinsify down calls
Jorn Vernee
jorn.vernee at oracle.com
Mon Apr 6 13:35:16 UTC 2020
Hi Vladimir, thanks for taking a look.
On 06/04/2020 13:28, Vladimir Ivanov wrote:
> Impressive progress, Jorn!
>
> I like how GraphKit::make_native_call() et al shapes and encapsulates
> NativeEntryPoint-related logic.
>
>
> I have some comments on native call representation in C2.
>
> src/hotspot/cpu/x86/x86_64.ad:
>
> +int MachCallNativeNode::ret_addr_offset() {
> + // FIXME return size of emitted code? What to do here?
> + return 0;
> +}
>
> MachCallNode::ret_addr_offset() is used to determine proper
> instruction address for a safepoint and it usually points to the
> instruction right after the call instruction.
>
> It's not your case though: JNI calls go through a native stub which
> performs both thread state transition and calls into native entry point.
> The stub itself is treated specially, so there's no need to record
> proper address for them, but you have to. Otherwise, JVM will have
> problems finding relevant safepoint information.
>
> It leads to the following suggestion: what do you think about inlining
> the stub only when state transition is omitted? When it's not, a
> special stub is generated and used (which obeys the call convention as
> much as possible). In both cases, you'll end up with a single call
> instruction in the generated code, but in the former case it calls
> directly into the native code while in the latter case it goes through
> the relevant stub.
I noticed while debugging that the return value of ret_addr_offset has to
correspond to the last_pc argument passed to set_last_Java_frame for the
safepoints to be handled correctly. Setting it to 0 worked in my
experiments, but maybe it will break for more complex code.
I like your suggestion but it brings up some uncertainty about how to
manage the lifetime of the generated stub. I.e. can it be somehow
attached to the compiled code? Can we cache/share these stubs easily (if
the call shape is the same)?
>
> Another important question is how register conflicts between VM and
> native code are handled. For example, when an argument/return value
> occupies the register which VM uses for its own purposes (r12/r15 on
> x86_64).
Yes, this is still an open problem. I was hoping to eventually solve
that using the register allocator (letting it generate the needed
spill/fill code), but another problem is that a lot of macro assembler
code assumes that it can kill registers that might be needed by the
target ABI. This has to be looked at further, but it is currently not a
problem for the ABIs we have. I'm thinking that if we haven't solved the
problem by the time an ABI is added that uses conflicting registers we
can bail out of compilation and keep running in the recipe interpreter
(fallback) instead.
>
> Some minor comments:
>
> src/hotspot/cpu/x86/x86_64.ad:
>
> +// Unpack native results
> +switch (_return_type) {
>
> It's better to reflect "unpacking" in the IR (as a separate pure node)
> than to hard-code it into the call logic.
Thanks. This was taken from the JNI stub generation code IIRC. I guess
I'd need to add such a node? Or does one exist already?
>
>
> src/hotspot/share/opto/output.cpp:
>
> - if (mcall->is_MachCallLeaf()) {
> + if (mcall->is_MachCallLeaf() || (mcall->is_MachCallNative()
> + && !(mcall->as_MachCallNative()->_need_transition))) {
> + // skip observing safepoint below (needs JVMS)
>
> Probably, it makes sense to capture CallNode::guaranteed_safepoint()
> during matching and use it instead here.
Ok, will do.
Thanks,
Jorn
>
> Best regards,
> Vladimir Ivanov
>
> [1]
>
> On 25.03.2020 19:08, Jorn Vernee wrote:
>> Hi,
>>
>> I have done some work looking at intrinsification to speed up calls.
>>
>> For down calls there are 2 areas that can be improved:
>> - Instead of interpreting a binding recipe for a call, we can use
>> MethodHandle combinators to create a specialized MethodHandle for
>> executing the steps of a binding recipe.
>> - When inlining a native MethodHandle, C2 can emit a direct
>> call to the target function, instead of using an intermediate buffer
>> to store the arguments (but borrowing some of the information it has
>> on input and output registers).
>>
>> I have an experimental implementation of this uploaded here:
>> https://github.com/openjdk/panama-foreign/compare/foreign-abi...JornVernee:Call_Intrinsics
>>
>>
>> This is based on the ideas of linkToNative, but is mostly a re-write,
>> since linkToNative did not support ABI customization. The new
>> implementation also uses a MethodHandle as the fallback
>> implementation until C2 kicks in, rather than generating a
>> specialized stub eagerly.
>>
>> Some numbers from the newly added CallOverhead benchmark [1]:
>>
>> Benchmark                             Mode  Cnt    Score   Error  Units
>> CallOverhead.jni_blank                avgt   30    8.062 ± 0.153  ns/op
>> CallOverhead.jni_identity             avgt   30   12.360 ± 0.050  ns/op
>> CallOverhead.panama_blank             avgt   30    7.557 ± 0.025  ns/op
>> CallOverhead.panama_blank_trivial     avgt   30    1.619 ± 0.003  ns/op
>> CallOverhead.panama_identity          avgt   30   11.412 ± 0.023  ns/op
>> CallOverhead.panama_identity_trivial  avgt   30    4.298 ± 0.008  ns/op
>>
>> NO_INTRINSICS:
>> Benchmark                             Mode  Cnt    Score   Error  Units
>> CallOverhead.jni_blank                avgt   30    7.963 ± 0.079  ns/op
>> CallOverhead.jni_identity             avgt   30   12.227 ± 0.027  ns/op
>> CallOverhead.panama_blank             avgt   30  193.799 ± 3.224  ns/op
>> CallOverhead.panama_identity          avgt   30  237.137 ± 1.150  ns/op
>>
>> NO_SPEC:
>> Benchmark                             Mode  Cnt    Score   Error  Units
>> CallOverhead.jni_blank                avgt   30    8.064 ± 0.117  ns/op
>> CallOverhead.jni_identity             avgt   30   12.381 ± 0.072  ns/op
>> CallOverhead.panama_blank             avgt   30  193.705 ± 2.275  ns/op
>> CallOverhead.panama_identity          avgt   30  292.271 ± 3.344  ns/op
>>
>> The NO_SPEC benchmarks at the bottom are the status quo, the
>> NO_INTRINSICS benchmarks only do the Java side specialization, but
>> not the C2 specialization, and the benchmarks at the top are with
>> everything enabled. I've also experimented with an attribute that can
>> be added to FunctionDescriptor in case the function is small/trivial,
>> which removes the thread state transition; these are the *_trivial
>> results. Note that most native functions do not qualify for turning
>> off thread state transitions, so this is mostly to show the very
>> minor difference (only 6-7ns) in call overhead in case our target
>> function is trivial.
>>
>> For integrating this, I will probably split this work into 3 patches
>> to make reviewing easier:
>> 1. the CallOverhead benchmark
>> 2. the Java side specialization
>> 3. the C2 support
>>
>> Cheers,
>> Jorn
>>
>> [1] :
>> https://github.com/openjdk/panama-foreign/compare/foreign-abi...JornVernee:Call_Intrinsics#diff-5234454e5c0aa31251dd12fbd3a10319
>>
>>
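
For readers following the first bullet in the quoted message (replacing interpretation of a binding recipe with MethodHandle combinators), here is a minimal Java sketch of the idea. It is not the code from the linked branch: the class RecipeSketch, the methods target, widenArg and narrowResult, and the three-step recipe are all hypothetical stand-ins, and the "native" target is an ordinary static Java method. Only the java.lang.invoke calls (findStatic, filterArguments, filterReturnValue, invokeExact) are real API.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.List;
import java.util.function.Function;

final class RecipeSketch {
    // Hypothetical stand-in for the native target function (identity on long).
    static long target(long value) { return value; }

    // Hypothetical binding steps: adapt the Java argument and the native result.
    static long widenArg(int value) { return value; }
    static int narrowResult(long value) { return (int) value; }

    // "Interpreted" recipe: a list of steps that is walked on every call,
    // with boxed intermediate values.
    static final List<Function<Object, Object>> RECIPE = List.of(
            v -> widenArg((Integer) v),
            v -> target((Long) v),
            v -> narrowResult((Long) v));

    static int callInterpreted(int arg) {
        Object current = arg;
        for (Function<Object, Object> step : RECIPE) {
            current = step.apply(current);
        }
        return (Integer) current;
    }

    // "Specialized" recipe: the same steps composed once into one MethodHandle.
    static final MethodHandle SPECIALIZED = specialize();

    static MethodHandle specialize() {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodHandle targetMh = lookup.findStatic(RecipeSketch.class, "target",
                    MethodType.methodType(long.class, long.class));
            MethodHandle widen = lookup.findStatic(RecipeSketch.class, "widenArg",
                    MethodType.methodType(long.class, int.class));
            MethodHandle narrow = lookup.findStatic(RecipeSketch.class, "narrowResult",
                    MethodType.methodType(int.class, long.class));
            // Adapt argument 0 first, giving a handle of type (int)long ...
            MethodHandle withArg = MethodHandles.filterArguments(targetMh, 0, widen);
            // ... then adapt the return value, giving a handle of type (int)int.
            return MethodHandles.filterReturnValue(withArg, narrow);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static int callSpecialized(int arg) {
        try {
            return (int) SPECIALIZED.invokeExact(arg);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(callInterpreted(42));
        System.out.println(callSpecialized(42));
    }
}
```

Both paths return the same value. The difference is that the specialized handle is composed once and has an exact (int)int type, so the JIT sees a single chain of handles held in a static final field rather than a loop over boxed steps.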