[foreign-abi] Intrinsify down calls

Wed Apr 29 11:43:10 UTC 2020

Hi Vladimir,

I've worked through your suggestions and fixed some problems with tests 
on Linux along the way. The latest version can be found here: 
https://github.com/openjdk/panama-foreign/compare/foreign-abi...JornVernee:Call_Intrinsics_Stubs

FWIW, most of the changes are to NativeEntryPoint.java and the 
accompanying nativeEntryPoint*.cpp files.

There's still a problem though: attaching the generated stub for the 
state transition to NativeEntryPoint, and then using a Cleaner to free 
it, isn't cutting it. The code heap still sporadically runs out of 
space during the tests. It seems that relying on GC is not enough here. 
Right now I rely on the fact that NULL is returned in the event of 
an allocation failure, call System.gc(), and then retry the 
allocation once more. But the code cache running out of space also seems 
to disable compilation, and compiled code is the whole point of spinning 
the stub in the first place. For now, relying on the GC seems like a dead end.
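
The allocate, System.gc(), retry-once pattern described above can be sketched in plain Java. This is only an illustration: `allocateWithGcRetry` and the `LongSupplier` allocator are hypothetical stand-ins (the real allocation happens in the VM's nativeEntryPoint*.cpp code), and 0 stands in for NULL.

```java
import java.util.function.LongSupplier;

class StubAllocRetry {
    // Hypothetical helper: 'allocator' returns a stub address, or 0 (standing in
    // for NULL) when the code heap has no room left.
    static long allocateWithGcRetry(LongSupplier allocator) {
        long stub = allocator.getAsLong();
        if (stub == 0L) {
            // Hint the GC so that Cleaners of unreachable owners can free old stubs.
            // System.gc() is only a request and cleaning runs asynchronously,
            // so the second attempt may still fail.
            System.gc();
            stub = allocator.getAsLong();
        }
        return stub;
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // Fake allocator that fails on the first attempt and succeeds on the second.
        long addr = allocateWithGcRetry(() -> ++attempts[0] == 1 ? 0L : 0x1000L);
        System.out.println("stub=0x" + Long.toHexString(addr) + " attempts=" + attempts[0]);
    }
}
```

The sketch also shows the weakness: nothing guarantees that the Cleaner has run by the time of the second attempt.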

There might be some options for attaching the stub to the nmethod 
itself, and then freeing it when the nmethod is freed. But maybe for now 
we should go back to emitting the thread state transition inline, and 
then improve upon this later.

Jorn

On 06/04/2020 16:03, Vladimir Ivanov wrote:
>
>>> It leads to the following suggestion: what do you think about 
>>> inlining the stub only when the state transition is omitted? When it's 
>>> not, a special stub is generated and used (which obeys the call 
>>> convention as much as possible). In both cases, you'll end up with a 
>>> single call instruction in the generated code, but in the former 
>>> case it calls directly into the native code while in the latter case 
>>> it goes through the relevant stub.
>>
>> I noticed while debugging that the return value of ret_addr_offset has 
>> to correspond to the last_pc argument passed to set_last_Java_frame 
>> for the safepoints to be handled correctly. Setting it to 0 worked in 
>> my experiments, but maybe it will break for more complex code.
>>
>> I like your suggestion but it brings up some uncertainty about how to 
>> manage the lifetime of the generated stub. I.e. can it be somehow 
>> attached to the compiled code? Can we cache/share these stubs easily 
>> (if the call shape is the same)?
>
> I think NativeEntryPoint instance is a good candidate to manage the 
> stub: once it goes away it should be safe to collect the stub. You can 
> register a Cleaner which does that.
>
> Regarding sharing, there are definitely opportunities to share the 
> stub between the calls of the same shape. Maybe it's worth considering 
> additional entity which describes "the shape of the call" and 
> NativeEntryPoint becomes a composition of such an ABIShape and entry 
> point address.
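
The "shape of the call" idea above could look roughly like the following sketch, in which entry points with equal shapes share one stub. All names here (`CallShape`, `EntryPoint`, `StubCache`) are hypothetical, the register lists are placeholder strings, and the sketch assumes a stub depends only on the shape and not on the target address.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical call shape: which registers carry arguments and results,
// and whether a thread state transition is needed.
record CallShape(List<String> argRegs, List<String> retRegs, boolean needsTransition) {}

// Hypothetical entry point: a shared shape plus a per-function target address.
record EntryPoint(CallShape shape, long targetAddress) {}

class StubCache {
    private final Map<CallShape, Long> stubs = new ConcurrentHashMap<>();
    private final Function<CallShape, Long> generator;

    StubCache(Function<CallShape, Long> generator) {
        this.generator = generator;
    }

    // Generates a stub the first time a shape is seen and reuses it afterwards.
    long stubFor(EntryPoint ep) {
        return stubs.computeIfAbsent(ep.shape(), generator);
    }

    public static void main(String[] args) {
        int[] generated = {0};
        StubCache cache = new StubCache(shape -> 0x4000L + 0x100L * generated[0]++);
        CallShape shape = new CallShape(List.of("rdi", "rsi"), List.of("rax"), true);
        long s1 = cache.stubFor(new EntryPoint(shape, 0x1111L));
        long s2 = cache.stubFor(new EntryPoint(shape, 0x2222L));
        System.out.println("shared=" + (s1 == s2) + " generated=" + generated[0]);
    }
}
```

The cache never evicts, so it leaves the lifetime question open: a shared stub can only be freed once no entry point with that shape remains.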
>
>>> Another important question is how register conflicts between VM and 
>>> native code are handled. For example, when an argument/return value 
>>> occupies the register which VM uses for its own purposes (r12/r15 on 
>>> x86_64).
>> Yes, this is still an open problem. I was hoping to eventually solve 
>> that using the register allocator (letting it generate the needed 
>> spill/fill code), but another problem is that a lot of macro 
>> assembler code assumes that it can kill registers that might be 
>> needed by the target ABI. This has to be looked at further, but it is 
>> currently not a problem for the ABIs we have. I'm thinking that if we 
>> haven't solved the problem by the time an ABI is added that uses 
>> conflicting registers we can bail out of compilation and keep running 
>> in the recipe interpreter (fallback) instead.
>
> Yes, just avoiding the intrinsification is a good stopgap solution.
>
>>> Some minor comments:
>>>
>>> src/hotspot/cpu/x86/x86_64.ad:
>>>
>>> +// Unpack native results
>>> +switch (_return_type) {
>>>
>>> It's better to reflect "unpacking" in the IR (as a separate pure 
>>> node) than to hard-code it into the call logic.
>> Thanks. This was taken from the JNI stub generation code IIRC. I 
>> guess I'd need to add such a node? Or does one exist already?
>
> There's no dedicated node for that, but you can just generate the 
> operations directly:
>
> 1) movzwl r r == (r & 0xFFFF) // andI_rReg_imm65535 in x86_64.ad
>
> 2) sign_extend_byte r == ((src << 24) >> 24) // i2b in x86_64.ad
>
> 3) sign_extend_word r == ((src << 16) >> 16) // i2s instruction in 
> x86_64.ad
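
For reference, the three operations listed above have direct Java equivalents. The sketch below only illustrates the arithmetic, with hypothetical method names; it is not the IR that C2 would build. It assumes the raw native result arrives in a 32-bit value whose upper bits may hold garbage.

```java
class UnpackNarrow {
    // 1) zero-extend the low 16 bits (char result)
    static int zeroExtendChar(int r) { return r & 0xFFFF; }

    // 2) sign-extend the low 8 bits (byte result)
    static int signExtendByte(int r) { return (r << 24) >> 24; }

    // 3) sign-extend the low 16 bits (short result)
    static int signExtendShort(int r) { return (r << 16) >> 16; }

    public static void main(String[] args) {
        int raw = 0x1234_80FF; // upper bits hold garbage left by the callee
        // Each operation agrees with the corresponding Java narrowing cast.
        System.out.println(zeroExtendChar(raw) == (char) raw);
        System.out.println(signExtendByte(raw) == (byte) raw);
        System.out.println(signExtendShort(raw) == (short) raw);
    }
}
```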
>
>>> src/hotspot/share/opto/output.cpp:
>>>
>>> -          if (mcall->is_MachCallLeaf()) {
>>> +          if (mcall->is_MachCallLeaf() || (mcall->is_MachCallNative()
>>> +              && !(mcall->as_MachCallNative()->_need_transition))) {
>>> +              // skip observing safepoint below (needs JVMS)
>>>
>>> Probably, it makes sense to capture CallNode::guaranteed_safepoint() 
>>> during matching and use it instead here.
>>
>> Ok, will do.
>
> Forgot to post the link:
>
> http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/share/opto/matcher.cpp#l1152 
>
>
> There's already some copying happening from CallNode to MachCallNode.
>
> Best regards,
> Vladimir Ivanov
>
>>> On 25.03.2020 19:08, Jorn Vernee wrote:
>>>> Hi,
>>>>
>>>> I have done some work looking at intrinsification to speed up calls.
>>>>
>>>> For down calls there are 2 areas that can be improved:
>>>> - Instead of interpreting a binding recipe for a call, we can use 
>>>> MethodHandle combinators to create a specialized MethodHandle for 
>>>> executing the steps of a binding recipe.
>>>> - When inlining a native MethodHandle, C2 can emit a direct 
>>>> call to the target function instead of using an intermediate 
>>>> buffer to store the arguments (while borrowing some of the 
>>>> information it has on input and output registers).
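
As a rough illustration of the first point, the sketch below composes a specialized handle with standard `java.lang.invoke` combinators instead of interpreting per-argument steps at call time. The `leaf`, `widen` and `narrow` methods are hypothetical stand-ins: `leaf` plays the role of the low-level call into native code, and the other two play the role of binding steps. None of them comes from the actual patch.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class SpecializedDowncall {
    // Stand-in for the low-level invoker; a real down call would reach native code.
    static long leaf(long a, long b) { return a + b; }

    // Stand-ins for binding steps: move an int into a 64-bit slot and back.
    static long widen(int v) { return v; }
    static int narrow(long raw) { return (int) raw; }

    // Builds an (int,int)int handle by wrapping the (long,long)long leaf.
    static MethodHandle specialize() throws ReflectiveOperationException {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        MethodHandle target = lookup.findStatic(SpecializedDowncall.class, "leaf",
                MethodType.methodType(long.class, long.class, long.class));
        MethodHandle widen = lookup.findStatic(SpecializedDowncall.class, "widen",
                MethodType.methodType(long.class, int.class));
        MethodHandle narrow = lookup.findStatic(SpecializedDowncall.class, "narrow",
                MethodType.methodType(int.class, long.class));
        // Apply the argument steps first, then the return-value step.
        MethodHandle withArgs = MethodHandles.filterArguments(target, 0, widen, widen);
        return MethodHandles.filterReturnValue(withArgs, narrow);
    }

    public static void main(String[] args) throws Throwable {
        MethodHandle mh = specialize();
        int result = (int) mh.invokeExact(40, 2);
        System.out.println(result); // prints 42
    }
}
```

Because the composed handle is an ordinary MethodHandle chain, the JIT can inline through it once the handle is constant at the call site.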
>>>>
>>>> I have an experimental implementation of this uploaded here: 
>>>> https://github.com/openjdk/panama-foreign/compare/foreign-abi...JornVernee:Call_Intrinsics 
>>>>
>>>>
>>>> This is based on the ideas of linkToNative, but is mostly a 
>>>> re-write, since linkToNative did not support ABI customization. The 
>>>> new implementation also uses a MethodHandle as the 
>>>> fallback implementation until C2 kicks in, rather than generating a 
>>>> specialized stub eagerly.
>>>>
>>>> Some numbers from the newly added CallOverhead benchmark [1]:
>>>>
>>>> Benchmark                             Mode  Cnt   Score   Error  Units
>>>> CallOverhead.jni_blank                avgt   30   8.062 ± 0.153  ns/op
>>>> CallOverhead.jni_identity             avgt   30  12.360 ± 0.050  ns/op
>>>> CallOverhead.panama_blank             avgt   30   7.557 ± 0.025  ns/op
>>>> CallOverhead.panama_blank_trivial     avgt   30   1.619 ± 0.003  ns/op
>>>> CallOverhead.panama_identity          avgt   30  11.412 ± 0.023  ns/op
>>>> CallOverhead.panama_identity_trivial  avgt   30   4.298 ± 0.008  ns/op
>>>>
>>>> NO_INTRINSICS:
>>>> Benchmark                             Mode  Cnt    Score   Error  Units
>>>> CallOverhead.jni_blank                avgt   30    7.963 ± 0.079  ns/op
>>>> CallOverhead.jni_identity             avgt   30   12.227 ± 0.027  ns/op
>>>> CallOverhead.panama_blank             avgt   30  193.799 ± 3.224  ns/op
>>>> CallOverhead.panama_identity          avgt   30  237.137 ± 1.150  ns/op
>>>>
>>>> NO_SPEC:
>>>> Benchmark                             Mode  Cnt    Score   Error  Units
>>>> CallOverhead.jni_blank                avgt   30    8.064 ± 0.117  ns/op
>>>> CallOverhead.jni_identity             avgt   30   12.381 ± 0.072  ns/op
>>>> CallOverhead.panama_blank             avgt   30  193.705 ± 2.275  ns/op
>>>> CallOverhead.panama_identity          avgt   30  292.271 ± 3.344  ns/op
>>>>
>>>> The NO_SPEC benchmarks at the bottom are the status quo, the 
>>>> NO_INTRINSICS benchmarks only do the Java side specialization, but 
>>>> not the C2 specialization, and the benchmarks at the top are with 
>>>> everything enabled. I've also experimented with an attribute that 
>>>> can be added to FunctionDescriptor in case the function is 
>>>> small/trivial, which removes the thread state transition; these are 
>>>> the *_trivial results. Note that most native functions do not 
>>>> qualify for turning off thread state transitions, so this is mostly 
>>>> to show the very minor difference (only 6-7ns) in call overhead in 
>>>> case our target function is trivial.
>>>>
>>>> For integrating this, I will probably split this work into 3 
>>>> patches to make reviewing easier:
>>>> 1. the CallOverhead benchmark
>>>> 2. the Java side specialization
>>>> 3. the C2 support
>>>>
>>>> Cheers,
>>>> Jorn
>>>>
>>>> [1] : 
>>>> https://github.com/openjdk/panama-foreign/compare/foreign-abi...JornVernee:Call_Intrinsics#diff-5234454e5c0aa31251dd12fbd3a10319 
>>>>
>>>>