status of VM long loop optimizations - call for action

Thu Dec 16 16:35:33 UTC 2021

On 16/12/2021 16:24, Rado Smogura wrote:
>
> Hi,
>
> I don't know the details of the underlying ABI, but I think there 
> should be no need to allocate additional structs.
>
> For POSIX read we pass 3 arguments, which should fit in registers, and 
> all are 32/64-bit values
>
> (long)mh$.invokeExact(__fd, __buf, __nbytes)
>
> This happens in both cases, whether buf is a MemorySegment or a 
> MemoryAddress; the rest are primitives.
>
What is the signature of the native function? Is the argument 
corresponding to __buf a struct or a pointer?

Maurizio

> Kind regards,
>
> Rado
>
> On 16.12.2021 12:12, Maurizio Cimadamore wrote:
>>
>> On 13/12/2021 22:10, Maurizio Cimadamore wrote:
>>> That's odd - I mean, the BindingContext is used when setting up 
>>> downcall method handles, or upcall stubs. But it should not be 
>>> invoked in the hot path. 
>>
>> Correction: the ofAllocator call you see might in fact even be in a 
>> hot path. A downcall method handle sometimes has to allocate memory 
>> for the temp buffers it uses. When that happens, the invocation is 
>> wrapped with a try-with-resources (well, an MH chain equivalent to that 
>> is generated) and a new "binding context" with a SegmentAllocator is 
>> created. This should happen only when structs that are too big are 
>> passed by reference by the ABI (I think that happens on Windows) - 
>> so we have to create a temp segment holding the struct, and pass the 
>> segment pointer to the underlying native function. The temp struct is 
>> then destroyed after the call.
>>
>> Upcalls also need an allocator, in case they receive structs by 
>> value (again, a temp segment might need to be allocated for the 
>> duration of the upcall).
>>
>> So, even if your downcall is fully intrinsified, you might still see 
>> calls to BindingContext::ofAllocator, depending on the shape of the 
>> called function. It is possible that C2 might have issues 
>> scalarizing the Binding.Context allocation - but that's a separate 
>> problem from the one we were discussing (the impact of long loop 
>> optimizations).
>>
>> On that topic, I see that Roland has submitted a PR for the remaining 
>> perf issue we have seen in our micro benchmarks:
>>
>> https://github.com/openjdk/jdk18/pull/35
>>
>> I expect that, once integrated, we should then have full performance 
>> parity with current workarounds.
>>
>> Maurizio
>>
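
The temp-buffer lifecycle described in the quoted message above (allocate a temporary segment for a struct passed by reference, hand it to the callee, destroy it after the call) can be illustrated with a minimal Java sketch. This is only a model of the try-with-resources shape, not the actual method handle chain the JDK generates: it uses no foreign API or JDK internals, a heap ByteBuffer stands in for the temp segment, and the names CallContext, callWithStructByReference and lastContext are hypothetical, invented for this illustration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToLongFunction;

public class TempBufferSketch {

    /** Hypothetical per-call context that owns the temporary buffers of one invocation. */
    static final class CallContext implements AutoCloseable {
        private final List<ByteBuffer> live = new ArrayList<>();
        private boolean closed;

        /** Allocates a temp buffer whose lifetime is tied to this context. */
        ByteBuffer allocate(int size) {
            if (closed) {
                throw new IllegalStateException("context already closed");
            }
            ByteBuffer buffer = ByteBuffer.allocate(size);
            live.add(buffer);
            return buffer;
        }

        int liveBuffers() {
            return live.size();
        }

        boolean isClosed() {
            return closed;
        }

        /** Drops all temp buffers; runs when the try-with-resources block exits. */
        @Override
        public void close() {
            live.clear();
            closed = true;
        }
    }

    // Assumption for illustration only: kept so the sketch can be inspected after a call.
    static CallContext lastContext;

    /**
     * Models a call whose struct argument is passed by reference:
     * copy the struct into a temp buffer, invoke the target with that buffer,
     * then release the buffer whether the target returns or throws.
     */
    static long callWithStructByReference(byte[] structBytes,
                                          ToLongFunction<ByteBuffer> target) {
        try (CallContext ctx = new CallContext()) {
            lastContext = ctx;
            ByteBuffer tmp = ctx.allocate(structBytes.length);
            tmp.put(structBytes);
            tmp.flip(); // make the copied bytes readable by the target
            return target.applyAsLong(tmp);
        }
    }

    public static void main(String[] args) {
        // Stand-in "native function": sums the bytes of the struct it receives.
        long sum = callWithStructByReference(new byte[] {1, 2, 3}, buf -> {
            long total = 0;
            while (buf.hasRemaining()) {
                total += buf.get();
            }
            return total;
        });
        System.out.println("sum = " + sum);
    }
}
```

The point of the sketch is that a fresh context object is created on every such call, which is why an allocation of this kind can show up in a hot path unless the JIT manages to eliminate it.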