status of VM long loop optimizations - call for action

Thu Dec 16 16:45:07 UTC 2021

Hi,

Here's the signature from the man pages:

ssize_t read(int fd, void *buf, size_t count);

And the one generated by jextract:

public static long read(int __fd, Addressable __buf, long __nbytes)

The buf comes from a pooled allocator, so it's a previously allocated 
memory segment. I tried with buf passed both as a MemorySegment and as a 
MemoryAddress.
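
To make the argument shape concrete, here is a minimal C sketch of a call to read(). The helper name read_back_through_pipe and the pipe setup are made up for illustration; they are not from jextract or the JDK. It only shows that the native side receives an int, a plain pointer and a size_t, with no struct passed by value:

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (illustration only): writes msg into a pipe and
 * reads it back with POSIX read(). Returns the byte count reported by
 * read(), or -1 if the pipe could not be set up or written to. */
static ssize_t read_back_through_pipe(const char *msg, void *buf, size_t count)
{
    int fds[2];
    size_t len;
    ssize_t n;

    assert(msg != NULL && buf != NULL);

    if (pipe(fds) != 0)
        return -1;

    len = strlen(msg);
    if (write(fds[1], msg, len) != (ssize_t)len) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]); /* close the write end so the reader sees EOF after msg */

    /* The call under discussion: three scalar arguments
     * (int fd, void *buf, size_t count). buf is a pointer, not a struct. */
    n = read(fds[0], buf, count);
    close(fds[0]);
    return n;
}
```

Calling it with a short message and a caller-owned buffer returns the number of bytes copied into that buffer, which is roughly what the jextract binding does with a pre-allocated segment.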

Kind regards,

Rado

On 16.12.2021 17:35, Maurizio Cimadamore wrote:
>
>
> On 16/12/2021 16:24, Rado Smogura wrote:
>>
>> Hi,
>>
>> I don't know the details of the underlying ABI; however, I think there 
>> should be no need to allocate additional structs.
>>
>> For POSIX read we pass 3 arguments, which should fit in registers, and 
>> all are 32/64-bit values:
>>
>> (long)mh$.invokeExact(__fd, __buf, __nbytes)
>>
>> This happens in both cases, whether buf is a MemorySegment or a 
>> MemoryAddress; the rest are primitives.
>>
> What is the signature of the native function? Is the argument 
> corresponding to __buf a struct or a pointer?
>
> Maurizio
>
>> Kind regards,
>>
>> Rado
>>
>> On 16.12.2021 12:12, Maurizio Cimadamore wrote:
>>>
>>> On 13/12/2021 22:10, Maurizio Cimadamore wrote:
>>>> That's odd - I mean, the BindingContext is used when setting up 
>>>> downcall method handles or upcall stubs. But it should not be invoked 
>>>> in the hot path. 
>>>
>>> Correction: the ofAllocator call you see might in fact even be in a 
>>> hot path. A downcall method handle sometimes has to allocate memory 
>>> for the temp buffers it uses. When that happens, the invocation is 
>>> wrapped with a try-with-resources (well a MH chain equivalent to 
>>> that is generated) and a new "binding context" with a 
>>> SegmentAllocator is created. This should happen only when structs 
>>> that are too big are passed by reference by the ABI (I think that 
>>> happens on Windows) - so we have to create a temp segment holding 
>>> the struct, and pass the segment pointer to the underlying native 
>>> function. The temp struct is then destroyed after the call.
>>>
>>> Upcalls also need an allocator, in case they receive structs by 
>>> value (again, a temp segment might need to be allocated for the 
>>> duration of the upcall).
>>>
>>> So, even if your downcall is fully intrinsified, you might still see 
>>> calls to BindingContext::ofAllocator, depending on the shape of the 
>>> called function. It is possible that C2 might have issues 
>>> scalarizing the Binding.Context allocation - but that's a separate 
>>> problem from the one we were discussing (the impact of long loop 
>>> optimizations).
>>>
>>> On that topic, I see that Roland has submitted a PR for the 
>>> remaining perf issue we have seen in our micro benchmarks:
>>>
>>> https://github.com/openjdk/jdk18/pull/35
>>>
>>> I expect that, once integrated, we should then have full performance 
>>> parity with current workarounds.
>>>
>>> Maurizio
>>>