[foreign-memaccess+abi] RFR: Performance improvement to unchecked segment ofNativeRestricted [v2]

Mon Jan 18 10:59:50 UTC 2021

On Mon, 18 Jan 2021 10:54:42 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

>>> > > Hi. I've added the benchmarks:
>>> > > 
>>> > > * the benchmarks I used;
>>> > > * new benchmarks in `LoopOverNonConstant` to compare the performance of this approach with the existing methods for getting values.
>>> > 
>>> > 
>>> > It's strange to see the performance of the global segment be worse than that of a regular segment.
>>> 
>>> Ah! I think I know what's happening here. In principle, the original implementation should have been just as good - but there's one issue: since the everything segment has a length of Long.MAX_VALUE, it is, by definition, not a SMALL segment - which means the internal optimizations that work around the hotspot limitations with int vs. long loops will not apply to the everything segment. That's why it was 2x slower. Your patch removes _some_ of these factors, but not _all_ of them. And I think the updates to LoopOverNonConstant are wrong - as you are essentially trying to use a strided VarHandle in an absolute way (see code comment).
>>> 
>>> Note that an indexed VarHandle will, internally, need to perform additions and multiplications - so if your segment is "big" these will be long additions and multiplications - which you will pay for (at least for the time being).
>>> 
>>> Overall I'm hopeful that, by the time we fix the long vs. int optimization problem in hotspot, the code we have today should just get faster, without a real need for a global segment.
>> 
>> Hi,
>> 
>> It's interesting how changing just one type can make such a big performance difference.
>> 
>> I changed the access to `globalRestrictedSegment`. I completely agree the previous version was a bit weird.
>> 
>> I analyzed both methods, and it looks like (I'm no expert) the segment / int versions do loop unrolling:
>>   0x00007fbcd80c8220:   add    0x0(%r13),%eax
>>   0x00007fbcd80c8224:   add    0x4(%r13),%eax
>>   0x00007fbcd80c8228:   add    0x8(%r13),%eax
>> // Removed similar lines
>>   0x00007fbcd80c8254:   add    0x34(%r13),%eax
>>   0x00007fbcd80c8258:   add    0x38(%r13),%eax
>>   0x00007fbcd80c825c:   add    0x3c(%r13),%eax              ;*iadd {reexecute=0 rethrow=0 return_oop=0}
>>                                                             ; - org.openjdk.bench.jdk.incubator.foreign.LoopOverNonConstant::segment_loop@23 (line 165)
>>   0x00007fbcd80c8260:   add    $0x10,%ebx                   ;*iinc {reexecute=0 rethrow=0 return_oop=0}
>>                                                             ; - org.openjdk.bench.jdk.incubator.foreign.LoopOverNonConstant::segment_loop@25 (line 164)
>> I could not find such an optimization in the case of the global segment and long numbers.
>> 
>> On the other hand, I have put together this PR and the previous one with Unsafe access, disabled alignment checks by setting the alignment to 1, and the global memory segment rocks. I added a benchmark method to show its performance.
>
>> I could not find such an optimization in the case of the global segment and long numbers.
> 
> The optimization happens on segment construction - if the segment size is less than Integer.MAX_VALUE, the AbstractMemorySegment::defaultAccessMode method appends the SMALL flag to the set of segment flags. This flag is then consulted many times, typically before long additions/multiplications have to be performed.
>> 
>> On the other hand, I have put together this PR and the previous one with Unsafe access, disabled alignment checks by setting the alignment to 1, and the global memory segment rocks. I added a benchmark method to show its performance.
> 
> Alignment checks are an outstanding issue for the API - hotspot doesn't optimize those very well. See also:
> 
> https://mail.openjdk.java.net/pipermail/panama-dev/2021-January/011794.html
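
To make the SMALL flag point above a bit more concrete, here is a rough sketch of the idea. It is not the actual AbstractMemorySegment code: the class name, the flag value and the `scaledOffset` helper are all made up for illustration, and the use of `Math.toIntExact`/`Math.multiplyExact` is just one assumed way of keeping the arithmetic safe. The flag is decided once, at construction, and is then consulted before each offset computation so that small segments can stay on int arithmetic:

```java
// Illustrative only: NOT the JDK's AbstractMemorySegment. All names are hypothetical.
final class SmallFlagSketch {
    static final int SMALL = 1; // hypothetical flag bit

    final long length;
    final int flags;

    SmallFlagSketch(long length) {
        this.length = length;
        // Decided once, at construction: sizes below Integer.MAX_VALUE get the flag.
        this.flags = length < Integer.MAX_VALUE ? SMALL : 0;
    }

    boolean isSmall() {
        return (flags & SMALL) != 0;
    }

    // Byte offset of element `index` for a given element size (stride).
    long scaledOffset(long index, long stride) {
        if (isSmall()) {
            // Small segment: narrow to int (throws if out of range) and multiply as ints.
            int i = Math.toIntExact(index);
            int s = Math.toIntExact(stride);
            return Math.multiplyExact(i, s);
        }
        // Big segment (e.g. the everything segment): long multiplication.
        return Math.multiplyExact(index, stride);
    }

    public static void main(String[] args) {
        SmallFlagSketch regular = new SmallFlagSketch(1024);
        SmallFlagSketch everything = new SmallFlagSketch(Long.MAX_VALUE);
        System.out.println(regular.isSmall());
        System.out.println(everything.isSmall());
        System.out.println(regular.scaledOffset(10, 4));
    }
}
```

A segment of length Long.MAX_VALUE never gets the flag in this sketch, so every access goes down the long path, which is the situation described for the everything segment.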

Would you mind pasting the benchmark results before/after the patch, now that the benchmark has been fixed? Thanks!

-------------

PR: https://git.openjdk.java.net/panama-foreign/pull/437