[foreign-memaccess+abi] RFR: Performance improvement to unchecked segment ofNativeRestricted [v4]

Wed Jan 20 14:23:52 UTC 2021

On Sat, 16 Jan 2021 21:00:04 GMT, Radoslaw Smogura <github.com+7535718+rsmogura at openjdk.org> wrote:

>> This change removes the range and temporal checks for the `ofNativeRestricted` segment by turning them into no-ops. Because this segment is global, those checks are not needed.
>> 
>> The generated native code is smaller and, depending on the CPU, access outperforms plain Java arrays (the `target` benchmark below).
>> Changed
>> Benchmark                           Mode  Cnt          Score        Error  Units
>> AccessBenchmark.foreignAddress     thrpt    5  128946129.691 ± 317433.113  ops/s
>> AccessBenchmark.foreignAddressRaw  thrpt    5  136883439.221 ± 749390.255  ops/s
>> AccessBenchmark.target             thrpt    5  125325586.957 ±  32129.931  ops/s
>> Base
>> Benchmark                           Mode  Cnt          Score        Error  Units
>> AccessBenchmark.foreignAddress     thrpt    5  125257424.876 ± 230508.169  ops/s
>> AccessBenchmark.foreignAddressRaw  thrpt    5  128818591.434 ± 241806.765  ops/s
>> AccessBenchmark.target             thrpt    5  125083379.819 ± 184070.467  ops/s
>> ---
>> This PR is a replacement for https://github.com/openjdk/panama-foreign/pull/431 (OCA)
>> and was partially discussed (before these changes) in https://mail.openjdk.java.net/pipermail/panama-dev/2021-January/011747.htm
>> 
>> ---
>> Benchmark
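>> import java.lang.invoke.VarHandle;
>> import java.nio.ByteOrder;
>> import jdk.incubator.foreign.MemoryAddress;
>> import jdk.incubator.foreign.MemoryHandles;
>> import jdk.incubator.foreign.MemorySegment;
>> import org.openjdk.jmh.annotations.Benchmark;
>> import org.openjdk.jmh.annotations.Scope;
>> import org.openjdk.jmh.annotations.Setup;
>> import org.openjdk.jmh.annotations.State;
>> import org.openjdk.jmh.infra.Blackhole;
>> 
>> @State(Scope.Thread)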
>> public class AccessBenchmark {
>>     static final MemorySegment ms = MemorySegment.ofNativeRestricted();
>>     static final VarHandle intHandle = MemoryHandles.varHandle(int.class, ByteOrder.nativeOrder());
>> 
>>     int[] intData = new int[12];
>>     volatile int intDataOffset = 0;
>> 
>>     volatile MemoryAddress address;
>>     volatile long addressRaw;
>> 
>>     @Setup
>>     public void setup() {
>>         // Local segment, named so it does not shadow the static global segment `ms`.
>>         var allocated = MemorySegment.allocateNative(256);
>>         address = allocated.address();
>>         addressRaw = address.toRawLongValue();
>>     }
>> 
>>     @Benchmark
>>     public void target(Blackhole bh) {
>>         int[] local = intData;
>>         int localOffset = intDataOffset;
>>         bh.consume(local[localOffset]);
>>         bh.consume(local[localOffset + 1]);
>>     }
>> 
>>     @Benchmark
>>     public void foreignAddress(Blackhole bh) {
>>         var a = address;
>>         bh.consume((int) intHandle.get(ms, a.addOffset(0).toRawLongValue()));
>>         bh.consume((int) intHandle.get(ms, a.addOffset(4).toRawLongValue()));
>>     }
>> 
>>     @Benchmark
>>     public void foreignAddressRaw(Blackhole bh) {
>>         var a = addressRaw;
>>         bh.consume((int) intHandle.get(ms, a));
>>         bh.consume((int) intHandle.get(ms, a + 4));
>>     }
>> }
>
> Radoslaw Smogura has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Replaced the stride access with a normal VarHandle.
>   
>   Added a no_align benchmark to compare performance with alignment checks turned off.
>   ```
>   Benchmark                                         Mode  Cnt  Score   Error  Units
>   LoopOverNonConstant.BB_get                        avgt   30  3.892 ± 0.012  ns/op
>   LoopOverNonConstant.BB_loop                       avgt   30  0.230 ± 0.001  ms/op
>   LoopOverNonConstant.global_segment_get            avgt   30  3.887 ± 0.008  ns/op
>   LoopOverNonConstant.global_segment_loop           avgt   30  0.396 ± 0.002  ms/op
>   LoopOverNonConstant.global_segment_loop_no_align  avgt   30  0.247 ± 0.001  ms/op
>   LoopOverNonConstant.segment_get                   avgt   30  5.489 ± 0.014  ns/op
>   LoopOverNonConstant.segment_loop                  avgt   30  0.229 ± 0.001  ms/op
>   LoopOverNonConstant.segment_loop_readonly         avgt   30  0.236 ± 0.001  ms/op
>   LoopOverNonConstant.segment_loop_slice            avgt   30  0.241 ± 0.001  ms/op
>   LoopOverNonConstant.segment_loop_static           avgt   30  0.230 ± 0.001  ms/op
>   LoopOverNonConstant.unsafe_get                    avgt   30  3.425 ± 0.006  ns/op
>   LoopOverNonConstant.unsafe_loop                   avgt   30  0.230 ± 0.001  ms/op
>   ```
>   Without the `ofNativeRestricted` optimization:
>   ```
>   LoopOverNonConstant.global_segment_get     avgt   30  4.126 ±  0.006  ns/op
>   LoopOverNonConstant.global_segment_loop    avgt   30  0.603 ±  0.001  ms/op
>   ```

Looks good for now - we can reassess after the hotspot improvements for long in loops start to have visible effects. Thanks!
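
For readers comparing the quoted tables, the relative changes can be worked out with a small standalone program. This is only a sketch: the class name `BenchmarkDelta` and the helper `percentChange` are made up for illustration and are not part of the PR. The input numbers are copied from the benchmark output quoted above.

```java
import java.util.Locale;

// Hypothetical helper, not part of the PR: computes relative changes
// between the "Base"/"not optimized" and "Changed" benchmark scores.
class BenchmarkDelta {

    // Relative change of `changed` versus `base`, in percent.
    static double percentChange(double base, double changed) {
        return (changed - base) / base * 100.0;
    }

    public static void main(String[] args) {
        // Throughput in ops/s (higher is better), from the AccessBenchmark tables.
        double foreignAddress = percentChange(125257424.876, 128946129.691);
        double foreignAddressRaw = percentChange(128818591.434, 136883439.221);

        // Average time (lower is better), from the LoopOverNonConstant tables:
        // not-optimized score first, optimized score second.
        double globalSegmentGet = percentChange(4.126, 3.887);   // ns/op
        double globalSegmentLoop = percentChange(0.603, 0.396);  // ms/op

        System.out.printf(Locale.ROOT, "foreignAddress      %+.1f%% ops/s%n", foreignAddress);
        System.out.printf(Locale.ROOT, "foreignAddressRaw   %+.1f%% ops/s%n", foreignAddressRaw);
        System.out.printf(Locale.ROOT, "global_segment_get  %+.1f%% ns/op%n", globalSegmentGet);
        System.out.printf(Locale.ROOT, "global_segment_loop %+.1f%% ms/op%n", globalSegmentLoop);
    }
}
```

Saved as `BenchmarkDelta.java`, this runs with `java BenchmarkDelta.java`. On the quoted numbers the throughput gains come to roughly 3% and 6%, and the `global_segment_loop` time drops by about a third.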

-------------

Marked as reviewed by mcimadamore (Committer).

PR: https://git.openjdk.java.net/panama-foreign/pull/437