[foreign-memaccess+abi] RFR: Performance improvement to unchecked segment ofNativeRestricted [v3]

Maurizio Cimadamore mcimadamore at openjdk.java.net
Sat Jan 16 15:34:06 UTC 2021


On Sat, 16 Jan 2021 02:08:55 GMT, Radoslaw Smogura <github.com+7535718+rsmogura at openjdk.org> wrote:

>> This change removes (by turning them into no-ops) the range and temporal checks for the `ofNativeRestricted` segment. As this segment is global, these checks are not needed.
>> 
>> Generated native code is smaller, and execution outperforms Java native arrays (depending on CPU)
>> Changed
>> Benchmark                           Mode  Cnt          Score        Error  Units
>> AccessBenchmark.foreignAddress     thrpt    5  128946129.691 ± 317433.113  ops/s
>> AccessBenchmark.foreignAddressRaw  thrpt    5  136883439.221 ± 749390.255  ops/s
>> AccessBenchmark.target             thrpt    5  125325586.957 ±  32129.931  ops/s
>> Base
>> Benchmark                           Mode  Cnt          Score        Error  Units
>> AccessBenchmark.foreignAddress     thrpt    5  125257424.876 ± 230508.169  ops/s
>> AccessBenchmark.foreignAddressRaw  thrpt    5  128818591.434 ± 241806.765  ops/s
>> AccessBenchmark.target             thrpt    5  125083379.819 ± 184070.467  ops/s
>> ---
>> This PR is a replacement for https://github.com/openjdk/panama-foreign/pull/431 (OCA)
>> and was partially discussed (before changes) in https://mail.openjdk.java.net/pipermail/panama-dev/2021-January/011747.htm
>> 
>> ---
>> Benchmark
>> @State(Scope.Thread)
>> public class AccessBenchmark {
>>     static final MemorySegment ms = MemorySegment.ofNativeRestricted();
>>     static final VarHandle intHandle = MemoryHandles.varHandle(int.class, ByteOrder.nativeOrder());
>> 
>>     int[] intData = new int[12];
>>     volatile int intDataOffset = 0;
>> 
>>     volatile MemoryAddress address;
>>     volatile long addressRaw;
>> 
>>     @Setup
>>     public void setup() {
>>         var ms = MemorySegment.allocateNative(256);
>>         address = ms.address();
>>         addressRaw = address.toRawLongValue();
>>     }
>> 
>>     @Benchmark
>>     public void target(Blackhole bh) {
>>         int[] local = intData;
>>         int localOffset = intDataOffset;
>>         bh.consume(local[localOffset]);
>>         bh.consume(local[localOffset + 1]);
>>     }
>> 
>>     @Benchmark
>>     public void foreignAddress(Blackhole bh) {
>>         var a = address;
>>         bh.consume((int) intHandle.get(ms, a.addOffset(0).toRawLongValue()));
>>         bh.consume((int) intHandle.get(ms, a.addOffset(4).toRawLongValue()));
>>     }
>> 
>>     @Benchmark
>>     public void foreignAddressRaw(Blackhole bh) {
>>         var a = addressRaw;
>>         bh.consume((int) intHandle.get(ms, a));
>>         bh.consume((int) intHandle.get(ms, a + 4));
>>     }
>> }
>
> Radoslaw Smogura has updated the pull request incrementally with one additional commit since the last revision:
> 
>   JMH Benchmarks for evaluation of `ofNativeRestricted`
>   
>   Original benchmark comparing the performance of accessing
>   data using var handles vs ordinary arrays
>   
>   Modified the existing benchmark `LoopOverNonConstant` to
>   show the differences between range / temporal checking and non-checking segments.
>   
>   ```
>   Benchmark                                  Mode  Cnt  Score    Error  Units
>   LoopOverNonConstant.BB_get                 avgt   30  3.885 ±  0.003  ns/op
>   LoopOverNonConstant.BB_loop                avgt   30  0.229 ±  0.001  ms/op
>   LoopOverNonConstant.global_segment_get     avgt   30  3.663 ±  0.006  ns/op
>   LoopOverNonConstant.global_segment_loop    avgt   30  0.374 ±  0.001  ms/op
>   LoopOverNonConstant.segment_get            avgt   30  5.514 ±  0.023  ns/op
>   LoopOverNonConstant.segment_loop           avgt   30  0.229 ±  0.001  ms/op
>   ```
>   Not optimized `ofNativeRestricted`
>   ```
>   LoopOverNonConstant.global_segment_get     avgt   30  4.126 ±  0.006  ns/op
>   LoopOverNonConstant.global_segment_loop    avgt   30  0.603 ±  0.001  ms/op
>   ```

test/micro/org/openjdk/bench/jdk/incubator/foreign/LoopOverNonConstant.java line 65:

> 63:     static final MemorySegment globalRestrictedSegment = MemorySegment.ofNativeRestricted();
> 64: 
> 65:     static final VarHandle VH_int = MemoryLayout.ofSequence(JAVA_INT).varHandle(int.class, sequenceElement());

This is a strided VarHandle, i.e. it takes a logical index (the index of the sequence element inside the segment) and dereferences that element. So, you use it like this:

VH_int.get(segment, 0); // first int element
VH_int.get(segment, 1); // second int element (segment base address + 4)
VH_int.get(segment, 2); // third int element (segment base address + 8)
...
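
The index-to-offset arithmetic of a strided handle can be sketched without the incubator API. This is only an illustration under stated assumptions: it uses a plain `ByteBuffer` as a stand-in for a segment, and `StridedAccessSketch`, `getIntAtIndex` and `sample` are hypothetical names, not part of `MemorySegment` or `VH_int`. It shows nothing more than logical index `i` mapping to byte offset `i * 4`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class StridedAccessSketch {
    // Hypothetical helper: reads the int at a logical element index,
    // i.e. at byte offset index * Integer.BYTES from the start of the buffer.
    static int getIntAtIndex(ByteBuffer buf, int index) {
        return buf.getInt(index * Integer.BYTES);
    }

    // Builds a small buffer holding the ints 10, 20, 30 back to back.
    static ByteBuffer sample() {
        ByteBuffer buf = ByteBuffer.allocate(3 * Integer.BYTES).order(ByteOrder.nativeOrder());
        buf.putInt(0, 10); // element 0 at byte offset 0
        buf.putInt(4, 20); // element 1 at byte offset 4
        buf.putInt(8, 30); // element 2 at byte offset 8
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer buf = sample();
        for (int i = 0; i < 3; i++) {
            System.out.println("index " + i
                    + " -> byte offset " + (i * Integer.BYTES)
                    + " -> value " + getIntAtIndex(buf, i));
        }
    }
}
```

The caller only ever supplies the element index; the multiplication by the carrier size happens inside the accessor, which is the behaviour described above for the strided handle.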

test/micro/org/openjdk/bench/jdk/incubator/foreign/LoopOverNonConstant.java line 140:

> 138:         int res = 0;
> 139:         for (int i = 0; i < ELEM_SIZE; i ++) {
> 140:             res += (int) VH_int.get(globalRestrictedSegment, segment_addr_idx + i);

This looks wrong: you are passing an absolute address as a "logical index" argument. I see that you are attempting to divide the segment address by the carrier size, which kind of offsets things, but it still leaves you with suboptimal performance.
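
To make the arithmetic behind this concern concrete, here is a minimal sketch in plain Java. It assumes (this is not confirmed anywhere in this thread) that the global segment has base address 0, and all names in it (`AbsoluteAddressAsIndexSketch`, `stridedAddress`, `addressToIndex`) are hypothetical. It models only the address computation, not the benchmark and not the generated code.

```java
class AbsoluteAddressAsIndexSketch {
    static final long CARRIER_SIZE = Integer.BYTES; // 4 bytes for an int carrier

    // Byte address a strided int accessor would touch for a logical index,
    // given a (hypothetical) segment base address.
    static long stridedAddress(long base, long logicalIndex) {
        return base + logicalIndex * CARRIER_SIZE;
    }

    // The workaround under discussion: turn an absolute address into a
    // "logical index" by dividing it by the carrier size (integer division).
    static long addressToIndex(long absoluteAddress) {
        return absoluteAddress / CARRIER_SIZE;
    }

    public static void main(String[] args) {
        long aligned = 0x1000L;    // multiple of 4
        long misaligned = 0x1002L; // not a multiple of 4

        // Address of "element 1 past the given address", computed via the index round trip.
        long viaIndexAligned = stridedAddress(0L, addressToIndex(aligned) + 1);
        long viaIndexMisaligned = stridedAddress(0L, addressToIndex(misaligned) + 1);

        System.out.println("aligned:    wanted 0x" + Long.toHexString(aligned + CARRIER_SIZE)
                + ", got 0x" + Long.toHexString(viaIndexAligned));
        System.out.println("misaligned: wanted 0x" + Long.toHexString(misaligned + CARRIER_SIZE)
                + ", got 0x" + Long.toHexString(viaIndexMisaligned));
    }
}
```

In this model the round trip (divide by 4, then multiply by 4 inside the accessor) recovers the intended address only when the absolute address is a multiple of the carrier size, and it adds a division and a multiplication to every access compared with passing a byte offset directly.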

-------------

PR: https://git.openjdk.java.net/panama-foreign/pull/437

