[foreign-memaccess+abi] RFR: Performance improvement to unchecked segment ofNativeRestricted [v2]

Radoslaw Smogura github.com+7535718+rsmogura at openjdk.java.net
Thu Jan 14 23:24:23 UTC 2021


> This change removes the range and temporal checks for the `ofNativeRestricted` segment by turning them into no-ops. Because this segment is global, those checks are not needed.
> 
> The generated native code is smaller, and execution can outperform plain Java array access (depending on the CPU).
> Changed
> Benchmark                           Mode  Cnt          Score        Error  Units
> AccessBenchmark.foreignAddress     thrpt    5  128946129.691 ± 317433.113  ops/s
> AccessBenchmark.foreignAddressRaw  thrpt    5  136883439.221 ± 749390.255  ops/s
> AccessBenchmark.target             thrpt    5  125325586.957 ±  32129.931  ops/s
> Base
> Benchmark                           Mode  Cnt          Score        Error  Units
> AccessBenchmark.foreignAddress     thrpt    5  125257424.876 ± 230508.169  ops/s
> AccessBenchmark.foreignAddressRaw  thrpt    5  128818591.434 ± 241806.765  ops/s
> AccessBenchmark.target             thrpt    5  125083379.819 ± 184070.467  ops/s
> ---
> This PR is a replacement for https://github.com/openjdk/panama-foreign/pull/431 (OCA)
> and was partially discussed (before changes) in https://mail.openjdk.java.net/pipermail/panama-dev/2021-January/011747.htm
> 
> ---
> Benchmark
> @State(Scope.Thread)
> public class AccessBenchmark {
>     static final MemorySegment ms = MemorySegment.ofNativeRestricted();
>     static final VarHandle intHandle = MemoryHandles.varHandle(int.class, ByteOrder.nativeOrder());
> 
>     int[] intData = new int[12];
>     volatile int intDataOffset = 0;
> 
>     volatile MemoryAddress address;
>     volatile long addressRaw;
> 
>     @Setup
>     public void setup() {
>         var ms = MemorySegment.allocateNative(256);
>         address = ms.address();
>         addressRaw = address.toRawLongValue();
>     }
> 
>     @Benchmark
>     public void target(Blackhole bh) {
>         int[] local = intData;
>         int localOffset = intDataOffset;
>         bh.consume(local[localOffset]);
>         bh.consume(local[localOffset + 1]);
>     }
> 
>     @Benchmark
>     public void foreignAddress(Blackhole bh) {
>         var a = address;
>         bh.consume((int) intHandle.get(ms, a.addOffset(0).toRawLongValue()));
>         bh.consume((int) intHandle.get(ms, a.addOffset(4).toRawLongValue()));
>     }
> 
>     @Benchmark
>     public void foreignAddressRaw(Blackhole bh) {
>         var a = addressRaw;
>         bh.consume((int) intHandle.get(ms, a));
>         bh.consume((int) intHandle.get(ms, a + 4));
>     }
> }
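
The idea in the description above (range and temporal checks turned into no-ops for a segment that is global) can be illustrated with a minimal sketch. It is not the patch itself and does not use the jdk.incubator.foreign API: `SketchSegment`, `BoundedSegment`, `GlobalSegment` and `checkedOffset` are hypothetical names made up for illustration, and the assumption is only that each access first runs a liveness check and a bounds check.

```java
// Hypothetical model only: none of these classes exist in jdk.incubator.foreign.
final class SegmentChecksSketch {

    /** A segment that validates every access before memory would be touched. */
    abstract static class SketchSegment {
        abstract void checkRange(long offset, long length); // spatial (range) check
        abstract void checkAlive();                         // temporal check

        /** Runs both checks and returns the offset that would be dereferenced. */
        final long checkedOffset(long offset, long length) {
            checkAlive();
            checkRange(offset, length);
            return offset;
        }
    }

    /** Ordinary segment: fixed size, can be closed, so both checks do real work. */
    static final class BoundedSegment extends SketchSegment {
        private final long size;
        private volatile boolean alive = true;

        BoundedSegment(long size) { this.size = size; }

        void close() { alive = false; }

        @Override
        void checkRange(long offset, long length) {
            if (offset < 0 || length < 0 || offset > size - length) {
                throw new IndexOutOfBoundsException("offset " + offset + ", length " + length);
            }
        }

        @Override
        void checkAlive() {
            if (!alive) throw new IllegalStateException("segment is closed");
        }
    }

    /** Global segment: covers all memory and is never closed, so both checks are empty. */
    static final class GlobalSegment extends SketchSegment {
        @Override void checkRange(long offset, long length) { /* no-op */ }
        @Override void checkAlive() { /* no-op */ }
    }

    public static void main(String[] args) {
        BoundedSegment bounded = new BoundedSegment(256);
        System.out.println(bounded.checkedOffset(252, 4)); // last valid int slot
        try {
            bounded.checkedOffset(254, 4);                 // runs past the end
        } catch (IndexOutOfBoundsException ex) {
            System.out.println("rejected: " + ex.getMessage());
        }
        GlobalSegment global = new GlobalSegment();
        System.out.println(global.checkedOffset(1L << 40, 4)); // accepted unchecked
    }
}
```

With empty method bodies, a JIT compiler has nothing left to emit for the checks once the calls are inlined, which is the effect the description attributes to the change. Whether HotSpot actually does so for the real classes is what the benchmark is meant to measure.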

Radoslaw Smogura has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains four new commits since the last revision:

 - Small naming & comments improvements.
 - Revert "Next iteration of tuning"
   
   This change was introduced because the JVM was found to emit a
   null check and inline the empty method. However, this
   behavior can no longer be observed, so the change is reverted,
   as it can cause a number of NPEs.
   
   This reverts commit 9e29818b8a2f4ba3a3bec8a1edace072c993ccd4.
 - Next iteration of tuning
   
   After checking the source code, it looks better to set the scope to `null`.
   
   The results outpaced plain Java array access.
   
   ```
   Benchmark                           Mode  Cnt         Score          Error  Units
   AccessBenchmark.foreignAddress     thrpt    4  86860188.499 ± 13454393.406  ops/s
   AccessBenchmark.foreignAddressRaw  thrpt    4  96150181.668 ±  7025145.700  ops/s
   AccessBenchmark.target             thrpt    4  93673099.539 ± 23272596.145  ops/s
   ```
   
   versus tests on original repo
   ```
   Benchmark                           Mode  Cnt         Score         Error  Units
   AccessBenchmark.foreignAddress     thrpt    4  81907199.092 ± 2663269.652  ops/s
   AccessBenchmark.foreignAddressRaw  thrpt    4  83629168.611 ± 1025857.535  ops/s
   AccessBenchmark.target             thrpt    4  94023553.582 ± 6128411.421  ops/s
   ```
   
   # Benchmark code
   ```
   @State(Scope.Thread)
   public class AccessBenchmark {
       static final MemorySegment ms = MemorySegment.ofNativeRestricted();
       static final VarHandle intHandle = MemoryHandles.varHandle(int.class, ByteOrder.nativeOrder());
   
       int[] intData = new int[12];
       volatile int intDataOffset = 0;
   
       volatile MemoryAddress address;
       volatile long addressRaw;
   
       @Setup
       public void setup() {
           var ms = MemorySegment.allocateNative(256);
           address = ms.address();
           addressRaw = address.toRawLongValue();
       }
   
       @Benchmark
       public void target(Blackhole bh) {
           int[] local = intData;
           int localOffset = intDataOffset;
           bh.consume(local[localOffset]);
           bh.consume(local[localOffset + 1]);
       }
   
       @Benchmark
       public void foreignAddress(Blackhole bh) {
           var a = address;
           bh.consume((int) intHandle.get(ms, a.addOffset(0).toRawLongValue()));
           bh.consume((int) intHandle.get(ms, a.addOffset(4).toRawLongValue()));
       }
   
       @Benchmark
       public void foreignAddressRaw(Blackhole bh) {
           var a = addressRaw;
           bh.consume((int) intHandle.get(ms, a));
           bh.consume((int) intHandle.get(ms, a + 4));
       }
   }
   
   ```
 - [WIP] Performance improvement to unchecked segment ofNativeRestricted
   
   Accessing native memory through ofNativeRestricted can generate range and temporal checks. As this scope can't be closed and the segment represents the whole memory, these checks are not needed; they are leftovers from NativeMemorySegmentImpl.
   
   To overcome this, adding a special segment and scope that allow HotSpot to optimize the code better would be a good solution.
   
   The JMH benchmarks, baselined against the performance of plain array access, show an improvement from 89% of array-access throughput to 94% (% = foreignAddress / target).
   
   Improved version
   ```
   Benchmark                        Mode  Cnt         Score          Error  Units
   AccessBenchmark.foreignAddress  thrpt    4  87981021.113 ±  4496953.479  ops/s
   AccessBenchmark.target          thrpt    4  92840761.490 ± 15994108.441  ops/s
   ```
   
   Original version
   ```
   Benchmark                        Mode  Cnt         Score         Error  Units
   AccessBenchmark.foreignAddress  thrpt    4  82076915.820 ± 3076568.791  ops/s
   AccessBenchmark.target          thrpt    4  91962637.002 ± 5104697.571  ops/s
   ```
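
The percentages quoted in the WIP commit message follow from the formula it gives (% = foreignAddress / target). As a small worked example, the sketch below recomputes both ratios from the scores listed under "Original version" and "Improved version". The class and method names are made up for illustration, and the reported error margins are ignored, so the result is only a point estimate.

```java
// Illustration only: recomputes the ratio quoted in the commit message
// from the JMH scores listed there (ops/s), ignoring the error margins.
final class ThroughputRatio {

    /** Foreign-access throughput as a fraction of the plain array baseline. */
    static double ratio(double foreignOpsPerSec, double targetOpsPerSec) {
        return foreignOpsPerSec / targetOpsPerSec;
    }

    public static void main(String[] args) {
        // Scores copied from the "Original version" table.
        double original = ratio(82076915.820, 91962637.002);
        // Scores copied from the "Improved version" table.
        double improved = ratio(87981021.113, 92840761.490);

        System.out.printf("original: %.2f%% of array access%n", original * 100);
        System.out.printf("improved: %.2f%% of array access%n", improved * 100);
    }
}
```

The original ratio comes out between 89% and 90%, and the improved one between 94% and 95%. The `target` error in the improved run (± 15994108.441 ops/s) is large relative to the gap being measured, so the ratio should be read as indicative rather than precise.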

-------------

Changes:
  - all: https://git.openjdk.java.net/panama-foreign/pull/437/files
  - new: https://git.openjdk.java.net/panama-foreign/pull/437/files/98ad3a9c..c7d4fdf1

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=panama-foreign&pr=437&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=panama-foreign&pr=437&range=00-01

  Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/panama-foreign/pull/437.diff
  Fetch: git fetch https://git.openjdk.java.net/panama-foreign pull/437/head:pull/437

PR: https://git.openjdk.java.net/panama-foreign/pull/437

