[foreign-memaccess+abi] RFR: 8291826: Rework MemoryLayout Sealed Hierarchy [v5]

Mon Aug 22 11:28:02 UTC 2022

On Fri, 19 Aug 2022 17:19:59 GMT, Paul Sandoz <psandoz at openjdk.org> wrote:

>> Per Minborg has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Fix problems in static initializers
>
> src/java.base/share/classes/java/lang/foreign/MemorySegment.java line 1159:
> 
>> 1157:     @ForceInline
>> 1158:     default byte get(ValueLayout.OfByte layout, long offset) {
>> 1159:         return (byte) ((ValueLayouts.OfByteImpl) layout).accessHandle().get(this, offset);
> 
> That looks better. Let's measure using micros under test/micro/org/openjdk/bench/java/lang/foreign, search for a few that perform get/set via MemorySegment e.g. the LoopOver* benchmarks are a good candidate set.

I've run some benchmarks:

Main branch ("Baseline"):

Benchmark                                                (polluteProfile)  Mode  Cnt  Score   Error  Units
LoopOverNonConstant.segment_loop_instance                             N/A  avgt   30  0.300 ± 0.003  ms/op
LoopOverNonConstant.segment_loop_instance_index                       N/A  avgt   30  0.321 ± 0.005  ms/op
LoopOverNonConstant.segment_loop_instance_unaligned                   N/A  avgt   30  0.339 ± 0.005  ms/op
LoopOverNonConstantHeap.segment_loop_instance                       false  avgt   30  0.243 ± 0.004  ms/op
LoopOverNonConstantHeap.segment_loop_instance                        true  avgt   30  0.255 ± 0.010  ms/op
LoopOverNonConstantHeap.segment_loop_instance_unaligned             false  avgt   30  0.251 ± 0.011  ms/op
LoopOverNonConstantHeap.segment_loop_instance_unaligned              true  avgt   30  0.254 ± 0.009  ms/op
LoopOverNonConstantMapped.segment_loop_instance                       N/A  avgt   30  0.266 ± 0.012  ms/op
LoopOverNonConstantShared.segment_loop_instance                       N/A  avgt   30  0.256 ± 0.011  ms/op
LoopOverNonConstantShared.segment_loop_instance_address               N/A  avgt   30  0.264 ± 0.010  ms/op

With the proposed solution ("Casting" to a specific `Of*Impl` class):

Benchmark                                                (polluteProfile)  Mode  Cnt  Score   Error  Units
LoopOverNonConstant.segment_loop_instance                             N/A  avgt   30  0.263 ±  0.006  ms/op
LoopOverNonConstant.segment_loop_instance_index                       N/A  avgt   30  0.281 ±  0.004  ms/op
LoopOverNonConstant.segment_loop_instance_unaligned                   N/A  avgt   30  0.286 ±  0.015  ms/op
LoopOverNonConstantHeap.segment_loop_instance                       false  avgt   30  0.282 ±  0.003  ms/op
LoopOverNonConstantHeap.segment_loop_instance                        true  avgt   30  0.272 ±  0.008  ms/op
LoopOverNonConstantHeap.segment_loop_instance_unaligned             false  avgt   30  0.279 ±  0.003  ms/op
LoopOverNonConstantHeap.segment_loop_instance_unaligned              true  avgt   30  0.274 ±  0.009  ms/op
LoopOverNonConstantMapped.segment_loop_instance                       N/A  avgt   30  0.252 ±  0.011  ms/op
LoopOverNonConstantShared.segment_loop_instance                       N/A  avgt   30  0.267 ±  0.009  ms/op
LoopOverNonConstantShared.segment_loop_instance_address               N/A  avgt   30  0.261 ±  0.012  ms/op

These results can be summarized in the following table (values are in ms/op; "LONC" is short for "LoopOverNonConstant"):

Benchmark | Baseline | Casting
-- | -- | --
LONC.segment_loop_instance | 0.300 | 0.263
LONC.segment_loop_instance_index | 0.321 | 0.281
LONC.segment_loop_instance_unaligned | 0.339 | 0.286
LONCHeap.segment_loop_instance (non-polluted) | 0.243 | 0.282
LONCHeap.segment_loop_instance (polluted) | 0.255 | 0.272
LONCHeap.segment_loop_instance_unaligned (non-polluted) | 0.251 | 0.279
LONCHeap.segment_loop_instance_unaligned (polluted) | 0.254 | 0.274
LONCMapped.segment_loop_instance | 0.266 | 0.252
LONCShared.segment_loop_instance | 0.256 | 0.267
LONCShared.segment_loop_instance_address | 0.264 | 0.261

The same results are shown in the following graph (which also shows estimated error margins of around 0.01 ms/op):

![image](https://user-images.githubusercontent.com/7457876/185908494-42b04f68-ce11-463b-a375-0710e38a3607.png)
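To put the table in relative terms, here is a small sketch that computes the percentage change from "Baseline" to "Casting" for each benchmark. The class and method names (`RelativeChange`, `percentChange`) are made up for illustration and are not part of the PR; the scores are copied from the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RelativeChange {

    /**
     * Percentage change from baseline to candidate.
     * Scores are in ms/op, so a negative result means the candidate is faster.
     */
    public static double percentChange(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // {baseline, casting} scores in ms/op, copied from the summary table.
        Map<String, double[]> scores = new LinkedHashMap<>();
        scores.put("LONC.segment_loop_instance", new double[] {0.300, 0.263});
        scores.put("LONC.segment_loop_instance_index", new double[] {0.321, 0.281});
        scores.put("LONC.segment_loop_instance_unaligned", new double[] {0.339, 0.286});
        scores.put("LONCHeap.segment_loop_instance (non-polluted)", new double[] {0.243, 0.282});
        scores.put("LONCHeap.segment_loop_instance (polluted)", new double[] {0.255, 0.272});
        scores.put("LONCHeap.segment_loop_instance_unaligned (non-polluted)", new double[] {0.251, 0.279});
        scores.put("LONCHeap.segment_loop_instance_unaligned (polluted)", new double[] {0.254, 0.274});
        scores.put("LONCMapped.segment_loop_instance", new double[] {0.266, 0.252});
        scores.put("LONCShared.segment_loop_instance", new double[] {0.256, 0.267});
        scores.put("LONCShared.segment_loop_instance_address", new double[] {0.264, 0.261});

        for (Map.Entry<String, double[]> entry : scores.entrySet()) {
            double[] s = entry.getValue();
            // Print the benchmark name and the signed percentage change.
            System.out.printf("%-58s %+6.1f%%%n", entry.getKey(), percentChange(s[0], s[1]));
        }
    }
}
```

By this measure, the native `LONC.segment_loop_instance` case improves by roughly 12%, while the non-polluted `LONCHeap.segment_loop_instance` case regresses by roughly 16%. Several of the remaining differences are close to the reported error margins, so they should be read with care.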

NOTE: The PR ("Casting") contains more than just the casting operations compared to the "Baseline". The benchmarks were performed on a MacBook Pro (16-inch, 2019) with a 2.3 GHz 8-core Intel Core i9, running macOS 12.5.1.
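For readers unfamiliar with the pattern being measured, here is a simplified, self-contained sketch of what "casting to a specific implementation class" looks like. All types here (`ByteLayout`, `ByteLayoutImpl`, `CastingSketch`) are hypothetical stand-ins, not the actual `ValueLayout.OfByte` / `ValueLayouts.OfByteImpl` classes, and a plain `byte[]` replaces the memory segment and var handle used in the real code. It assumes Java 17 or later for sealed types.

```java
public class CastingSketch {

    /** Hypothetical public layout type; sealed so only one implementation can exist. */
    public sealed interface ByteLayout permits ByteLayoutImpl {}

    /** Hypothetical final implementation class holding the actual access logic. */
    public static final class ByteLayoutImpl implements ByteLayout {
        byte read(byte[] data, int offset) {
            return data[offset];
        }
    }

    /**
     * Accessor that takes the public type but casts to the concrete
     * implementation before calling it, so the call targets a final class
     * rather than an interface method.
     */
    public static byte get(ByteLayout layout, byte[] data, int offset) {
        return ((ByteLayoutImpl) layout).read(data, offset);
    }

    public static void main(String[] args) {
        ByteLayout layout = new ByteLayoutImpl();
        byte[] data = {10, 20, 30};
        System.out.println(get(layout, data, 1)); // prints 20
    }
}
```

Whether such a cast helps or hurts in practice depends on how the JIT compiler treats the call site, which is why the numbers above matter more than the sketch.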

-------------

PR: https://git.openjdk.org/panama-foreign/pull/710