RFR (14) 8235837: Memory access API refinements

Wed Jan 15 18:48:55 UTC 2020

Maybe this would be best moved to panama-dev?

In any case, for best performance, use an indexed (or strided) var
handle. Your loop creates a new memory address on each iteration; that
will stop being a problem once MemoryAddress becomes an inline type,
but in the meantime...

We have some benchmarks here:

http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign

Your test seems similar to this:

http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign/LoopOverNew.java

In the panama repo this benchmark gets the same numbers as ByteBuffer,
with the same loop unrolling. Note, though, that the panama repo has one
performance optimization that JDK 14 doesn't yet have, which works
around C2's weaker optimization of loops that use longs. It is an
implementation change that lets us use ints instead of longs in bounds
checks when the API can prove that the segment is small. That work is
described in this thread:

https://mail.openjdk.java.net/pipermail/panama-dev/2020-January/007081.html

The corresponding longer-term C2 fix is tracked here:

https://bugs.openjdk.java.net/browse/JDK-8223051
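
To illustrate the idea only: this is not the actual MemorySegmentImpl
code, and every name in it is invented. A bounds check that takes an
int-only path when the segment is known to be small could be sketched
roughly like this:

```java
// Hypothetical sketch, not the JDK implementation: a bounds check that
// uses int arithmetic when the segment length is known to fit in an int.
final class BoundsCheckSketch {
    private final long length;
    // True when the whole segment can be described with int offsets.
    private final boolean small;

    BoundsCheckSketch(long length) {
        if (length < 0) {
            throw new IllegalArgumentException("negative length");
        }
        this.length = length;
        this.small = length <= Integer.MAX_VALUE;
    }

    // Returns true if [offset, offset + size) lies within the segment.
    boolean inBounds(long offset, long size) {
        if (small && offset == (int) offset && size == (int) size) {
            // int-only comparison, the shape C2 handles well in loops
            return inBoundsInt((int) offset, (int) size, (int) length);
        }
        // General path: long arithmetic
        return size >= 0 && offset >= 0 && offset <= length - size;
    }

    private static boolean inBoundsInt(int offset, int size, int length) {
        return size >= 0 && offset >= 0 && offset <= length - size;
    }
}
```

Both paths give the same answer; the only difference is the width of
the comparison that ends up inside the loop.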

That said, even without that performance fix, I wouldn't expect the
memory access API to be 4x slower. I'd start by dropping the acquire()
(which you probably don't need, and which does a CAS) and moving to an
indexed var handle (by replicating the benchmark code linked above),
then see if that works better.
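
Roughly, the indexed version of such a loop could look like the sketch
below. It assumes the JDK 14 incubator API (compile and run with
--add-modules jdk.incubator.foreign); the class name, ELEMS and the
method are made up for illustration, so treat it as a starting point
rather than as the benchmark itself:

```java
import java.lang.invoke.VarHandle;

import jdk.incubator.foreign.MemoryAddress;
import jdk.incubator.foreign.MemoryLayout;
import jdk.incubator.foreign.MemoryLayout.PathElement;
import jdk.incubator.foreign.MemoryLayouts;
import jdk.incubator.foreign.MemorySegment;

// Sketch only: assumes JDK 14 with --add-modules jdk.incubator.foreign.
final class IndexedHandleSketch {
    // Hypothetical element count, standing in for ARRAY_SIZE.
    static final int ELEMS = 1024;

    // Indexed var handle with coordinates (MemoryAddress base, long index).
    // The handle scales the index by the element size, so the loop does
    // not need to create a new address for every store.
    static final VarHandle INT_AT_INDEX =
            MemoryLayout.ofSequence(ELEMS, MemoryLayouts.JAVA_INT)
                        .varHandle(int.class, PathElement.sequenceElement());

    static int fillAndReadLast(int value) {
        // No acquire(): the segment is used directly by the owning thread.
        try (MemorySegment segment = MemorySegment.allocateNative(ELEMS * 4L)) {
            MemoryAddress base = segment.baseAddress();
            for (int i = 0; i < ELEMS; i++) {
                // Same base address every time; only the index changes.
                INT_AT_INDEX.set(base, (long) i, value);
            }
            return (int) INT_AT_INDEX.get(base, (long) (ELEMS - 1));
        }
    }
}
```

The base address is computed once outside the loop, and the index is
passed as a long to match the handle's coordinate type.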

Maurizio

On 15/01/2020 18:00, Andrew Haley wrote:
> On 1/9/20 4:37 PM, Maurizio Cimadamore wrote:
>> There you go
>>
>> cr.openjdk.java.net/~mcimadamore/8235837_javadoc
> Thank you.
>
> So I've been kicking the tyres, and I'm rather surprised at how poor
> the performance seems to be. My simple test, like this:
>
>      @Benchmark
>      public void intHandleTest(BenchmarkState state) {
>          try (var segment = BenchmarkState.segment.acquire()) {
>              var base = segment.baseAddress();
>              final var byteSize = ARRAY_SIZE * 4;
>              for (int i = 0; i < byteSize; i += 4) {
>                  BenchmarkState.intHandle.set(base.offset(i), (int) 4);
>              }
>          }
>      }
>
> has a great deal of overhead. It was a bit of a struggle to get it to
> unroll nicely, and the best I could get was
>
>    6.90%  │  0x00007faeeff7dec8:   mov    r9d,r11d
>           │  0x00007faeeff7decb:   add    r9d,0x4                      ;*iinc {reexecute=0 rethrow=0 return_oop=0}
>           │                                                            ; - org.sample.MemoryHandlesTest::intHandleTest at 45 (line 34)
>           │                                                            ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub at 17 (line 191)
>           │  0x00007faeeff7decf:   mov    rdx,rbx
>           │  0x00007faeeff7ded2:   add    rdx,0x10                     ;*i2l {reexecute=0 rethrow=0 return_oop=0}
>           │                                                            ; - org.sample.MemoryHandlesTest::intHandleTest at 35 (line 35)
>           │                                                            ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub at 17 (line 191)
>    0.06%  │  0x00007faeeff7ded6:   cmp    rdx,rdi
>           │  0x00007faeeff7ded9:   jg     0x00007faeeff7df94           ;*ifle {reexecute=0 rethrow=0 return_oop=0}
>           │                                                            ; - jdk.internal.foreign.MemorySegmentImpl::checkBounds at 20 (line 196)
>           │                                                            ; - jdk.internal.foreign.MemorySegmentImpl::checkRange at 29 (line 178)
>           │                                                            ; - jdk.internal.foreign.MemoryAddressImpl::checkAccess at 21 (line 84)
>           │                                                            ; - java.lang.invoke.VarHandleMemoryAddressAsInts::checkAddress at 15 (line 50)
>           │                                                            ; - java.lang.invoke.VarHandleMemoryAddressAsInts::set0 at 7 (line 85)
>           │                                                            ; - java.lang.invoke.VarHandleMemoryAddressAsInts0/0x0000000800bc3840::set at 7
>           │                                                            ; - java.lang.invoke.VarHandleGuards::guard_LI_V at 33 (line 114)
>           │                                                            ; - org.sample.MemoryHandlesTest::intHandleTest at 42 (line 35)
>           │                                                            ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub at 17 (line 191)
>           │  0x00007faeeff7dedf:   mov    DWORD PTR [rsi+0x10],0x4     ;*invokevirtual putIntUnaligned {reexecute=0 rethrow=0 return_oop=0}
>           │                                                            ; - jdk.internal.misc.Unsafe::putIntUnaligned at 10 (line 3693)
>           │                                                            ; - java.lang.invoke.VarHandleMemoryAddressAsInts::set0 at 38 (line 86)
>           │                                                            ; - java.lang.invoke.VarHandleMemoryAddressAsInts0/0x0000000800bc3840::set at 7
>           │                                                            ; - java.lang.invoke.VarHandleGuards::guard_LI_V at 33 (line 114)
>           │                                                            ; - org.sample.MemoryHandlesTest::intHandleTest at 42 (line 35)
>           │                                                            ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub at 17 (line 191)
>
> for every store. In contrast, similar ByteBuffer code looks like:
>
>
>    0.08%   ↗  0x00007f3b5bf717c0:   movsxd r13,r8d
>    0.16%   │  0x00007f3b5bf717c3:   mov    r14,rdx
>            │  0x00007f3b5bf717c6:   add    r14,r13
>    1.00%   │  0x00007f3b5bf717c9:   movsxd r13,r8d
>    0.04%   │  0x00007f3b5bf717cc:   vmovdqu YMMWORD PTR [rdx+r13*1],ymm4
>    6.87%   │  0x00007f3b5bf717d2:   vmovdqu YMMWORD PTR [r14+0x20],ymm4
>    5.77%   │  0x00007f3b5bf717d8:   vmovdqu YMMWORD PTR [r14+0x40],ymm4
>    3.99%   │  0x00007f3b5bf717de:   vmovdqu YMMWORD PTR [r14+0x60],ymm4
>    6.09%   │  0x00007f3b5bf717e4:   vmovdqu YMMWORD PTR [r14+0x80],ymm4
>    4.97%   │  0x00007f3b5bf717ed:   vmovdqu YMMWORD PTR [r14+0xa0],ymm4
>    4.93%   │  0x00007f3b5bf717f6:   vmovdqu YMMWORD PTR [r14+0xc0],ymm4
>    5.07%   │  0x00007f3b5bf717ff:   vmovdqu YMMWORD PTR [r14+0xe0],ymm4
>    4.87%   │  0x00007f3b5bf71808:   vmovdqu YMMWORD PTR [r14+0x100],ymm4
>    7.39%   │  0x00007f3b5bf71811:   vmovdqu YMMWORD PTR [r14+0x120],ymm4
>    5.19%   │  0x00007f3b5bf7181a:   vmovdqu YMMWORD PTR [r14+0x140],ymm4
>    6.21%   │  0x00007f3b5bf71823:   vmovdqu YMMWORD PTR [r14+0x160],ymm4
>    4.93%   │  0x00007f3b5bf7182c:   vmovdqu YMMWORD PTR [r14+0x180],ymm4
>    5.69%   │  0x00007f3b5bf71835:   vmovdqu YMMWORD PTR [r14+0x1a0],ymm4
>   11.28%   │  0x00007f3b5bf7183e:   vmovdqu YMMWORD PTR [r14+0x1c0],ymm4
>    4.83%   │  0x00007f3b5bf71847:   vmovdqu YMMWORD PTR [r14+0x1e0],ymm4;*invokevirtual putIntUnaligned {reexecute=0 rethrow=0 return_oop=0}
>            │                                                            ; - jdk.internal.misc.Unsafe::putIntUnaligned at 10 (line 3693)
>            │                                                            ; - java.nio.DirectByteBuffer::putInt at 18 (line 860)
>            │                                                            ; - java.nio.DirectByteBuffer::putInt at 12 (line 881)
>            │                                                            ; - org.sample.ByteBufferTest::floss at 15 (line 34)
>            │                                                            ; - org.sample.ByteBufferTest::test at 14 (line 42)
>            │                                                            ; - org.sample.generated.ByteBufferTest_test_jmhTest::test_avgt_jmhStub at 17 (line 241)
>    2.85%   │  0x00007f3b5bf71850:   add    r8d,0x200                    ;*iinc {reexecute=0 rethrow=0 return_oop=0}
>            │                                                            ; - org.sample.ByteBufferTest::floss at 19 (line 33)
>            │                                                            ; - org.sample.ByteBufferTest::test at 14 (line 42)
>            │                                                            ; - org.sample.generated.ByteBufferTest_test_jmhTest::test_avgt_jmhStub at 17 (line 241)
>            │  0x00007f3b5bf71857:   cmp    r8d,ecx
>            ╰  0x00007f3b5bf7185a:   jl     0x00007f3b5bf717c0           ;*goto {reexecute=0 rethrow=0 return_oop=0}
>
> nice, eh?
>
> Benchmark                             Mode  Cnt     Score       Error  Units
> ByteBufferTest.test                   avgt    5   620.628 ±     2.947  ns/op
> MemoryHandlesTest.intHandleTest       avgt    5  2778.602 ± 10557.068  ns/op
>
> Could it be that some C2 improvements or similar are proposed?
>