RFR (14) 8235837: Memory access API refinements
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Wed Jan 15 18:48:55 UTC 2020
Maybe this would be best moved to panama-dev?
In any case, to get the best performance it is better to use an
indexed (or strided) var handle: your loop creates a new memory
address on each iteration, which will not be a problem once
MemoryAddress is an inline type, but in the meantime...
We have some benchmarks here:
http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign
Your test seems similar to this:
http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign/LoopOverNew.java
In the panama repo this benchmark obtains the same numbers as ByteBuffer,
with the same loop unrolling (but the panama repo has one performance
optimization that JDK 14 doesn't yet have, to work around C2's lack of
optimization for longs used in loops). There, this has been addressed
with an implementation change which allows us to use ints instead of
longs in bound checks, when the API can prove that the segment is small.
That work is described in this thread:
https://mail.openjdk.java.net/pipermail/panama-dev/2020-January/007081.html
And the corresponding, longer-term C2 fix is captured here:
https://bugs.openjdk.java.net/browse/JDK-8223051
That said, even without that performance fix, I wouldn't expect the
memory access API to be 4x slower. I'd start by dropping the acquire()
[which you probably don't need, and which does a CAS], then move to an
indexed var handle (by replicating the benchmark code linked above) and
see if that works better.
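
As a rough sketch of the access shape suggested above (not the incubator
API itself): the example below uses the standard
MethodHandles.byteBufferViewVarHandle from java.lang.invoke as a
stand-in, since the jdk.incubator.foreign classes need a JDK that ships
that incubator module. The class and method names (IndexedHandleSketch,
fill) are made up for illustration. What it tries to show is that the
var handle is created once, the base coordinate stays fixed, and only an
index changes per iteration, so no new address object is built inside
the loop.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class IndexedHandleSketch {
    // Created once and kept in a static final so the JIT can treat it as a constant.
    // Coordinates are (ByteBuffer base, int byteIndex); this is the standard
    // ByteBuffer view handle, used here only as an analogy for an indexed handle.
    static final VarHandle INT_HANDLE =
            MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

    // Hypothetical helper: writes `value` into every int slot of a fresh direct buffer.
    static ByteBuffer fill(int elementCount, int value) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(elementCount * Integer.BYTES)
                                      .order(ByteOrder.nativeOrder());
        for (int i = 0; i < elementCount; i++) {
            // The base (buffer) is loop-invariant; only the byte index varies.
            INT_HANDLE.set(buffer, i * Integer.BYTES, value);
        }
        return buffer;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = fill(1024, 4);
        System.out.println(buffer.getInt(0) + " " + buffer.getInt(1023 * Integer.BYTES));
    }
}
```

With the memory access API the equivalent would presumably be a var
handle derived from a sequence layout, as in the LoopOverNew benchmark
linked above, with the segment base address passed unchanged on every
call.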
Maurizio
On 15/01/2020 18:00, Andrew Haley wrote:
> On 1/9/20 4:37 PM, Maurizio Cimadamore wrote:
>> There you go
>>
>> cr.openjdk.java.net/~mcimadamore/8235837_javadoc
> Thank you.
>
> So I've been kicking the tyres, and I'm rather surprised at how poor
> the performance seems to be. My simple test, like this:
>
> @Benchmark
> public void intHandleTest(BenchmarkState state) {
>     try (var segment = BenchmarkState.segment.acquire()) {
>         var base = segment.baseAddress();
>         final var byteSize = ARRAY_SIZE * 4;
>         for (int i = 0; i < byteSize; i += 4) {
>             BenchmarkState.intHandle.set(base.offset(i), (int) 4);
>         }
>     }
> }
>
> has a great deal of overhead. It was a bit of a struggle to get it to
> unroll nicely, and the best I could get was
>
> 6.90% │ 0x00007faeeff7dec8: mov r9d,r11d
> │ 0x00007faeeff7decb: add r9d,0x4 ;*iinc {reexecute=0 rethrow=0 return_oop=0}
> │ ; - org.sample.MemoryHandlesTest::intHandleTest at 45 (line 34)
> │ ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub at 17 (line 191)
> │ 0x00007faeeff7decf: mov rdx,rbx
> │ 0x00007faeeff7ded2: add rdx,0x10 ;*i2l {reexecute=0 rethrow=0 return_oop=0}
> │ ; - org.sample.MemoryHandlesTest::intHandleTest at 35 (line 35)
> │ ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub at 17 (line 191)
> 0.06% │ 0x00007faeeff7ded6: cmp rdx,rdi
> │ 0x00007faeeff7ded9: jg 0x00007faeeff7df94 ;*ifle {reexecute=0 rethrow=0 return_oop=0}
> │ ; - jdk.internal.foreign.MemorySegmentImpl::checkBounds at 20 (line 196)
> │ ; - jdk.internal.foreign.MemorySegmentImpl::checkRange at 29 (line 178)
> │ ; - jdk.internal.foreign.MemoryAddressImpl::checkAccess at 21 (line 84)
> │ ; - java.lang.invoke.VarHandleMemoryAddressAsInts::checkAddress at 15 (line 50)
> │ ; - java.lang.invoke.VarHandleMemoryAddressAsInts::set0 at 7 (line 85)
> │ ; - java.lang.invoke.VarHandleMemoryAddressAsInts0/0x0000000800bc3840::set at 7
> │ ; - java.lang.invoke.VarHandleGuards::guard_LI_V at 33 (line 114)
> │ ; - org.sample.MemoryHandlesTest::intHandleTest at 42 (line 35)
> │ ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub at 17 (line 191)
> │ 0x00007faeeff7dedf: mov DWORD PTR [rsi+0x10],0x4 ;*invokevirtual putIntUnaligned {reexecute=0 rethrow=0 return_oop=0}
> │ ; - jdk.internal.misc.Unsafe::putIntUnaligned at 10 (line 3693)
> │ ; - java.lang.invoke.VarHandleMemoryAddressAsInts::set0 at 38 (line 86)
> │ ; - java.lang.invoke.VarHandleMemoryAddressAsInts0/0x0000000800bc3840::set at 7
> │ ; - java.lang.invoke.VarHandleGuards::guard_LI_V at 33 (line 114)
> │ ; - org.sample.MemoryHandlesTest::intHandleTest at 42 (line 35)
> │ ; - org.sample.generated.MemoryHandlesTest_intHandleTest_jmhTest::intHandleTest_avgt_jmhStub at 17 (line 191)
>
> for every store. In contrast, similar ByteBuffer code looks like:
>
>
> 0.08% ↗ 0x00007f3b5bf717c0: movsxd r13,r8d
> 0.16% │ 0x00007f3b5bf717c3: mov r14,rdx
> │ 0x00007f3b5bf717c6: add r14,r13
> 1.00% │ 0x00007f3b5bf717c9: movsxd r13,r8d
> 0.04% │ 0x00007f3b5bf717cc: vmovdqu YMMWORD PTR [rdx+r13*1],ymm4
> 6.87% │ 0x00007f3b5bf717d2: vmovdqu YMMWORD PTR [r14+0x20],ymm4
> 5.77% │ 0x00007f3b5bf717d8: vmovdqu YMMWORD PTR [r14+0x40],ymm4
> 3.99% │ 0x00007f3b5bf717de: vmovdqu YMMWORD PTR [r14+0x60],ymm4
> 6.09% │ 0x00007f3b5bf717e4: vmovdqu YMMWORD PTR [r14+0x80],ymm4
> 4.97% │ 0x00007f3b5bf717ed: vmovdqu YMMWORD PTR [r14+0xa0],ymm4
> 4.93% │ 0x00007f3b5bf717f6: vmovdqu YMMWORD PTR [r14+0xc0],ymm4
> 5.07% │ 0x00007f3b5bf717ff: vmovdqu YMMWORD PTR [r14+0xe0],ymm4
> 4.87% │ 0x00007f3b5bf71808: vmovdqu YMMWORD PTR [r14+0x100],ymm4
> 7.39% │ 0x00007f3b5bf71811: vmovdqu YMMWORD PTR [r14+0x120],ymm4
> 5.19% │ 0x00007f3b5bf7181a: vmovdqu YMMWORD PTR [r14+0x140],ymm4
> 6.21% │ 0x00007f3b5bf71823: vmovdqu YMMWORD PTR [r14+0x160],ymm4
> 4.93% │ 0x00007f3b5bf7182c: vmovdqu YMMWORD PTR [r14+0x180],ymm4
> 5.69% │ 0x00007f3b5bf71835: vmovdqu YMMWORD PTR [r14+0x1a0],ymm4
> 11.28% │ 0x00007f3b5bf7183e: vmovdqu YMMWORD PTR [r14+0x1c0],ymm4
> 4.83% │ 0x00007f3b5bf71847: vmovdqu YMMWORD PTR [r14+0x1e0],ymm4;*invokevirtual putIntUnaligned {reexecute=0 rethrow=0 return_oop=0}
> │ ; - jdk.internal.misc.Unsafe::putIntUnaligned at 10 (line 3693)
> │ ; - java.nio.DirectByteBuffer::putInt at 18 (line 860)
> │ ; - java.nio.DirectByteBuffer::putInt at 12 (line 881)
> │ ; - org.sample.ByteBufferTest::floss at 15 (line 34)
> │ ; - org.sample.ByteBufferTest::test at 14 (line 42)
> │ ; - org.sample.generated.ByteBufferTest_test_jmhTest::test_avgt_jmhStub at 17 (line 241)
> 2.85% │ 0x00007f3b5bf71850: add r8d,0x200 ;*iinc {reexecute=0 rethrow=0 return_oop=0}
> │ ; - org.sample.ByteBufferTest::floss at 19 (line 33)
> │ ; - org.sample.ByteBufferTest::test at 14 (line 42)
> │ ; - org.sample.generated.ByteBufferTest_test_jmhTest::test_avgt_jmhStub at 17 (line 241)
> │ 0x00007f3b5bf71857: cmp r8d,ecx
> ╰ 0x00007f3b5bf7185a: jl 0x00007f3b5bf717c0 ;*goto {reexecute=0 rethrow=0 return_oop=0}
>
> nice, eh?
>
> Benchmark Mode Cnt Score Error Units
> ByteBufferTest.test avgt 5 620.628 ± 2.947 ns/op
> MemoryHandlesTest.intHandleTest avgt 5 2778.602 ± 10557.068 ns/op
>
> Could it be that some C2 improvements or similar are proposed?
>