RFR (14) 8235837: Memory access API refinements
Andrew Haley
aph at redhat.com
Thu Jan 16 14:50:44 UTC 2020
On 1/15/20 6:48 PM, Maurizio Cimadamore wrote:
> Maybe this would be best moved on panama-dev?
Here we are.
> In any case, for obtaining best performances, it is best to use an
> indexed (or strided) var handle - your loop will create a new memory
> address on each new iteration, which will not be a problem once
> MemoryAddress will be an inline type, but in the meantime...
OK. It's rather unfortunate that your example at the very top of the
API description uses an anti-pattern, at least from a performance
point of view. Might I suggest that it should be changed to the best
way to do the job?
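To make the shape of the suggestion concrete without depending on the incubating API, here is a minimal sketch that uses the standard java.lang.invoke ByteBuffer view VarHandle as a stand-in. It is an analogy only, not the jdk.incubator.foreign indexed var handle: the class name IndexedHandleSketch and the method fill are hypothetical. The point is just that the handle is created once and the loop passes an offset, rather than building a new address object on each iteration.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class IndexedHandleSketch {
    // Created once, outside the loop. Access coordinates are
    // (ByteBuffer buffer, int byteOffset).
    static final VarHandle INT_HANDLE =
            MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

    // Hypothetical helper: stores the loop index into each int slot.
    static ByteBuffer fill(int elems) {
        ByteBuffer bb = ByteBuffer.allocateDirect(elems * Integer.BYTES)
                                  .order(ByteOrder.nativeOrder());
        for (int i = 0; i < elems; i++) {
            // Only the offset changes per iteration; no per-element object is created.
            INT_HANDLE.set(bb, i * Integer.BYTES, i);
        }
        return bb;
    }

    public static void main(String[] args) {
        ByteBuffer bb = fill(8);
        System.out.println((int) INT_HANDLE.get(bb, 3 * Integer.BYTES));
    }
}
```

Whether the real indexed memory access var handle compiles to equally tight code is exactly what the benchmarks in this thread are probing, so this sketch says nothing about performance.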
> We have some benchmarks here:
>
> http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign
>
> Your test seems similar to this:
>
> http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign/LoopOverNew.java
>
> In the panama repo this benchmark obtains same numbers as bytebuffer,
> and same loop unrolling (but the panama repo has one performance
> optimization that JDK 14 doesn't yet have, to workaround the lack of
> optimization with longs used in loops).
I'm using the panama-dev repo, checked out yesterday. Your own
benchmark shows this for memory segments (unrolled twice):
22.84% ││ 0x00007f628b772159: mov rdi,rdx
0.03% ││ 0x00007f628b77215c: add rdi,0x4
0.02% ││ 0x00007f628b772160: mov r9d,r10d
││ 0x00007f628b772163: inc r9d
││ 0x00007f628b772166: cmp rdi,rsi
││ 0x00007f628b772169: jg 0x00007f628b77228b
││ 0x00007f628b77216f: mov DWORD PTR [rbx+0x4],r9d
5.56% ││ 0x00007f628b772173: mov rdi,rdx
││ 0x00007f628b772176: add rdi,0x8
0.53% ││ 0x00007f628b77217a: mov r9d,r10d
0.01% ││ 0x00007f628b77217d: add r9d,0x2
0.01% ││ 0x00007f628b772181: cmp rdi,rsi
0.01% ││ 0x00007f628b772184: jg 0x00007f628b772286
││ 0x00007f628b77218a: mov DWORD PTR [rbx+0x8],r9d
And this for ByteBuffers:
15.62% ││ 0x00007fca53f70bca: mov r9d,ebx
0.02% ││ 0x00007fca53f70bcd: add r9d,0x2
0.02% ││ 0x00007fca53f70bd1: mov DWORD PTR [rcx+0x8],r9d
3.23% ││ 0x00007fca53f70bd5: mov r9d,ebx
││ 0x00007fca53f70bd8: add r9d,0x3
││ 0x00007fca53f70bdc: mov DWORD PTR [rcx+0xc],r9d
The bounds checks in the memory segment version are well predicted as
you'd expect, but this isn't good code.
If you change the store to a constant rather than the loop index, the
memory segment version looks much the same as above, but the
ByteBuffer version is unrolled and vectorized:
0.50% ││ 0x00007f5d9ff7262f: vmovdqu YMMWORD PTR [rbp+0x0],ymm0
2.36% ││ 0x00007f5d9ff72634: vmovdqu YMMWORD PTR [rbp+0x20],ymm0
...
Benchmark                 Mode  Cnt  Score   Error  Units
LoopOverNew.buffer_loop   avgt   10  0.286 ± 0.008  ms/op
LoopOverNew.segment_loop  avgt   10  0.513 ± 0.029  ms/op
LoopOverNew.unsafe_loop   avgt   10  0.348 ± 0.010  ms/op
I don't know why the Unsafe version fails to vectorize, but it's still
better than the memory segment version.
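For reference, the two store variants compared above (storing the loop index versus storing a constant) look roughly like this for the ByteBuffer case. This is a minimal sketch and not the actual LoopOverNew benchmark source: the class name StoreLoops, the method names, and the element counts are hypothetical, and it is not a JMH harness, so it shows the loop shapes only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class StoreLoops {
    // Variant 1: the stored value depends on the loop index.
    static ByteBuffer fillWithIndex(int elems) {
        ByteBuffer bb = ByteBuffer.allocateDirect(elems * Integer.BYTES)
                                  .order(ByteOrder.nativeOrder());
        for (int i = 0; i < elems; i++) {
            bb.putInt(i * Integer.BYTES, i);
        }
        return bb;
    }

    // Variant 2: the stored value is loop-invariant, which is the
    // case where the ByteBuffer loop was seen to vectorize.
    static ByteBuffer fillWithConstant(int elems, int value) {
        ByteBuffer bb = ByteBuffer.allocateDirect(elems * Integer.BYTES)
                                  .order(ByteOrder.nativeOrder());
        for (int i = 0; i < elems; i++) {
            bb.putInt(i * Integer.BYTES, value);
        }
        return bb;
    }

    public static void main(String[] args) {
        System.out.println(fillWithIndex(4).getInt(2 * Integer.BYTES));
        System.out.println(fillWithConstant(4, 42).getInt(3 * Integer.BYTES));
    }
}
```

Both loops use an int induction variable with a bound known before the loop starts, which matches the situation described for the benchmark.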
> This has been rectified with an implementation change which allows
> us to use ints instead of longs in bound checks, when the API can
> prove that the segment is small - that work is described in this
> thread:
> https://mail.openjdk.java.net/pipermail/panama-dev/2020-January/007081.html
I guess that change still isn't committed to the Panama repo? But
anyway, the loops are all ints, with a constant bound.
> And the corresponding, longer term C2 fix is captured here:
>
> https://bugs.openjdk.java.net/browse/JDK-8223051
>
> That said, even w/o that performance fix, I wouldn't expect the memory
> access API to be 4x slower. I'd start by dropping the acquire() [which
> you probably don't need and it's doing a CAS], and moving to indexed var
> handle (by replicating the benchmark code linked above) and see if that
> works better.
--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671