RFR (14) 8235837: Memory access API refinements

Thu Jan 16 14:50:44 UTC 2020

On 1/15/20 6:48 PM, Maurizio Cimadamore wrote:
> Maybe this would be best moved on panama-dev?

Here we are.

> In any case, to obtain the best performance, it is best to use an
> indexed (or strided) var handle - your loop will create a new memory
> address on each new iteration, which will not be a problem once
> MemoryAddress will be an inline type, but in the meantime...

OK. It's rather unfortunate that your example at the very top of the
API description uses an anti-pattern, at least from a performance
point of view. Might I suggest that it should be changed to the best
way to do the job?
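
To make the distinction concrete, here is a minimal sketch of the
indexed-access pattern. It deliberately uses the stable
MethodHandles.byteBufferViewVarHandle over a ByteBuffer as a stand-in,
not the incubating memory access API, and the class and method names
are made up for illustration. The only point is that one handle is
created up front and the loop passes an offset, instead of building a
new address object on every iteration.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical example class; a stand-in for the pattern, not the Panama API.
public class IndexedHandleSketch {
    // Created once and held in a static final field so the JIT can treat it as a constant.
    static final VarHandle INT_AT =
            MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

    // Writes the values 0..elems-1 as ints into a fresh direct buffer.
    static ByteBuffer fill(int elems) {
        ByteBuffer bb = ByteBuffer.allocateDirect(elems * Integer.BYTES)
                                  .order(ByteOrder.nativeOrder());
        for (int i = 0; i < elems; i++) {
            // The second coordinate is a byte offset, so scale the element index.
            INT_AT.set(bb, i * Integer.BYTES, i);
        }
        return bb;
    }

    public static void main(String[] args) {
        ByteBuffer bb = fill(8);
        System.out.println(bb.getInt(7 * Integer.BYTES));
    }
}
```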

> We have some benchmarks here:
>
> http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign
>
> Your test seems similar to this:
>
> http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign/LoopOverNew.java
>
> In the panama repo this benchmark obtains the same numbers as ByteBuffer,
> and the same loop unrolling (but the panama repo has one performance
> optimization that JDK 14 doesn't yet have, to work around the lack of
> optimization with longs used in loops).

I'm using the panama-dev repo, checked out yesterday. Your own
benchmark shows this for memory segments (unrolled twice):

 22.84%  ││  0x00007f628b772159:   mov    rdi,rdx
  0.03%  ││  0x00007f628b77215c:   add    rdi,0x4
  0.02%  ││  0x00007f628b772160:   mov    r9d,r10d
         ││  0x00007f628b772163:   inc    r9d
         ││  0x00007f628b772166:   cmp    rdi,rsi
         ││  0x00007f628b772169:   jg     0x00007f628b77228b
         ││  0x00007f628b77216f:   mov    DWORD PTR [rbx+0x4],r9d
  5.56%  ││  0x00007f628b772173:   mov    rdi,rdx
         ││  0x00007f628b772176:   add    rdi,0x8
  0.53%  ││  0x00007f628b77217a:   mov    r9d,r10d
  0.01%  ││  0x00007f628b77217d:   add    r9d,0x2
  0.01%  ││  0x00007f628b772181:   cmp    rdi,rsi
  0.01%  ││  0x00007f628b772184:   jg     0x00007f628b772286
         ││  0x00007f628b77218a:   mov    DWORD PTR [rbx+0x8],r9d

And this for ByteBuffers:

 15.62%  ││  0x00007fca53f70bca:   mov    r9d,ebx
  0.02%  ││  0x00007fca53f70bcd:   add    r9d,0x2
  0.02%  ││  0x00007fca53f70bd1:   mov    DWORD PTR [rcx+0x8],r9d
  3.23%  ││  0x00007fca53f70bd5:   mov    r9d,ebx
         ││  0x00007fca53f70bd8:   add    r9d,0x3
         ││  0x00007fca53f70bdc:   mov    DWORD PTR [rcx+0xc],r9d

The bounds checks in the memory segment version are well predicted as
you'd expect, but this isn't good code.

If you change the store to a constant rather than the loop index, the
memory segment version looks much the same as above, but the
ByteBuffer version is unrolled and vectorized:

  0.50%  ││  0x00007f5d9ff7262f:   vmovdqu YMMWORD PTR [rbp+0x0],ymm0
  2.36%  ││  0x00007f5d9ff72634:   vmovdqu YMMWORD PTR [rbp+0x20],ymm0
  ...

Benchmark                      Mode  Cnt  Score   Error  Units
LoopOverNew.buffer_loop        avgt   10  0.286 ± 0.008  ms/op
LoopOverNew.segment_loop       avgt   10  0.513 ± 0.029  ms/op
LoopOverNew.unsafe_loop        avgt   10  0.348 ± 0.010  ms/op

I don't know why the Unsafe version fails to vectorize, but it's still
better than the memory segment version.
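
For anyone who wants to see the two store variants side by side, the
ByteBuffer loops have roughly this shape. This is a sketch and not the
LoopOverNew source: the class name, method names, and the constant 42
are my own placeholders, and there is no JMH harness here, so it shows
the loop bodies only and measures nothing.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical example class; mirrors the two loop bodies discussed, not the benchmark itself.
public class StoreVariantsSketch {
    // Variant 1: the stored value depends on the loop index.
    static void storeIndex(ByteBuffer bb, int elems) {
        for (int i = 0; i < elems; i++) {
            bb.putInt(i * Integer.BYTES, i);
        }
    }

    // Variant 2: the stored value is loop-invariant (42 is an arbitrary placeholder).
    static void storeConstant(ByteBuffer bb, int elems) {
        for (int i = 0; i < elems; i++) {
            bb.putInt(i * Integer.BYTES, 42);
        }
    }

    public static void main(String[] args) {
        int elems = 1_000_000; // assumed size, chosen only for the demo
        ByteBuffer bb = ByteBuffer.allocateDirect(elems * Integer.BYTES)
                                  .order(ByteOrder.nativeOrder());
        storeIndex(bb, elems);
        storeConstant(bb, elems);
        System.out.println(bb.getInt((elems - 1) * Integer.BYTES));
    }
}
```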

> This has been rectified with an implementation change which allows
> us to use ints instead of longs in bound checks, when the API can
> prove that the segment is small - that work is described in this
> thread:
> https://mail.openjdk.java.net/pipermail/panama-dev/2020-January/007081.html

I guess that change still isn't committed to the Panama repo? But
anyway, the loops are all ints, with a constant bound.
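
To be clear about what I mean by an int loop with a long check, the
pattern is roughly the following. The helpers are hypothetical and are
not the code in the Panama repo; they only illustrate an int induction
variable whose offset is widened to long before being compared against
the limit, next to the int-only equivalent.

```java
import java.util.Objects;

// Hypothetical helpers; an illustration of the check shape, not the actual implementation.
public class BoundsCheckSketch {
    // Long-based check: valid when [offset, offset + length) lies within [0, size).
    static long checkLong(long offset, long length, long size) {
        if (offset < 0 || length < 0 || offset > size - length) {
            throw new IndexOutOfBoundsException("offset " + offset + ", size " + size);
        }
        return offset;
    }

    // Int-based equivalent, using the standard library helper.
    static int checkInt(int offset, int length, int size) {
        return Objects.checkFromIndexSize(offset, length, size);
    }

    // Int loop with a fixed bound; each iteration widens the offset to long for the check.
    static int storesThatPass(int elems, long sizeBytes) {
        int passed = 0;
        for (int i = 0; i < elems; i++) {
            checkLong((long) i * Integer.BYTES, Integer.BYTES, sizeBytes);
            passed++;
        }
        return passed;
    }

    public static void main(String[] args) {
        System.out.println(storesThatPass(4, 16));
    }
}
```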

> And the corresponding, longer term C2 fix is captured here:
>
> https://bugs.openjdk.java.net/browse/JDK-8223051
>
> That said, even w/o that performance fix, I wouldn't expect the memory
> access API to be 4x slower. I'd start by dropping the acquire() [which
> you probably don't need and it's doing a CAS], and moving to indexed var
> handle (by replicating the benchmark code linked above) and see if that
> works better.

-- 
Andrew Haley  (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671