RFR (14) 8235837: Memory access API refinements

Thu Jan 16 15:15:21 UTC 2020

On 16/01/2020 14:50, Andrew Haley wrote:
> On 1/15/20 6:48 PM, Maurizio Cimadamore wrote:
>> Maybe this would be best moved on panama-dev?
> Here we are.
>
>> In any case, for obtaining best performances, it is best to use an
>> indexed (or strided) var handle - your loop will create a new memory
>> address on each new iteration, which will not be a problem once
>> MemoryAddress will be an inline type, but in the meantime...
> OK. It's rather unfortunate that your example at the very top of the
> API description uses an anti-pattern, at least from a performance
> point of view. Might I suggest that it should be changed to the best
> way to do the job?
That's a good suggestion, thanks. The rationale behind the current
example was to provide a simple example so that people could familiarize
themselves with the API - I'm worried that throwing strided access into
the mix in the first example could be a bit steep, but I'll see what I
can do.
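
For readers without the incubator module at hand, the difference between "one new address object per iteration" and "one handle, indexed in the loop" can be sketched with standard API only. This is a rough analogue, not the memory access API itself: it uses MethodHandles.byteBufferViewVarHandle from java.lang.invoke, and the class and method names (IndexedAccessSketch, fill, sum) are made up for illustration.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; not part of any JDK API.
final class IndexedAccessSketch {
    // Created once and kept in a static final field so the JIT can treat it
    // as a constant. Coordinates are (ByteBuffer, int byteIndex).
    static final VarHandle INT_VIEW =
            MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

    // Stores i into the i-th int slot; only an int offset is computed per
    // iteration, no per-iteration address object is allocated.
    static void fill(ByteBuffer bb, int count) {
        for (int i = 0; i < count; i++) {
            INT_VIEW.set(bb, i * Integer.BYTES, i);
        }
    }

    // Reads the slots back through the same handle and sums them.
    static long sum(ByteBuffer bb, int count) {
        long total = 0;
        for (int i = 0; i < count; i++) {
            total += (int) INT_VIEW.get(bb, i * Integer.BYTES);
        }
        return total;
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocate(10 * Integer.BYTES);
        fill(bb, 10);
        System.out.println(sum(bb, 10));
    }
}
```

The indexed/strided var handles in the memory access API follow the same idea (the handle is obtained once and the loop index is passed as a coordinate), though the exact factory methods differ and are not shown here.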
>
>> We have some benchmarks here:
>>
>> http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign
>>
>> Your test seems similar to this:
>>
>> http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign/LoopOverNew.java
>>
>> In the panama repo this benchmark obtains same numbers as bytebuffer,
>> and same loop unrolling (but the panama repo has one performance
>> optimization that JDK 14 doesn't yet have, to workaround the lack of
>> optimization with longs used in loops).
> I'm using the panama-dev repo, checked out yesterday. Your own
> benchmark shows this for memory segments (unrolled twice):
>
>   22.84%  ││  0x00007f628b772159:   mov    rdi,rdx
>    0.03%  ││  0x00007f628b77215c:   add    rdi,0x4
>    0.02%  ││  0x00007f628b772160:   mov    r9d,r10d
>           ││  0x00007f628b772163:   inc    r9d
>           ││  0x00007f628b772166:   cmp    rdi,rsi
>           ││  0x00007f628b772169:   jg     0x00007f628b77228b
>           ││  0x00007f628b77216f:   mov    DWORD PTR [rbx+0x4],r9d
>    5.56%  ││  0x00007f628b772173:   mov    rdi,rdx
>           ││  0x00007f628b772176:   add    rdi,0x8
>    0.53%  ││  0x00007f628b77217a:   mov    r9d,r10d
>    0.01%  ││  0x00007f628b77217d:   add    r9d,0x2
>    0.01%  ││  0x00007f628b772181:   cmp    rdi,rsi
>    0.01%  ││  0x00007f628b772184:   jg     0x00007f628b772286
>           ││  0x00007f628b77218a:   mov    DWORD PTR [rbx+0x8],r9d
>
> And this for ByteBuffers:
>
>   15.62%  ││  0x00007fca53f70bca:   mov    r9d,ebx
>    0.02%  ││  0x00007fca53f70bcd:   add    r9d,0x2
>    0.02%  ││  0x00007fca53f70bd1:   mov    DWORD PTR [rcx+0x8],r9d
>    3.23%  ││  0x00007fca53f70bd5:   mov    r9d,ebx
>           ││  0x00007fca53f70bd8:   add    r9d,0x3
>           ││  0x00007fca53f70bdc:   mov    DWORD PTR [rcx+0xc],r9d
>
> The bounds checks in the memory segment version are well predicted as
> you'd expect, but this isn't good code.
>
> If you change the store to a constant rather than the loop index, the
> memory segment version looks much the same as above, but the
> ByteBuffer version is unrolled and vectorized:
>
>    0.50%  ││  0x00007f5d9ff7262f:   vmovdqu YMMWORD PTR [rbp+0x0],ymm0
>    2.36%  ││  0x00007f5d9ff72634:   vmovdqu YMMWORD PTR [rbp+0x20],ymm0
>    ...
>
> Benchmark                      Mode  Cnt  Score   Error  Units
> LoopOverNew.buffer_loop        avgt   10  0.286 ± 0.008  ms/op
> LoopOverNew.segment_loop       avgt   10  0.513 ± 0.029  ms/op
> LoopOverNew.unsafe_loop        avgt   10  0.348 ± 0.010  ms/op
>
> I don't know why the Unsafe version fails to vectorize, but it's still
> better than the memory segment version.
>
>> This has been rectified with an implementation change which allows
>> us to use ints instead of longs in bound checks, when the API can
>> prove that the segment is small - that work is described in this
>> thread:
>> https://mail.openjdk.java.net/pipermail/panama-dev/2020-January/007081.html
> I guess that change still isn't committed to the Panama repo? But
> anyway, the loops are all ints, with a constant bound.

This change has been pushed to the panama repo. The loops might be int,
but internally the memory access machinery uses long operations to sum
offsets together and multiply them, which is causing BCE (bounds check
elimination) and other optimizations to fail (since these optimizations
look for IADD, IMUL opcodes).
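
To make the shape of the problem concrete, here is a minimal sketch in plain Java. It is not the actual implementation of the memory access API; the class OffsetArithmeticSketch and its methods are hypothetical, and whether C2 really removes the check in either variant depends on the build. The only point is that the first loop computes and checks its offset with long arithmetic, while the second stays in int arithmetic throughout.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration class; not taken from the JDK sources.
final class OffsetArithmeticSketch {
    static final int ELEM_SIZE = Integer.BYTES;

    // int loop index, but the offset and the explicit bounds check are
    // computed with long multiply/add (the pattern described above).
    static void fillWithLongOffsets(ByteBuffer bb, int count) {
        long limit = bb.capacity();
        for (int i = 0; i < count; i++) {
            long offset = (long) i * ELEM_SIZE;
            if (offset < 0 || offset + ELEM_SIZE > limit) {
                throw new IndexOutOfBoundsException("offset " + offset);
            }
            bb.putInt((int) offset, i);
        }
    }

    // Same loop with the offset kept in int arithmetic; the bounds check is
    // the one ByteBuffer.putInt performs on its int index.
    static void fillWithIntOffsets(ByteBuffer bb, int count) {
        for (int i = 0; i < count; i++) {
            int offset = i * ELEM_SIZE;
            bb.putInt(offset, i);
        }
    }
}
```

Both methods write the same bytes; they differ only in the type of the arithmetic the JIT has to reason about.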

The numbers you report above seem consistent with what we have in JDK 14,
and in panama before the fix/workaround I mentioned. Are you sure you
are testing with the latest bits? Could you be on the panama 'default'
branch instead of 'foreign-memaccess'?

Here I get this:

Benchmark                 Mode  Cnt  Score   Error  Units
LoopOverNew.buffer_loop   avgt   30  0.623 ± 0.003  ms/op
LoopOverNew.segment_loop  avgt   30  0.624 ± 0.005  ms/op
LoopOverNew.unsafe_loop   avgt   30  0.400 ± 0.002  ms/op

Unsafe comes out on top, but that's because of memory zeroing - if you
disable it using -Djdk.internal.foreign.skipZeroMemory=true then the
numbers for the memory access API are on par with Unsafe.

Btw, on my machine I see lots of unrolling, but no vectorization, not 
even for ByteBuffer.

Maurizio

>
>> And the corresponding, longer term C2 fix is captured here:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8223051
>>
>> That said, even w/o that performance fix, I wouldn't expect the memory
>> access API to be 4x slower. I'd start by dropping the acquire() [which
>> you probably don't need and it's doing a CAS], and moving to indexed var
>> handle (by replicating the benchmark code linked above) and see if that
>> works better.
>