RFR (14) 8235837: Memory access API refinements
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Thu Jan 16 15:15:21 UTC 2020
On 16/01/2020 14:50, Andrew Haley wrote:
> On 1/15/20 6:48 PM, Maurizio Cimadamore wrote:
>> Maybe this would be best moved on panama-dev?
> Here we are.
>
>> In any case, to obtain the best performance, it is best to use an
>> indexed (or strided) var handle - your loop creates a new memory
>> address on each iteration, which will not be a problem once
>> MemoryAddress is an inline type, but in the meantime...
> OK. It's rather unfortunate that your example at the very top of the
> API description uses an anti-pattern, at least from a performance
> point of view. Might I suggest that it should be changed to the best
> way to do the job?
That's a good suggestion, thanks. The rationale behind the current
example was to provide a simple example so that people could familiarize
themselves with the API - I'm worried that throwing strided access into
the mix in the first example could be a bit steep, but I'll see what I
can do.
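For reference, here is a rough sketch of what the indexed style looks like: one var handle is created up front and reused with an index on every iteration, so no per-iteration address object is needed. Note that this is not the memory access API itself; it uses the standard ByteBuffer view var handle from java.lang.invoke as a stand-in, and IndexedAccessSketch, fill and read are made-up names for illustration only.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class, not part of any JDK API.
final class IndexedAccessSketch {
    // One var handle, created once; its coordinates are (ByteBuffer, int byteOffset).
    static final VarHandle INT_VIEW =
            MethodHandles.byteBufferViewVarHandle(int[].class, ByteOrder.nativeOrder());

    // Writes the values 0..count-1 as consecutive ints into a fresh direct buffer.
    static ByteBuffer fill(int count) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(count * Integer.BYTES)
                .order(ByteOrder.nativeOrder());
        for (int i = 0; i < count; i++) {
            // The element is selected by an offset computed from the loop index;
            // nothing is allocated inside the loop.
            INT_VIEW.set(buffer, i * Integer.BYTES, i);
        }
        return buffer;
    }

    // Reads back the int stored at the given element index.
    static int read(ByteBuffer buffer, int index) {
        return (int) INT_VIEW.get(buffer, index * Integer.BYTES);
    }
}
```

With the memory access API the equivalent would be a strided var handle applied to the segment's base address, as in the LoopOverNew benchmark.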
>
>> We have some benchmarks here:
>>
>> http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign
>>
>> Your test seems similar to this:
>>
>> http://hg.openjdk.java.net/panama/dev/file/5249395528dc/test/micro/org/openjdk/bench/jdk/incubator/foreign/LoopOverNew.java
>>
>> In the panama repo this benchmark obtains the same numbers as
>> ByteBuffer, and the same loop unrolling (but the panama repo has one
>> performance optimization that JDK 14 doesn't yet have, to work around
>> the lack of optimization with longs used in loops).
> I'm using the panama-dev repo, checked out yesterday. Your own
> benchmark shows this for memory segments (unrolled twice):
>
> 22.84% ││ 0x00007f628b772159: mov rdi,rdx
> 0.03% ││ 0x00007f628b77215c: add rdi,0x4
> 0.02% ││ 0x00007f628b772160: mov r9d,r10d
> ││ 0x00007f628b772163: inc r9d
> ││ 0x00007f628b772166: cmp rdi,rsi
> ││ 0x00007f628b772169: jg 0x00007f628b77228b
> ││ 0x00007f628b77216f: mov DWORD PTR [rbx+0x4],r9d
> 5.56% ││ 0x00007f628b772173: mov rdi,rdx
> ││ 0x00007f628b772176: add rdi,0x8
> 0.53% ││ 0x00007f628b77217a: mov r9d,r10d
> 0.01% ││ 0x00007f628b77217d: add r9d,0x2
> 0.01% ││ 0x00007f628b772181: cmp rdi,rsi
> 0.01% ││ 0x00007f628b772184: jg 0x00007f628b772286
> ││ 0x00007f628b77218a: mov DWORD PTR [rbx+0x8],r9d
>
> And this for ByteBuffers:
>
> 15.62% ││ 0x00007fca53f70bca: mov r9d,ebx
> 0.02% ││ 0x00007fca53f70bcd: add r9d,0x2
> 0.02% ││ 0x00007fca53f70bd1: mov DWORD PTR [rcx+0x8],r9d
> 3.23% ││ 0x00007fca53f70bd5: mov r9d,ebx
> ││ 0x00007fca53f70bd8: add r9d,0x3
> ││ 0x00007fca53f70bdc: mov DWORD PTR [rcx+0xc],r9d
>
> The bounds checks in the memory segment version are well predicted as
> you'd expect, but this isn't good code.
>
> If you change the store to a constant rather than the loop index, the
> memory segment version looks much the same as above, but the
> ByteBuffer version is unrolled and vectorized:
>
> 0.50% ││ 0x00007f5d9ff7262f: vmovdqu YMMWORD PTR [rbp+0x0],ymm0
> 2.36% ││ 0x00007f5d9ff72634: vmovdqu YMMWORD PTR [rbp+0x20],ymm0
> ...
>
> Benchmark Mode Cnt Score Error Units
> LoopOverNew.buffer_loop avgt 10 0.286 ± 0.008 ms/op
> LoopOverNew.segment_loop avgt 10 0.513 ± 0.029 ms/op
> LoopOverNew.unsafe_loop avgt 10 0.348 ± 0.010 ms/op
>
> I don't know why the Unsafe version fails to vectorize, but it's still
> better than the memory segment version.
>
>> This has been rectified with an implementation change which allows
>> us to use ints instead of longs in bound checks, when the API can
>> prove that the segment is small - that work is described in this
>> thread:
>> https://mail.openjdk.java.net/pipermail/panama-dev/2020-January/007081.html
> I guess that change still isn't committed to the Panama repo? But
> anyway, the loops are all ints, with a constant bound.
This change has been pushed to the panama repo. The loops might be int,
but internally the memory access machinery uses long operations to sum
offsets together and multiply them, which is causing BCE and other
optimizations to fail (since these optimizations look for IADD, IMUL
opcodes).
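To illustrate the shape of the problem, here is a simplified sketch. It is not the actual implementation, and BoundsCheckSketch, offsetLong and offsetInt are hypothetical names. The first method computes and checks an offset with long arithmetic; the second does the same with int arithmetic, which is only possible when the segment is known to be small enough for everything to fit in an int.

```java
import java.util.Objects;

// Hypothetical illustration class, not the memory access API implementation.
final class BoundsCheckSketch {
    // General path: the multiply and add are long operations (LMUL, LADD).
    static long offsetLong(long index, long stride, long accessSize, long segmentSize) {
        long offset = Math.multiplyExact(index, stride);   // throws ArithmeticException on overflow
        long end = Math.addExact(offset, accessSize);
        if (offset < 0 || end > segmentSize) {
            throw new IndexOutOfBoundsException("offset " + offset + " out of bounds");
        }
        return offset;
    }

    // Small-segment path: same check, but with int operations (IMUL, IADD).
    static int offsetInt(int index, int stride, int accessSize, int segmentSize) {
        int offset = Math.multiplyExact(index, stride);    // throws ArithmeticException on overflow
        // Throws IndexOutOfBoundsException unless 0 <= offset and offset + accessSize <= segmentSize.
        return Objects.checkFromIndexSize(offset, accessSize, segmentSize);
    }
}
```

Both methods accept and reject the same small inputs; the difference that matters here is only which arithmetic opcodes the JIT gets to see.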
The numbers you report above seem consistent with what we have in JDK 14,
and in panama before the fix/workaround I mentioned. Are you sure you
are testing with the latest bits? Could you be on the panama 'default'
branch instead of 'foreign-memaccess'?
Here I get this:
Benchmark Mode Cnt Score Error Units
LoopOverNew.buffer_loop avgt 30 0.623 ± 0.003 ms/op
LoopOverNew.segment_loop avgt 30 0.624 ± 0.005 ms/op
LoopOverNew.unsafe_loop avgt 30 0.400 ± 0.002 ms/op
Unsafe comes out on top, but that's because of memory zeroing - if you
disable it using -Djdk.internal.foreign.skipZeroMemory=true then the
numbers for the memory access API are on par with Unsafe.
Btw, on my machine I see lots of unrolling, but no vectorization, not
even for ByteBuffer.
Maurizio
>
>> And the corresponding, longer term C2 fix is captured here:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8223051
>>
>> That said, even w/o that performance fix, I wouldn't expect the memory
>> access API to be 4x slower. I'd start by dropping the acquire() [which
>> you probably don't need and it's doing a CAS], and moving to indexed var
>> handle (by replicating the benchmark code linked above) and see if that
>> works better.
>