[foreign-memaccess+abi] RFR: Add benchmarks to MemorySegmentVsBits
Maurizio Cimadamore
mcimadamore at openjdk.org
Tue Jan 3 18:48:09 UTC 2023
On Tue, 3 Jan 2023 17:28:21 GMT, Paul Sandoz <psandoz at openjdk.org> wrote:
> > Looks good. It seems like the memory segment implementation breaks even at size = 16? Off the top of my head I can't explain why the plain var handle (`byteVarHandle`) is so much faster than anything else, but perhaps the benchmark is flawed there?
>
> I think we need to look at the generated assembler code to get some more clues as to why.
>
> Paul.
Did some analysis - on my machine I see evident signs of vectorization in the byte var handle benchmark. Vectorization doesn't seem to be there in other cases (and the resulting code is also much bigger).
E.g. memory segment loop:
```
0x00007f9598837d80: mov %ebx,%r8d
0x00007f9598837d83: movslq %r8d,%r10
0x00007f9598837d86: add (%rsp),%r10
0x00007f9598837d8a: mov 0x48(%r9,%r10,8),%r11
0x00007f9598837d8f: mov 0x40(%r9,%r10,8),%rcx
0x00007f9598837d94: mov 0x38(%r9,%r10,8),%rbx
0x00007f9598837d99: mov 0x30(%r9,%r10,8),%rdx
0x00007f9598837d9e: mov 0x28(%r9,%r10,8),%rsi
0x00007f9598837da3: mov 0x20(%r9,%r10,8),%rax
0x00007f9598837da8: mov 0x18(%r9,%r10,8),%rbp
0x00007f9598837dad: mov 0x10(%r9,%r10,8),%r13
0x00007f9598837db2: lea 0x0(,%r8,8),%r10d
0x00007f9598837dba: movslq %r10d,%r10
0x00007f9598837dbd: mov %rdi,%r14
0x00007f9598837dc0: add %r10,%r14
0x00007f9598837dc3: mov %r13,(%r14)
0x00007f9598837dc6: mov %rbp,0x8(%r14)
0x00007f9598837dca: mov %rax,0x10(%r14)
0x00007f9598837dce: mov %rsi,0x18(%r14)
0x00007f9598837dd2: mov %rdx,0x20(%r14)
0x00007f9598837dd6: mov %rbx,0x28(%r14)
0x00007f9598837dda: mov %rcx,0x30(%r14)
0x00007f9598837dde: mov %r11,0x38(%r14)
0x00007f9598837de2: lea 0x8(%r8),%ebx
0x00007f9598837de6: cmp 0x14(%rsp),%ebx
0x00007f9598837dea: jl 0x00007f9598837d80
```
Byte var handle loop:
```
0x00007f19a8837930: vmovdqu 0x70(%rax,%r8,8),%ymm0
0x00007f19a8837937: vmovdqu 0x50(%rax,%r8,8),%ymm1
0x00007f19a883793e: vmovdqu 0x10(%rax,%r8,8),%ymm2
0x00007f19a8837945: vmovdqu 0x30(%rax,%r8,8),%ymm3
0x00007f19a883794c: lea 0x0(,%r8,8),%ebx
0x00007f19a8837954: movslq %ebx,%rbx
0x00007f19a8837957: vmovdqu %ymm2,0x10(%rsi,%rbx,1)
0x00007f19a883795d: vmovdqu %ymm3,0x30(%rsi,%rbx,1)
0x00007f19a8837963: vmovdqu %ymm1,0x50(%rsi,%rbx,1)
0x00007f19a8837969: vmovdqu %ymm0,0x70(%rsi,%rbx,1)
0x00007f19a883796f: add $0x10,%r8d
0x00007f19a8837973: cmp %edi,%r8d
0x00007f19a8837976: jl 0x00007f19a8837930
```
All checks are hoisted out of the hot loop - so in principle the segment code should vectorize as well?
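For readers without the benchmark source at hand, here is a minimal sketch of the kind of loop the byte var handle listing appears to come from: each `long` of a `long[]` is stored into a `byte[]` through a byte-array view var handle. The class and method names (`ByteVarHandleCopySketch`, `copy`, `readBack`) are made up for illustration and are not taken from `MemorySegmentVsBits`; the actual benchmark body may differ.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Hypothetical sketch, not the benchmark itself.
public class ByteVarHandleCopySketch {

    // Views a byte[] as longs; the index coordinate is a byte offset, not a long index.
    static final VarHandle LONG_VIEW =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

    // Stores every long of src into dst, 8 bytes per element.
    // dst must hold at least src.length * Long.BYTES bytes.
    static void copy(long[] src, byte[] dst) {
        for (int i = 0; i < src.length; i++) {
            LONG_VIEW.set(dst, i * Long.BYTES, src[i]);
        }
    }

    // Reads back the long stored at the given long index.
    static long readBack(byte[] dst, int longIndex) {
        return (long) LONG_VIEW.get(dst, longIndex * Long.BYTES);
    }

    public static void main(String[] args) {
        long[] src = {1L, 2L, 3L, -1L};
        byte[] dst = new byte[src.length * Long.BYTES];
        copy(src, dst);
        System.out.println(readBack(dst, 3));
    }
}
```

In the listings above, the segment loop advances its counter by 8 per iteration (`lea 0x8(%r8),%ebx`) using scalar 64-bit moves, while the var handle loop advances by 16 (`add $0x10,%r8d`) using four 32-byte `vmovdqu` load/store pairs.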
-------------
PR: https://git.openjdk.org/panama-foreign/pull/762