[foreign-memaccess+abi] RFR: Add benchmarks to MemorySegmentVsBits

Tue Jan 3 18:48:09 UTC 2023

On Tue, 3 Jan 2023 17:28:21 GMT, Paul Sandoz <psandoz at openjdk.org> wrote:

> > Looks good. It seems like the memory segment implementation breaks even at size = 16? Off the top of my head I can't explain why the plain var handle (`byteVarHandle`) is so much faster than anything else, but perhaps the benchmark is flawed there?
> 
> I think we need to look at the generated assembler code to get some more clues as to why.
> 
> Paul.

Did some analysis - on my machine I see evident signs of vectorization in the byte var handle benchmark. Vectorization doesn't seem to be there in other cases (and the resulting code is also much bigger).

E.g. memory segment loop:

  ```
  0x00007f9598837d80:   mov    %ebx,%r8d
  0x00007f9598837d83:   movslq %r8d,%r10
  0x00007f9598837d86:   add    (%rsp),%r10
  0x00007f9598837d8a:   mov    0x48(%r9,%r10,8),%r11
  0x00007f9598837d8f:   mov    0x40(%r9,%r10,8),%rcx
  0x00007f9598837d94:   mov    0x38(%r9,%r10,8),%rbx
  0x00007f9598837d99:   mov    0x30(%r9,%r10,8),%rdx
  0x00007f9598837d9e:   mov    0x28(%r9,%r10,8),%rsi
  0x00007f9598837da3:   mov    0x20(%r9,%r10,8),%rax
  0x00007f9598837da8:   mov    0x18(%r9,%r10,8),%rbp
  0x00007f9598837dad:   mov    0x10(%r9,%r10,8),%r13
  0x00007f9598837db2:   lea    0x0(,%r8,8),%r10d
  0x00007f9598837dba:   movslq %r10d,%r10
  0x00007f9598837dbd:   mov    %rdi,%r14
  0x00007f9598837dc0:   add    %r10,%r14
  0x00007f9598837dc3:   mov    %r13,(%r14)
  0x00007f9598837dc6:   mov    %rbp,0x8(%r14)
  0x00007f9598837dca:   mov    %rax,0x10(%r14)
  0x00007f9598837dce:   mov    %rsi,0x18(%r14)
  0x00007f9598837dd2:   mov    %rdx,0x20(%r14)
  0x00007f9598837dd6:   mov    %rbx,0x28(%r14)
  0x00007f9598837dda:   mov    %rcx,0x30(%r14)
  0x00007f9598837dde:   mov    %r11,0x38(%r14)
  0x00007f9598837de2:   lea    0x8(%r8),%ebx
  0x00007f9598837de6:   cmp    0x14(%rsp),%ebx
  0x00007f9598837dea:   jl     0x00007f9598837d80
  ```

  Byte var handle loop:

  ```
  0x00007f19a8837930:   vmovdqu 0x70(%rax,%r8,8),%ymm0
  0x00007f19a8837937:   vmovdqu 0x50(%rax,%r8,8),%ymm1
  0x00007f19a883793e:   vmovdqu 0x10(%rax,%r8,8),%ymm2
  0x00007f19a8837945:   vmovdqu 0x30(%rax,%r8,8),%ymm3
  0x00007f19a883794c:   lea    0x0(,%r8,8),%ebx
  0x00007f19a8837954:   movslq %ebx,%rbx
  0x00007f19a8837957:   vmovdqu %ymm2,0x10(%rsi,%rbx,1)
  0x00007f19a883795d:   vmovdqu %ymm3,0x30(%rsi,%rbx,1)
  0x00007f19a8837963:   vmovdqu %ymm1,0x50(%rsi,%rbx,1)
  0x00007f19a8837969:   vmovdqu %ymm0,0x70(%rsi,%rbx,1)
  0x00007f19a883796f:   add    $0x10,%r8d
  0x00007f19a8837973:   cmp    %edi,%r8d
  0x00007f19a8837976:   jl     0x00007f19a8837930
  ```

  All checks are hoisted out of the hot loop - so in principle the segment code should vectorize as well?
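
  For readers without the benchmark source at hand, here is a minimal sketch of the kind of loop that could produce the byte var handle listing above. It is an assumption based on the addressing modes in the assembly (loads scaled by 8 from one array, stores at a byte offset into another), not the actual benchmark code. The class name `CopyLoopSketch` and the helpers `copyToBytes` and `readBack` are hypothetical; the only real API used is `MethodHandles.byteArrayViewVarHandle`.

  ```java
  import java.lang.invoke.MethodHandles;
  import java.lang.invoke.VarHandle;
  import java.nio.ByteOrder;
  import java.util.Arrays;

  // Hypothetical sketch, not the benchmark from the PR.
  public class CopyLoopSketch {
      // Views a byte[] as a sequence of longs in native byte order.
      static final VarHandle LONG_VIEW =
              MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

      // Writes src[i] into dst at byte offset i * 8 (assumed loop shape).
      static void copyToBytes(long[] src, byte[] dst) {
          for (int i = 0; i < src.length; i++) {
              LONG_VIEW.set(dst, i * Long.BYTES, src[i]);
          }
      }

      // Reads count longs back out of the byte[] through the same view.
      static long[] readBack(byte[] src, int count) {
          long[] out = new long[count];
          for (int i = 0; i < count; i++) {
              out[i] = (long) LONG_VIEW.get(src, i * Long.BYTES);
          }
          return out;
      }

      public static void main(String[] args) {
          long[] src = new long[64];
          for (int i = 0; i < src.length; i++) {
              src[i] = i * 31L;
          }
          byte[] dst = new byte[src.length * Long.BYTES];
          copyToBytes(src, dst);
          // Checks that the copy round-trips through the byte[] view.
          System.out.println(Arrays.equals(src, readBack(dst, src.length)));
      }
  }
  ```

  Reading the two listings against a loop of this shape: the segment version moves 8 longs (64 bytes) per iteration using sixteen 8-byte `mov` instructions, while the var handle version moves 16 longs (128 bytes) per iteration using eight 32-byte `vmovdqu` instructions.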

-------------

PR: https://git.openjdk.org/panama-foreign/pull/762