[vector] Benchmarking Vector API: some observations
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Fri Aug 3 22:44:29 UTC 2018
Hi,
Recently I spent some time looking into the Vector API benchmarks developed
by Richard Startin [1] and wanted to share some observations I made.
I focused on FloatMatrixMatrixMultiplication [2] because there's a C++
counterpart available [3] to compare against. It also illustrates a case
where manual unrolling makes sense.
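For reference, here is a scalar sketch of the kernel under test (my own
simplification, not the benchmark code): the inner j-loop is what the Vector
API version turns into 8-lane FMAs, manually unrolled 8-way.

```java
public class ScalarMmul {
    // result[i][j] += left[i][k] * right[k][j] over n x n row-major arrays.
    static void mmul(float[] left, float[] right, float[] result, int n) {
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                float multiplier = left[i * n + k]; // broadcast in the vector version
                for (int j = 0; j < n; j++) {       // vectorized + manually unrolled
                    result[i * n + j] += multiplier * right[k * n + j];
                }
            }
        }
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4}, b = {5, 6, 7, 8}, c = new float[4];
        mmul(a, b, c, 2);
        System.out.println(java.util.Arrays.toString(c)); // [19.0, 22.0, 43.0, 50.0]
    }
}
```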
Though the Vector API version performs much better than the scalar one, it
still lags significantly behind the C++ equivalent:
$ java ... -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0 ...
64 58736.561 ± 825.324 ops/s
128 7364.705 ± 85.297 ops/s
256 889.648 ± 8.369 ops/s
512 93.170 ± 3.542 ops/s
1024 8.178 ± 0.596 ops/s
vs
$ ./mmul
size  throughput (ops/s)  flops/cycle
64 159166 32
128 20545 33
256 2359 30
512 249 26
1024 16 14
Looking at the assembly, the C++ variant has very dense code in its inner loop:
LBB5_6:
vbroadcastss (%r11), %ymm8
vfmadd231ps -224(%r9), %ymm8, %ymm7
vfmadd231ps -192(%r9), %ymm8, %ymm6
vfmadd231ps -160(%r9), %ymm8, %ymm5
vfmadd231ps -128(%r9), %ymm8, %ymm4
vfmadd231ps -96(%r9), %ymm8, %ymm3
vfmadd231ps -64(%r9), %ymm8, %ymm2
vfmadd231ps -32(%r9), %ymm8, %ymm1
vfmadd231ps (%r9), %ymm8, %ymm0
cmpq %r10, %rdi
jge LBB5_8
addq %rbx, %r9
addq $4, %r11
cmpq %rax, %rdi
leaq 1(%rdi), %rdi
jl LBB5_6
The Vector API version is less efficient [4], mainly due to the following
problems:
(1) In
YMM_FLOAT.fromArray(right, k * n + j + 24).fma(multiplier, sum4);
the constant offset isn't folded into the memory access instruction:
mov %r10d,%ebx
add $0x30,%ebx
movslq %ebx,%rbx
vmovdqu 0x10(%rcx,%rbx,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm14
The problem is caused by the int -> long conversion that happens inside
fromArray(), so C2 can't fold the two constants:
(((long) ix) << ARRAY_SHIFT) + Unsafe.ARRAY_FLOAT_BASE_OFFSET
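The effect can be reproduced with plain long arithmetic: once the int sum
(ix + 24) is widened first, the +24 is trapped under the cast, even though
the two address forms are numerically equal. This is a sketch with a
hypothetical index and the typical values ARRAY_SHIFT = 2 and a base
offset of 16:

```java
public class OffsetFolding {
    public static void main(String[] args) {
        int ix = 1000; // hypothetical in-range element index
        // What C2 sees: the int addition happens before the widening cast,
        // so the "+ 24" elements can't be merged with the base offset constant.
        long addr1 = (((long) (ix + 24)) << 2) + 16; // 16 ~ ARRAY_FLOAT_BASE_OFFSET
        // What would fold: widen first, then add; (24 << 2) and 16 are both
        // constants and combine into a single displacement of 112.
        long addr2 = (((long) ix) << 2) + (24L << 2) + 16;
        System.out.println(addr1 == addr2); // true (absent int overflow)
    }
}
```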
(2) YMM_FLOAT.broadcast(left[i * n + k]) uses Float.floatToIntBits(),
which canonicalizes NaNs.
I hacked a workaround:
http://cr.openjdk.java.net/~vlivanov/panama/vector/mmul/
Also, to simplify the comparison at the machine code level, I:
* introduced vbroadcastss and a memory-addressing variant of
vfmadd231ps (so memory accesses are folded into the instruction);
* switched YMM_FLOAT.broadcast from Float.floatToIntBits() to
Float.floatToRawIntBits() to avoid the additional branch for NaN
canonicalization:
vmovss 0x10(%rdx,%r10,4),%xmm5
vucomiss %xmm5,%xmm5
jp L1
jne L1
...
L1:
vmovaps %xmm7,%xmm5
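The branch guards the semantic difference between the two conversions,
which only shows up for non-canonical NaN bit patterns. A minimal
illustration in plain Java (the payload 0x7fc00001 is an arbitrary example):

```java
public class NanBits {
    public static void main(String[] args) {
        // A NaN with a non-canonical payload (exponent all ones, mantissa != 0).
        float nan = Float.intBitsToFloat(0x7fc00001);
        // floatToIntBits() collapses every NaN to the canonical 0x7fc00000...
        int canonical = Float.floatToIntBits(nan);
        // ...while floatToRawIntBits() preserves the payload, so no NaN check
        // (and no branch) is needed in the generated code.
        int raw = Float.floatToRawIntBits(nan);
        System.out.println(Integer.toHexString(canonical)); // 7fc00000
        System.out.println(Integer.toHexString(raw));       // 7fc00001
    }
}
```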
With the aforementioned changes the numbers are much better (up to 2.5x),
especially on small inputs:
$ java ... -XX:+UseNewCode -XX:+UseNewCode2
-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0 ...
64 101072.924 ± 6440.093 ops/s
128 13284.371 ± 879.124 ops/s
256 1378.167 ± 42.474 ops/s
512 130.317 ± 4.484 ops/s
1024 10.613 ± 0.230 ops/s
The results are quite noisy on small inputs due to data misalignment
(size=64, 100 iterations):
(min, avg, max) = (89966.246, 103201.144, 122003.245), stdev = 8804.719
I tried to fix the alignment by switching to a DirectByteBuffer, but that
requires more work to work around the offset computations, and my hack
didn't work anymore.
Also, I ran the benchmark with range checks enabled: they are commoned
inside the loop body, but aren't hoisted out of the inner loop. That adds
5-10% overhead:
$ java ... -XX:+UseNewCode -XX:+UseNewCode2
-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2 ...
64 94994.871 ± 6177.716 ops/s
128 12196.995 ± 830.416 ops/s
256 1466.335 ± 42.757 ops/s
512 133.251 ± 4.523 ops/s
1024 10.168 ± 0.286 ops/s
That's it for today :-) There's definitely more work needed to match the
performance of the C++ version, but the gap is much narrower now.
To summarize:
(1) Offset computations (both on-heap & off-heap) need some
attention. They shouldn't be as problematic when the JIT unrolls the loop
itself, but they are a real problem for manually unrolled code.
(2) Data alignment matters, but there's a standard way to enforce it
only for DirectByteBuffers.
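The standard mechanism for (2), sketched here for the 32-byte YMM width
(the buffer size is my own example, not from the benchmark):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class AlignedBuffer {
    public static void main(String[] args) {
        int floats = 1024;
        // Over-allocate by (alignment - 1) bytes, then slice at the next
        // 32-byte boundary; unit sizes greater than 8 require a direct buffer.
        ByteBuffer raw = ByteBuffer.allocateDirect(floats * Float.BYTES + 31);
        ByteBuffer aligned = raw.alignedSlice(32).order(ByteOrder.nativeOrder());
        System.out.println(aligned.alignmentOffset(0, 32)); // 0
    }
}
```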
Best regards,
Vladimir Ivanov
[1] https://github.com/richardstartin/vectorbenchmarks
[2] https://github.com/richardstartin/vectorbenchmarks/blob/master/src/main/java/com/openkappa/panama/vectorbenchmarks/FloatMatrixMatrixMultiplication.java#L45
[3] https://github.com/richardstartin/cppavxbenchmarks
[4]
L1:
vmovaps %xmm7,%xmm5
L2:
vpshufd $0x0,%xmm5,%xmm6
vinsertf128 $0x1,%xmm6,%ymm6,%ymm6
mov %edi,%r10d
imul 0x138(%rsp),%r10d
add %r9d,%r10d
mov %r10d,%ebx
add $0x38,%ebx
movslq %ebx,%rbx
vmovdqu 0x10(%rcx,%rbx,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm3
mov %r10d,%ebx
add $0x30,%ebx
movslq %ebx,%rbx
vmovdqu 0x10(%rcx,%rbx,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm14
mov %r10d,%ebx
add $0x28,%ebx
movslq %ebx,%rbx
vmovdqu 0x10(%rcx,%rbx,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm13
mov %r10d,%ebx
add $0x20,%ebx
movslq %ebx,%rbx
vmovdqu 0x10(%rcx,%rbx,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm12
mov %r10d,%ebx
add $0x18,%ebx
movslq %ebx,%rbx
vmovdqu 0x10(%rcx,%rbx,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm11
mov %r10d,%ebx
add $0x10,%ebx
movslq %ebx,%rbx
vmovdqu 0x10(%rcx,%rbx,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm10
mov %r10d,%ebx
add $0x8,%ebx
movslq %r10d,%r10
vmovdqu 0x10(%rcx,%r10,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm8
movslq %ebx,%r10
vmovdqu 0x10(%rcx,%r10,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm9
inc %edi
cmp %esi,%edi
jge 0x000000010fc79c52
mov %edi,%r10d
add %r11d,%r10d
vmovss 0x10(%rdx,%r10,4),%xmm5
vucomiss %xmm5,%xmm5
jp L1
jne L1
jmpq L2