[vector] Benchmarking Vector API: some observations
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Fri Aug 3 22:44:29 UTC 2018
Hi,
Recently I spent some time looking into the Vector API benchmarks developed
by Richard Startin [1] and wanted to share some observations I made.
I focused on FloatMatrixMatrixMultiplication [2] because there's a C++
counterpart available [3] to compare against. It also illustrates a case
where manual unrolling makes sense.
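For reference, here is a scalar sketch of the kernel under test (my own
simplification, not the benchmark code): the inner j-loop is what the Vector
API version turns into 8-lane FMAs, manually unrolled 8-way.

```java
public class ScalarMmul {
    // result[i][j] += left[i][k] * right[k][j] over n x n row-major arrays.
    static void mmul(float[] left, float[] right, float[] result, int n) {
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                float multiplier = left[i * n + k]; // broadcast in the vector version
                for (int j = 0; j < n; j++) {       // vectorized + manually unrolled
                    result[i * n + j] += multiplier * right[k * n + j];
                }
            }
        }
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4}, b = {5, 6, 7, 8}, c = new float[4];
        mmul(a, b, c, 2);
        System.out.println(java.util.Arrays.toString(c)); // [19.0, 22.0, 43.0, 50.0]
    }
}
```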
Though the Vector API version performs much better than the scalar one, it
still lags significantly behind the C++ equivalent:
$ java ... -Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0 ...
64 58736.561 ± 825.324 ops/s
128 7364.705 ± 85.297 ops/s
256 889.648 ± 8.369 ops/s
512 93.170 ± 3.542 ops/s
1024 8.178 ± 0.596 ops/s
vs
$ ./mmul
size  throughput (ops/s)  flops/cycle
64 159166 32
128 20545 33
256 2359 30
512 249 26
1024 16 14
Looking at the assembly, the C++ variant has very dense code in its inner loop:
LBB5_6:
vbroadcastss (%r11), %ymm8
vfmadd231ps -224(%r9), %ymm8, %ymm7
vfmadd231ps -192(%r9), %ymm8, %ymm6
vfmadd231ps -160(%r9), %ymm8, %ymm5
vfmadd231ps -128(%r9), %ymm8, %ymm4
vfmadd231ps -96(%r9), %ymm8, %ymm3
vfmadd231ps -64(%r9), %ymm8, %ymm2
vfmadd231ps -32(%r9), %ymm8, %ymm1
vfmadd231ps (%r9), %ymm8, %ymm0
cmpq %r10, %rdi
jge LBB5_8
addq %rbx, %r9
addq $4, %r11
cmpq %rax, %rdi
leaq 1(%rdi), %rdi
jl LBB5_6
The Vector API version is less efficient [4], mainly due to the following
problems:
(1) In
YMM_FLOAT.fromArray(right, k * n + j + 24).fma(multiplier, sum4);
the constant offset isn't folded into the memory access instruction:
mov %r10d,%ebx
add $0x30,%ebx
movslq %ebx,%rbx
vmovdqu 0x10(%rcx,%rbx,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm14
The problem is caused by the int -> long conversion that happens inside
fromArray(), so C2 can't fold the two constants:
(((long) ix) << ARRAY_SHIFT) + Unsafe.ARRAY_FLOAT_BASE_OFFSET
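The effect can be reproduced with plain long arithmetic: once the int sum
(ix + 24) is widened first, the +24 is trapped under the cast, even though
the two address forms are numerically equal. This is a sketch with a
hypothetical index and the typical values ARRAY_SHIFT = 2 and a base
offset of 16:

```java
public class OffsetFolding {
    public static void main(String[] args) {
        int ix = 1000; // hypothetical in-range element index
        // What C2 sees: the int addition happens before the widening cast,
        // so the "+ 24" elements can't be merged with the base offset constant.
        long addr1 = (((long) (ix + 24)) << 2) + 16; // 16 ~ ARRAY_FLOAT_BASE_OFFSET
        // What would fold: widen first, then add; (24 << 2) and 16 are both
        // constants and combine into a single displacement of 112.
        long addr2 = (((long) ix) << 2) + (24L << 2) + 16;
        System.out.println(addr1 == addr2); // true (absent int overflow)
    }
}
```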
(2) YMM_FLOAT.broadcast(left[i * n + k]) uses Float.floatToIntBits(),
which canonicalizes NaNs.
I hacked a workaround:
http://cr.openjdk.java.net/~vlivanov/panama/vector/mmul/
Also, to simplify the comparison at the machine code level, I:
* introduced vbroadcastss and a memory-addressing variant of
vfmadd231ps (so memory accesses are folded into the instruction);
* switched YMM_FLOAT.broadcast from Float.floatToIntBits() to
Float.floatToRawIntBits() to avoid the additional branch for NaN
canonicalization:
vmovss 0x10(%rdx,%r10,4),%xmm5
vucomiss %xmm5,%xmm5
jp L1
jne L1
...
L1:
vmovaps %xmm7,%xmm5
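The branch guards the semantic difference between the two conversions,
which only shows up for non-canonical NaN bit patterns. A minimal
illustration in plain Java (the payload 0x7fc00001 is an arbitrary example):

```java
public class NanBits {
    public static void main(String[] args) {
        // A NaN with a non-canonical payload (exponent all ones, mantissa != 0).
        float nan = Float.intBitsToFloat(0x7fc00001);
        // floatToIntBits() collapses every NaN to the canonical 0x7fc00000...
        int canonical = Float.floatToIntBits(nan);
        // ...while floatToRawIntBits() preserves the payload, so no NaN check
        // (and no branch) is needed in the generated code.
        int raw = Float.floatToRawIntBits(nan);
        System.out.println(Integer.toHexString(canonical)); // 7fc00000
        System.out.println(Integer.toHexString(raw));       // 7fc00001
    }
}
```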
With the aforementioned changes the numbers are much better (up to 2.5x),
especially on small inputs:
$ java ... -XX:+UseNewCode -XX:+UseNewCode2
-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=0 ...
64 101072.924 ± 6440.093 ops/s
128 13284.371 ± 879.124 ops/s
256 1378.167 ± 42.474 ops/s
512 130.317 ± 4.484 ops/s
1024 10.613 ± 0.230 ops/s
The results are quite noisy on small inputs due to data misalignment
(size=64, 100 iterations):
(min, avg, max) = (89966.246, 103201.144, 122003.245), stdev = 8804.719
I tried to fix the alignment by switching to a DirectByteBuffer, but that
requires more work to work around the offset computations, and my hack
didn't work anymore.
Also, I ran the benchmark with range checks enabled: they are commoned
inside the loop body, but aren't hoisted out of the inner loop. That adds
5-10% overhead:
$ java ... -XX:+UseNewCode -XX:+UseNewCode2
-Djdk.incubator.vector.VECTOR_ACCESS_OOB_CHECK=2 ...
64 94994.871 ± 6177.716 ops/s
128 12196.995 ± 830.416 ops/s
256 1466.335 ± 42.757 ops/s
512 133.251 ± 4.523 ops/s
1024 10.168 ± 0.286 ops/s
That's it for today :-) There's definitely more work needed to match the
performance of the C++ version, but the gap is much narrower now.
To summarize:
(1) Offset computations (both on-heap & off-heap) need some
attention. They shouldn't be as problematic when the JIT unrolls the loop
itself, but they are a real problem for manually unrolled code.
(2) Data alignment matters, but there's a standard way to enforce it
only for DirectByteBuffers.
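The standard mechanism for (2), sketched here for the 32-byte YMM width
(the buffer size is my own example, not from the benchmark):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class AlignedBuffer {
    public static void main(String[] args) {
        int floats = 1024;
        // Over-allocate by (alignment - 1) bytes, then slice at the next
        // 32-byte boundary; unit sizes greater than 8 require a direct buffer.
        ByteBuffer raw = ByteBuffer.allocateDirect(floats * Float.BYTES + 31);
        ByteBuffer aligned = raw.alignedSlice(32).order(ByteOrder.nativeOrder());
        System.out.println(aligned.alignmentOffset(0, 32)); // 0
    }
}
```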
Best regards,
Vladimir Ivanov
[1] https://github.com/richardstartin/vectorbenchmarks
[2] https://github.com/richardstartin/vectorbenchmarks/blob/master/src/main/java/com/openkappa/panama/vectorbenchmarks/FloatMatrixMatrixMultiplication.java#L45
[3] https://github.com/richardstartin/cppavxbenchmarks
[4]
L1:
vmovaps %xmm7,%xmm5
L2:
vpshufd $0x0,%xmm5,%xmm6
vinsertf128 $0x1,%xmm6,%ymm6,%ymm6
mov %edi,%r10d
imul 0x138(%rsp),%r10d
add %r9d,%r10d
mov %r10d,%ebx
add $0x38,%ebx
movslq %ebx,%rbx
vmovdqu 0x10(%rcx,%rbx,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm3
mov %r10d,%ebx
add $0x30,%ebx
movslq %ebx,%rbx
vmovdqu 0x10(%rcx,%rbx,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm14
mov %r10d,%ebx
add $0x28,%ebx
movslq %ebx,%rbx
vmovdqu 0x10(%rcx,%rbx,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm13
mov %r10d,%ebx
add $0x20,%ebx
movslq %ebx,%rbx
vmovdqu 0x10(%rcx,%rbx,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm12
mov %r10d,%ebx
add $0x18,%ebx
movslq %ebx,%rbx
vmovdqu 0x10(%rcx,%rbx,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm11
mov %r10d,%ebx
add $0x10,%ebx
movslq %ebx,%rbx
vmovdqu 0x10(%rcx,%rbx,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm10
mov %r10d,%ebx
add $0x8,%ebx
movslq %r10d,%r10
vmovdqu 0x10(%rcx,%r10,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm8
movslq %ebx,%r10
vmovdqu 0x10(%rcx,%r10,4),%ymm5
vfmadd231ps %ymm6,%ymm5,%ymm9
inc %edi
cmp %esi,%edi
jge 0x000000010fc79c52
mov %edi,%r10d
add %r11d,%r10d
vmovss 0x10(%rdx,%r10,4),%xmm5
vucomiss %xmm5,%xmm5
jp L1
jne L1
jmpq L2