[vector api] Massive benchmark result and some thoughts (and questions!) about Vector API

Lev Serebryakov lev at serebryakov.spb.ru
Mon May 13 15:48:30 UTC 2019


 First of all, sorry for the long message, but I hope it will be useful.

 Second, I'm very pleased by the Vector API itself and its current
performance. Great job, thank you!

 Now to details.

 I've finished a massive benchmarking run of my code written with the
Vector API. It took more than 5 days of pure CPU time. You can find the
code and a description of the operations (what all these names mean) here:

https://github.com/blacklion/panama-benchmarks/tree/master/vector

 And results here:

https://docs.google.com/spreadsheets/d/13obJR6I-1K8IEwrFpIrzvF2XrfEHa2NbOXIGE3VRzas/edit?usp=sharing

 This document contains 5 sheets — two with raw data and three with
re-formatted data, which makes analysis simpler.

 All benchmarks were performed on Windows 10, with a JDK built from the
"foreign+intrinsics" branch, commit 6a27ea0ccb81. The hardware is an
i7-6700K CPU locked at 4.0 GHz (all energy saving and Turbo Boost turned
off, multiplier fixed at '40' for all cores), running one thread (to be
sure to avoid overheating).

 The baseline is simple plain-old-Java code with tight loops (without
manual unrolling).

 All benchmarks process vectors of [almost] 65536 elements, where one
element of the vector is one real or complex number, depending on the
operation. One real number is a `float` and one complex number is a pair
of `float`s, so 65536 elements is either 65536 floats or 131072 floats,
depending on the operation.
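 To make the element counts concrete, here is a small sketch (my
assumption about the storage scheme, not necessarily the benchmark's
actual one): a complex vector of N elements stored as an interleaved
float array of length 2*N — re[0], im[0], re[1], im[1], and so on.

```java
public class ComplexLayout {
    public static void main(String[] args) {
        final int n = 65536;                 // elements per vector
        float[] real = new float[n];         // real vector: n floats
        float[] complex = new float[2 * n];  // complex vector: 2*n floats

        // fill element k with (re, im) = (k, -k), interleaved
        for (int k = 0; k < n; k++) {
            complex[2 * k]     = k;   // real part
            complex[2 * k + 1] = -k;  // imaginary part
        }

        System.out.println(real.length);     // 65536
        System.out.println(complex.length);  // 131072
    }
}
```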

 There are two batches of benchmarks:

 The first run investigates the dependency between speed and batch size.
Each operation processes the 65536 elements not in one call to the
low-level tight loop, but in portions of varying size. This portion size
is called `callSize` in the benchmark, and I benchmarked sizes 3, 4, 7,
8, 15, 128, 1024, and 65536.

 When 65536 is not divisible by `callSize` (3, 7, 15), slightly less
data is processed, of course.
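 A minimal sketch of this batching scheme (names are illustrative, not
the benchmark's actual ones): the tight loop runs over one portion of
`callSize` elements at a time, and the tail that does not fill a whole
portion is dropped, as with callSize 3, 7, or 15.

```java
public class BatchedSum {
    // low-level tight loop over one portion [from, to)
    static float sumRange(float[] a, int from, int to) {
        float s = 0f;
        for (int i = from; i < to; i++) s += a[i];
        return s;
    }

    // drive the tight loop in portions of callSize elements
    static float sumBatched(float[] a, int callSize) {
        float s = 0f;
        int full = (a.length / callSize) * callSize; // elements actually processed
        for (int off = 0; off < full; off += callSize)
            s += sumRange(a, off, off + callSize);
        return s;
    }

    public static void main(String[] args) {
        float[] a = new float[65536];
        java.util.Arrays.fill(a, 1f);
        System.out.println((int) sumBatched(a, 1024)); // 65536
        System.out.println((int) sumBatched(a, 3));    // 65535: tail of 1 dropped
    }
}
```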

 The second run investigates the dependency between speed and the offset
from the start of the Java array. Offsets 0..7 were used, but these
results are not very interesting.

 The best place to start with these results is the sheet "Variable
callSize — Aggregated": it shows the difference between the baseline and
vector implementations for each callSize. The "difference" columns show
"(vector - baseline)/baseline", so "+100%" means x2. It is color-coded.
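 For clarity, the "difference" formula works out like this (numbers here
are made up for illustration, not taken from the spreadsheet):

```java
public class Difference {
    // (vector - baseline) / baseline, as a percentage
    static double differencePercent(double baseline, double vector) {
        return (vector - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // vector twice as fast as baseline -> +100%
        System.out.println(differencePercent(100.0, 200.0)); // 100.0
        // vector 1.5x the baseline -> +50%
        System.out.println(differencePercent(100.0, 150.0)); // 50.0
    }
}
```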

 Here are some of my thoughts and questions:

(1) Don't try to outperform `System.arraycopy()` ;-)

(2) Looks like simple operations are vectorized by C2: `rv_dot_cv` or
`cv_sum`, for example, is not much faster in the vector variant.
`rv_dot_rv` is another example; it was hard to make it faster.

(3) `atan2` and `hypot` in the `Math` package are DEAD SLOW. Look at
`cv_abs` or `cv_args`, which are oh-my-god-incredibly-fast in the Vector
variant.
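 Part of the gap is that `Math.hypot` does extra work to guard against
intermediate overflow and underflow. A sketch of the cheaper alternative
(my assumption about the general technique, not the benchmark's exact
code): for `float` inputs promoted to `double`, the naive
sqrt(re*re + im*im) cannot overflow at all, so the guard is unnecessary.

```java
public class Magnitude {
    // naive magnitude: safe for float inputs because the squares are
    // computed in double, whose range vastly exceeds float's
    static float absNaive(float re, float im) {
        return (float) Math.sqrt((double) re * re + (double) im * im);
    }

    // library version: handles extreme double ranges, but pays for it
    static float absHypot(float re, float im) {
        return (float) Math.hypot(re, im);
    }

    public static void main(String[] args) {
        System.out.println(absNaive(3f, 4f)); // 5.0
        System.out.println(absHypot(3f, 4f)); // 5.0
    }
}
```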

(4) I'm VERY surprised by the (to me) unexpected behavior of horizontal
operations like `addLanes()` and `maxLanes()`. You can see different
variants of horizontal summation here:

https://github.com/blacklion/panama-benchmarks/blob/master/vector/src/jmh/java/vector/specific/RVsum.java

 I was sure that a vector accumulator with only one horizontal summation
at the end of the loop would be fastest, but NO! The fastest variant
uses `addLanes()` in the tight loop! Why is this?
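 A scalar model of the two strategies (plain Java standing in for the
Vector API here, as an illustration only): (a) keep a lane-wise vector
accumulator and reduce it once after the loop, versus (b) reduce every
loaded vector immediately into a scalar, i.e. `addLanes()` in the tight
loop. Both compute the same sum; they differ only in where the
horizontal reduction happens.

```java
public class SumStrategies {
    static final int LANES = 8; // e.g. a 256-bit float species

    // (a) lane-wise accumulator, single horizontal reduction at the end
    static float sumVectorAccum(float[] a) {
        float[] acc = new float[LANES];
        for (int i = 0; i < a.length; i += LANES)
            for (int l = 0; l < LANES; l++)
                acc[l] += a[i + l];                  // lane-wise add
        float s = 0f;
        for (int l = 0; l < LANES; l++) s += acc[l]; // one reduction
        return s;
    }

    // (b) horizontal reduction inside the tight loop
    static float sumReduceEachIteration(float[] a) {
        float s = 0f;
        for (int i = 0; i < a.length; i += LANES) {
            float h = 0f;
            for (int l = 0; l < LANES; l++) h += a[i + l]; // "addLanes()"
            s += h;
        }
        return s;
    }

    public static void main(String[] args) {
        float[] a = new float[65536];
        java.util.Arrays.fill(a, 1f);
        System.out.println(sumVectorAccum(a));         // 65536.0
        System.out.println(sumReduceEachIteration(a)); // 65536.0
    }
}
```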

(5) Looks like the Vector API should communicate better with the C2 loop
unroller. You can see at

 https://github.com/blacklion/panama-benchmarks/blob/master/vector/src/jmh/java/vector/specific/RVdotRV.java

 that the best variant is the heavily-manually-unrolled one
(`saccum_unroll_4_2_fma_add_lanes`), but it looks very, very ugly.

  BTW, this benchmark shows that a vector accumulator with a single
`addLanes()` is again the slowest way to do the job. That is very
surprising to me.
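 The usual explanation for why manual unrolling with several
accumulators wins (a plain-Java sketch of the idea, not the benchmark's
actual vector code): independent accumulators break the loop-carried
dependency chain, so the latency of each add no longer serializes every
iteration.

```java
public class DotUnrolled {
    // one accumulator: every iteration waits on the previous add
    static float dotSimple(float[] x, float[] y) {
        float s = 0f;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }

    // four independent accumulators (length assumed divisible by 4 here):
    // the four adds per iteration can proceed in parallel
    static float dotUnrolled4(float[] x, float[] y) {
        float s0 = 0f, s1 = 0f, s2 = 0f, s3 = 0f;
        for (int i = 0; i < x.length; i += 4) {
            s0 += x[i]     * y[i];
            s1 += x[i + 1] * y[i + 1];
            s2 += x[i + 2] * y[i + 2];
            s3 += x[i + 3] * y[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }

    public static void main(String[] args) {
        float[] x = new float[65536];
        float[] y = new float[65536];
        java.util.Arrays.fill(x, 2f);
        java.util.Arrays.fill(y, 0.5f);
        System.out.println(dotSimple(x, y));    // 65536.0
        System.out.println(dotUnrolled4(x, y)); // 65536.0
    }
}
```

 Note that with floats the two variants can differ slightly in rounding
for general data, since they add terms in a different order.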

-- 
// Black Lion AKA Lev Serebryakov



More information about the panama-dev mailing list