[vectorIntrinsics] Feedback on Vector for Image Processing
Paul Sandoz
paul.sandoz at oracle.com
Tue Mar 23 19:55:19 UTC 2021
Hi,
I went through your examples in more detail.
As you noted, it is planned to support unsigned comparison operators.
I’ll respond to the observations you made in the README.MD [1], which should also answer the points below.
Small matrices
—
I don’t think the problem is allocation (or, more specifically, the vector operations not being compiled into vector registers and hardware instructions); otherwise you would likely see a much larger slowdown than the scalar code.
Instead I think there is some additional cost to manage the two loops, the vector loop and the tail. The vector loop probably only has two iterations (for 10 columns), leaving two iterations for the scalar tail loop. That extra bookkeeping likely explains the difference you are observing. FWIW, when I added vectorized mismatch support in the JDK I added a threshold under which the intrinsic is not called.
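For illustration, here is the typical shape of such a loop (an untested sketch of my own, using the preferred float species); with 10 columns and, say, 4 lanes, the vector loop runs twice and the scalar tail twice, so the per-loop bookkeeping is a large fraction of the total work:

  import jdk.incubator.vector.FloatVector;
  import jdk.incubator.vector.VectorSpecies;

  static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  // Scale a matrix row in place: vector loop over whole lanes,
  // then a scalar tail for the remainder.
  static void scaleRow(float[] row, float alpha) {
      int i = 0;
      int bound = SPECIES.loopBound(row.length); // largest multiple of the lane count
      for (; i < bound; i += SPECIES.length()) { // vector loop
          FloatVector.fromArray(SPECIES, row, i)
                     .mul(alpha)
                     .intoArray(row, i);
      }
      for (; i < row.length; i++) {              // scalar tail
          row[i] *= alpha;
      }
  }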
As you point out, it is often the case that a different algorithm is employed for smaller inputs.
In general, matrix multiplication algorithms can get quite sophisticated in their data movement, to leverage caches effectively, and in their kernels, to maximize the use of vector registers. That requires different approaches for small and large data, which in turn affects the kernels used, and is how such libraries reach close to the theoretical maximum Gflops of a machine; i.e. I think it is necessary to specialize based on input size. Compilers are getting more sophisticated at generating this kind of code (see MLIR) but they still have a way to go, and it would certainly be a challenge for HotSpot.
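To make the specialization point concrete, a toy sketch (untested, and using matrix-vector multiplication for brevity; the threshold of 16 is an arbitrary assumption that would need tuning per machine):

  import jdk.incubator.vector.FloatVector;
  import jdk.incubator.vector.VectorOperators;
  import jdk.incubator.vector.VectorSpecies;

  static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;
  static final int SMALL_THRESHOLD = 16; // assumption: tune per machine

  // y = A * x for a row-major n x n matrix, picking a kernel by size.
  static void mult(float[] a, float[] x, float[] y, int n) {
      if (n < SMALL_THRESHOLD) {
          for (int r = 0; r < n; r++) {          // scalar kernel
              float s = 0;
              for (int c = 0; c < n; c++) s += a[r * n + c] * x[c];
              y[r] = s;
          }
      } else {
          int bound = SPECIES.loopBound(n);
          for (int r = 0; r < n; r++) {          // vector kernel
              var acc = FloatVector.zero(SPECIES);
              int c = 0;
              for (; c < bound; c += SPECIES.length()) {
                  acc = FloatVector.fromArray(SPECIES, a, r * n + c)
                                   .fma(FloatVector.fromArray(SPECIES, x, c), acc);
              }
              float s = acc.reduceLanes(VectorOperators.ADD);
              for (; c < n; c++) s += a[r * n + c] * x[c];
              y[r] = s;
          }
      }
  }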
Converting masks to vectors
—
There is the method VectorMask.toVector, which is implemented as a blend of 0 and -1 (all bits set), so the same technique can be used explicitly, as I showed in my prior email.
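For example (an untested sketch; note the comparison is the signed GT, since unsigned comparisons are not yet supported, and the scalar tail is omitted for brevity):

  import jdk.incubator.vector.ByteVector;
  import jdk.incubator.vector.VectorMask;
  import jdk.incubator.vector.VectorOperators;
  import jdk.incubator.vector.VectorSpecies;

  static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

  // Write 0 or -1 per lane into a byte[] binary image, driven by a mask.
  static void threshold(byte[] input, byte[] output, byte t) {
      for (int i = 0; i < SPECIES.loopBound(input.length); i += SPECIES.length()) {
          ByteVector v = ByteVector.fromArray(SPECIES, input, i);
          VectorMask<Byte> m = v.compare(VectorOperators.GT, t);
          // m.toVector() yields -1 in true lanes, 0 in false lanes;
          // explicitly, the same thing is:
          //   ByteVector.zero(SPECIES)
          //             .blend(ByteVector.broadcast(SPECIES, (byte) -1), m)
          ((ByteVector) m.toVector()).intoArray(output, i);
      }
  }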
Unrolling
—
HotSpot can unroll some vector loops (in a similar manner to how it unrolls scalar loops, from which it can then auto-vectorize). But, unfortunately, that is limited, and IIRC it cannot do so for reductions; you need to do that explicitly. One approach is to perform the cross-lane reduction outside the loop and unroll with two or more accumulator vectors. Obviously that only works for larger inputs. It is still explicit, but it can break the data dependencies and hide the latencies of certain instructions.
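Concretely, something like this (untested sketch; note that reassociating the floating point additions can change the result slightly):

  import jdk.incubator.vector.FloatVector;
  import jdk.incubator.vector.VectorOperators;
  import jdk.incubator.vector.VectorSpecies;

  static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  static float sum(float[] a) {
      var acc0 = FloatVector.zero(SPECIES);
      var acc1 = FloatVector.zero(SPECIES);
      int step = 2 * SPECIES.length();
      int i = 0;
      // Two independent accumulators break the loop-carried dependency
      // on a single vector add; the cross-lane reduction happens once,
      // after the loop.
      for (; i + step <= a.length; i += step) {
          acc0 = acc0.add(FloatVector.fromArray(SPECIES, a, i));
          acc1 = acc1.add(FloatVector.fromArray(SPECIES, a, i + SPECIES.length()));
      }
      float sum = acc0.add(acc1).reduceLanes(VectorOperators.ADD);
      for (; i < a.length; i++) sum += a[i]; // scalar tail
      return sum;
  }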
I don’t know enough of the specifics of BoofCV to comment in detail, but it appears to specialize for a fixed set of kernel sizes, and then use thread parallelism for larger sizes?
It’s basically an open area of investigation how to unroll better, and what, if any, the API requirements might be to help express that reliably to the compiler.
Paul.
[1] https://github.com/lessthanoptimal/VectorPerformance/blob/master/README.MD
> On Mar 22, 2021, at 12:39 PM, Peter A <peter.abeles at gmail.com> wrote:
>
> I ported a few already optimized functions related to matrix multiplication
> and image processing to the Vector API and posted the results here:
>
> https://github.com/lessthanoptimal/VectorPerformance
>
> Results look fairly good! In most cases performance was sped up by about
> 1.7x; in a few cases it did get worse. I'll just discuss image processing
> here since I don't think this use case has come up yet.
>
> 1) Support comparison operators for unsigned byte and unsigned
> short types. Based on comments in the JDK it looks like this is planned. This
> is a critical requirement for image processing.
>
> 2) Add support for output to the same primitive type as the input array for
> comparison operators. Right now there is only support for boolean[]. booleans
> are not ideal for image processing, which is why BoofCV uses byte[] for its
> binary images.
>
> 3) Add a new lower-level API which enables (nearly) allocation-free usage.
> Forcing memory allocations inside the inner loop kills
> performance, even if the code looks more elegant and is the Java way. This
> is especially true for code which is optimized for small arrays. You can
> see this in linear algebra libraries, where all the highly performant ones
> are basically written like C libraries in their lowest-level functions.
> Might be best to create a new thread for this comment. Could be an "easy"
> 30% performance boost.
>
> Would also like to point out how much faster the manually unrolled image
> convolution code was than even the vectorized version.
>
> Cheers,
> - Peter
>
> --
> "Now, now my good man, this is no time for making enemies." — Voltaire
> (1694-1778), on his deathbed in response to a priest asking that he
> renounce Satan.