[vectorIntrinsics] Feedback on Vector for Image Processing
Paul Sandoz
paul.sandoz at oracle.com
Wed Mar 24 15:41:54 UTC 2021
[adding back panama-dev that got dropped]
Hi Peter,
Excellent, glad you saw some performance improvements.
A baseline against C++ would be useful. The kind of specialized matrix multiply implementations I was referring to are implemented in BLIS [0] and, I suspect, in Intel’s MKL library. A nice overview connecting this to compilers is [1]. These algorithms are hard to beat, but I think it is possible to write the specialized kernels using the Vector API [2] and then use them within the larger loops that optimize data movement. IMHO, if one can get within 80% of such native implementations, that’s a win.
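To illustrate the shape of this (all names and blocking parameters here are illustrative, not from BLIS or the JDK; real implementations also pack A and B into cache-friendly buffers and block over the k dimension), the outer loops around such a micro-kernel look roughly like:

```java
// Sketch only: a scalar stand-in for the vector micro-kernel, wrapped in the
// tiling loops that visit C one MR x NR tile at a time.
public class BlockedGemmSketch {
    static final int MR = 6;   // rows of C computed per micro-kernel call
    static final int NR = 16;  // columns of C computed per micro-kernel call

    // C[m x n] += A[m x k] * B[k x n], all row-major.
    public static void gemm(int m, int n, int k, float[] a, float[] b, float[] c) {
        for (int i = 0; i < m; i += MR) {
            int mr = Math.min(MR, m - i);
            for (int j = 0; j < n; j += NR) {
                int nr = Math.min(NR, n - j);
                microKernel(mr, nr, k, a, i * k, b, j, c, i * n + j, n);
            }
        }
    }

    // Scalar stand-in for the vectorized kernel: one mr x nr tile of C.
    static void microKernel(int mr, int nr, int k, float[] a, int aOff,
                            float[] b, int bOff, float[] c, int cOff, int n) {
        for (int p = 0; p < k; p++) {
            for (int i = 0; i < mr; i++) {
                float aip = a[aOff + i * k + p];
                for (int j = 0; j < nr; j++) {
                    c[cOff + i * n + j] += aip * b[bOff + p * n + j];
                }
            }
        }
    }
}
```

The point is that the inner kernel sees only a fixed-size tile, so it can keep the whole tile in vector registers; the outer loops decide which tile is visited when.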
Paul.
[0] https://github.com/flame/blis
[1] https://arxiv.org/abs/2003.00532
[2] Here is one example of such a kernel using the Vector API. The code produced by HotSpot is rather good. I have not had time (and am unlikely to have the time) to wrap it in the higher-level loops.
// Assumes: import jdk.incubator.vector.FloatVector; and a species field such as
//   static final VectorSpecies<Float> SPECIES_256 = FloatVector.SPECIES_256;
public static void sgemm_haswell_asm_6x16_vector_api(
        int k_iter,
        float[] alpha,
        float[] a,
        int a_offset,
        float[] b,
        int b_offset,
        float[] beta,
        float[] c,
        int c_offset,
        int rs_c,
        int cs_c) {
    // For now, we ONLY accept row-major layout:
    if (cs_c != 1) {
        throw new IllegalArgumentException(
                "sgemm_haswell_asm_6x16_vector_api only supports row-major storage!");
    }
    // Accumulators for the 6 x 16 tile of C (left/right 8-lane halves per row):
    FloatVector AxB_0L = FloatVector.zero(SPECIES_256);
    FloatVector AxB_0R = FloatVector.zero(SPECIES_256);
    FloatVector AxB_1L = FloatVector.zero(SPECIES_256);
    FloatVector AxB_1R = FloatVector.zero(SPECIES_256);
    FloatVector AxB_2L = FloatVector.zero(SPECIES_256);
    FloatVector AxB_2R = FloatVector.zero(SPECIES_256);
    FloatVector AxB_3L = FloatVector.zero(SPECIES_256);
    FloatVector AxB_3R = FloatVector.zero(SPECIES_256);
    FloatVector AxB_4L = FloatVector.zero(SPECIES_256);
    FloatVector AxB_4R = FloatVector.zero(SPECIES_256);
    FloatVector AxB_5L = FloatVector.zero(SPECIES_256);
    FloatVector AxB_5R = FloatVector.zero(SPECIES_256);
    // Cycle through the rows of B; each k step consumes 16 floats of B and 6 of A:
    int b_upper = b_offset + k_iter * 16;
    for (int b_iter = b_offset, a_iter = a_offset; b_iter < b_upper; b_iter += 16, a_iter += 6) {
        FloatVector B_0L = FloatVector.fromArray(SPECIES_256, b, b_iter);
        FloatVector B_0R = FloatVector.fromArray(SPECIES_256, b, b_iter + 8);
        FloatVector vbr_a0 = FloatVector.broadcast(SPECIES_256, a[a_iter + 0]);
        FloatVector vbr_a1 = FloatVector.broadcast(SPECIES_256, a[a_iter + 1]);
        // Row 0:
        AxB_0L = B_0L.fma(vbr_a0, AxB_0L);
        AxB_0R = B_0R.fma(vbr_a0, AxB_0R);
        // Row 1:
        AxB_1L = B_0L.fma(vbr_a1, AxB_1L);
        AxB_1R = B_0R.fma(vbr_a1, AxB_1R);
        vbr_a0 = FloatVector.broadcast(SPECIES_256, a[a_iter + 2]);
        vbr_a1 = FloatVector.broadcast(SPECIES_256, a[a_iter + 3]);
        // Row 2:
        AxB_2L = B_0L.fma(vbr_a0, AxB_2L);
        AxB_2R = B_0R.fma(vbr_a0, AxB_2R);
        // Row 3:
        AxB_3L = B_0L.fma(vbr_a1, AxB_3L);
        AxB_3R = B_0R.fma(vbr_a1, AxB_3R);
        vbr_a0 = FloatVector.broadcast(SPECIES_256, a[a_iter + 4]);
        vbr_a1 = FloatVector.broadcast(SPECIES_256, a[a_iter + 5]);
        // Row 4:
        AxB_4L = B_0L.fma(vbr_a0, AxB_4L);
        AxB_4R = B_0R.fma(vbr_a0, AxB_4R);
        // Row 5:
        AxB_5L = B_0L.fma(vbr_a1, AxB_5L);
        AxB_5R = B_0R.fma(vbr_a1, AxB_5R);
    }
    // Post-loop adjustment: we actually want to compute C = alpha * A * B + beta * C.
    // (1) Multiply AB by alpha:
    float fAlpha = alpha[0];
    FloatVector vbr_alpha = FloatVector.broadcast(SPECIES_256, fAlpha);
    // Unrolled scale by alpha:
    // Row 0:
    AxB_0L = AxB_0L.mul(vbr_alpha);
    AxB_0R = AxB_0R.mul(vbr_alpha);
    // Row 1:
    AxB_1L = AxB_1L.mul(vbr_alpha);
    AxB_1R = AxB_1R.mul(vbr_alpha);
    // Row 2:
    AxB_2L = AxB_2L.mul(vbr_alpha);
    AxB_2R = AxB_2R.mul(vbr_alpha);
    // Row 3:
    AxB_3L = AxB_3L.mul(vbr_alpha);
    AxB_3R = AxB_3R.mul(vbr_alpha);
    // Row 4:
    AxB_4L = AxB_4L.mul(vbr_alpha);
    AxB_4R = AxB_4R.mul(vbr_alpha);
    // Row 5:
    AxB_5L = AxB_5L.mul(vbr_alpha);
    AxB_5R = AxB_5R.mul(vbr_alpha);
    // (2) To each element, add the original C times beta.
    // Note that to conserve registers, we do this in a rolling fashion!
    float fBeta = beta[0];
    if (fBeta != 0.0f) {
        int c_row_iter = c_offset;
        FloatVector vbr_beta = FloatVector.broadcast(SPECIES_256, fBeta);
        // Use temporary registers to load the old C from memory:
        FloatVector vOldCL = FloatVector.fromArray(SPECIES_256, c, c_row_iter);
        FloatVector vOldCR = FloatVector.fromArray(SPECIES_256, c, c_row_iter + 8);
        // Row 0: AB += oldC * beta
        AxB_0L = vOldCL.fma(vbr_beta, AxB_0L);
        AxB_0R = vOldCR.fma(vbr_beta, AxB_0R);
        // Store results to free up registers:
        AxB_0L.intoArray(c, c_row_iter);
        AxB_0R.intoArray(c, c_row_iter + 8);
        c_row_iter += rs_c;
        // Row 1: AB += oldC * beta
        vOldCL = FloatVector.fromArray(SPECIES_256, c, c_row_iter);
        vOldCR = FloatVector.fromArray(SPECIES_256, c, c_row_iter + 8);
        AxB_1L = vOldCL.fma(vbr_beta, AxB_1L);
        AxB_1R = vOldCR.fma(vbr_beta, AxB_1R);
        AxB_1L.intoArray(c, c_row_iter);
        AxB_1R.intoArray(c, c_row_iter + 8);
        c_row_iter += rs_c;
        // Row 2: AB += oldC * beta
        vOldCL = FloatVector.fromArray(SPECIES_256, c, c_row_iter);
        vOldCR = FloatVector.fromArray(SPECIES_256, c, c_row_iter + 8);
        AxB_2L = vOldCL.fma(vbr_beta, AxB_2L);
        AxB_2R = vOldCR.fma(vbr_beta, AxB_2R);
        AxB_2L.intoArray(c, c_row_iter);
        AxB_2R.intoArray(c, c_row_iter + 8);
        c_row_iter += rs_c;
        // Row 3: AB += oldC * beta
        vOldCL = FloatVector.fromArray(SPECIES_256, c, c_row_iter);
        vOldCR = FloatVector.fromArray(SPECIES_256, c, c_row_iter + 8);
        AxB_3L = vOldCL.fma(vbr_beta, AxB_3L);
        AxB_3R = vOldCR.fma(vbr_beta, AxB_3R);
        AxB_3L.intoArray(c, c_row_iter);
        AxB_3R.intoArray(c, c_row_iter + 8);
        c_row_iter += rs_c;
        // Row 4: AB += oldC * beta
        vOldCL = FloatVector.fromArray(SPECIES_256, c, c_row_iter);
        vOldCR = FloatVector.fromArray(SPECIES_256, c, c_row_iter + 8);
        AxB_4L = vOldCL.fma(vbr_beta, AxB_4L);
        AxB_4R = vOldCR.fma(vbr_beta, AxB_4R);
        AxB_4L.intoArray(c, c_row_iter);
        AxB_4R.intoArray(c, c_row_iter + 8);
        c_row_iter += rs_c;
        // Row 5: AB += oldC * beta
        vOldCL = FloatVector.fromArray(SPECIES_256, c, c_row_iter);
        vOldCR = FloatVector.fromArray(SPECIES_256, c, c_row_iter + 8);
        AxB_5L = vOldCL.fma(vbr_beta, AxB_5L);
        AxB_5R = vOldCR.fma(vbr_beta, AxB_5R);
        AxB_5L.intoArray(c, c_row_iter);
        AxB_5R.intoArray(c, c_row_iter + 8);
    } else {
        // beta is zero; just store out AB:
        int c_row_iter = c_offset;
        AxB_0L.intoArray(c, c_row_iter);
        AxB_0R.intoArray(c, c_row_iter + 8);
        c_row_iter += rs_c;
        AxB_1L.intoArray(c, c_row_iter);
        AxB_1R.intoArray(c, c_row_iter + 8);
        c_row_iter += rs_c;
        AxB_2L.intoArray(c, c_row_iter);
        AxB_2R.intoArray(c, c_row_iter + 8);
        c_row_iter += rs_c;
        AxB_3L.intoArray(c, c_row_iter);
        AxB_3R.intoArray(c, c_row_iter + 8);
        c_row_iter += rs_c;
        AxB_4L.intoArray(c, c_row_iter);
        AxB_4R.intoArray(c, c_row_iter + 8);
        c_row_iter += rs_c;
        AxB_5L.intoArray(c, c_row_iter);
        AxB_5R.intoArray(c, c_row_iter + 8);
    }
}
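For what it’s worth, here is a plain scalar version of what I read the kernel above to compute (the helper name is mine, not from BLIS); it is handy as a reference when testing the vector version:

```java
// Scalar reference for the 6x16 micro-kernel above: C = alpha * A * B + beta * C
// for a 6 x 16 tile, where a holds k_iter slices of 6 values (one per output
// row) and b holds k_iter slices of 16 values (one per output column), and C
// is row-major with row stride rs_c.
public class SgemmReference {
    public static void sgemm_6x16_reference(
            int k_iter, float alpha, float[] a, int a_offset,
            float[] b, int b_offset, float beta,
            float[] c, int c_offset, int rs_c) {
        for (int i = 0; i < 6; i++) {
            for (int j = 0; j < 16; j++) {
                float sum = 0f;
                for (int p = 0; p < k_iter; p++) {
                    // a advances 6 values per k step, b advances 16 per k step
                    sum += a[a_offset + p * 6 + i] * b[b_offset + p * 16 + j];
                }
                int idx = c_offset + i * rs_c + j;
                c[idx] = alpha * sum + (beta != 0f ? beta * c[idx] : 0f);
            }
        }
    }
}
```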
> On Mar 23, 2021, at 2:17 PM, Peter A <peter.abeles at gmail.com> wrote:
>
> Hi Paul,
>
> Thanks for the reply! Just updated the readme with latest results and your vectorized binarization is 6.78 times faster than the baseline code. I'm planning on adding a few more functions and will post here when I do.
>
> Would a C++ implementation of the same matrix multiplication algorithm be of interest for setting an upper limit on expected performance? The magical 2.5x speedup I mentioned is about what I see in highly optimized C++ code. It's hard to generalize across hardware and matrix sizes, and I'm not sure how much manual vectorization is involved. For this test I intentionally picked a matrix size that wasn't trivial but not so large that cache optimization dominates. EJML has many different implementations of matrix multiplication for the reasons you cited.
>
> With BoofCV you are close. BoofCV looks at the kernel size and, if it has autogenerated unrolled code for that size, invokes that function. If it doesn't, it uses generalized code similar to what you see here. For this benchmark I disabled threads in BoofCV, but in general if threading is turned on it will be used even when the input is too small for it to pay off. Some parts of the code do have sanity checks on input size before invoking the concurrent code, but not this part.
>
> https://github.com/lessthanoptimal/BoofCV/blob/SNAPSHOT/main/boofcv-ip/src/main/java/boofcv/alg/filter/convolve/noborder/ConvolveImageUnrolled_SB_F32_F32.java
>
> - Peter
>
> On Tue, Mar 23, 2021 at 12:55 PM Paul Sandoz <paul.sandoz at oracle.com> wrote:
> Hi,
>
> I went through your examples in more detail.
>
> As you noted, it is planned to support unsigned comparison operators.
>
> I’ll respond to the observations you made in the README.MD [1], which should also answer the points below.
>
> Small matrices
> —
> I don’t think the problem is allocation (or, more specifically, the vector operations not being compiled into vector registers and hardware instructions); otherwise you would likely see a much larger slowdown than the scalar code.
> Instead, I think there is some additional cost to manage the two loops: the vector loop and the tail. The vector loop probably only has two iterations (for 10 columns), leaving two iterations for the scalar tail loop. That extra bookkeeping likely explains the difference you are observing. FWIW, when I added vectorized mismatch support to the JDK, I added a threshold under which the intrinsic is not called.
> As you point out, it is often the case that a different algorithm is employed for smaller inputs.
> In general, matrix multiplication algorithms can get quite sophisticated in their data movement, to leverage caches effectively, and in their kernels, to maximize the use of vector registers. That requires different approaches for small and large data, which in turn affects the kernels used; this is how they reach close to the theoretical maximum Gflops of a machine. I.e., I think it is necessary to specialize based on input size. Compilers are getting more sophisticated at generating this kind of code (see MLIR), but they still have a way to go, and it would certainly be a challenge for HotSpot.
>
>
> Converting masks to vectors
> —
>
> There is the method VectorMask.toVector, which is implemented as a blend of 0 and -1 (all bits set), so the same technique can be used explicitly as I showed in my prior email.
>
>
> Unrolling
> —
>
> HotSpot can unroll some vector loops (in a similar manner to how it unrolls scalar loops, from which it can then auto-vectorize). But, unfortunately, that is limited and it cannot do so for reductions, IIRC. You need to do that explicitly. One approach is to perform the vector reduction outside the loop and unroll with two or more accumulator vectors. Obviously that only works for larger inputs. It is still explicit, but it can break the data dependency and hide the latencies of certain instructions.
>
> I don’t know enough of the specifics of what BoofCV does to comment in detail, but it appears to specialize for a fixed set of kernel sizes, and then use thread parallelism for larger sizes?
>
> It’s basically an open area of investigation how to unroll better, and what, if any, the API requirements might be to help express that reliably to the compiler.
>
> Paul.
>
> [1] https://github.com/lessthanoptimal/VectorPerformance/blob/master/README.MD
>
>
> > On Mar 22, 2021, at 12:39 PM, Peter A <peter.abeles at gmail.com> wrote:
> >
> > I ported a few already optimized functions related to matrix multiplication
> > and image processing to the Vector API and posted the results here:
> >
> > https://github.com/lessthanoptimal/VectorPerformance
> >
> > Results look fairly good! In most cases performance improved by about
> > 1.7x; in a few cases it got worse. I'll just discuss image processing
> > here since I don't think this use case has come up yet.
> >
> > 1) Support comparison operators for unsigned byte and unsigned short
> > types. Based on comments in the JDK, it looks like this is planned. This
> > is a critical requirement for image processing.
> >
> > 2) Add support for output to the same primitive type as the input array for
> > comparison operators. Right now there's only support for boolean[]. booleans
> > are not ideal for image processing, which is why BoofCV uses byte[] for its
> > binary images.
> >
> > 3) Add a new lower-level API which enables (nearly) allocation-free usage.
> > Forcing memory allocations inside the innermost loop kills
> > performance, even if the code looks more elegant and is the Java way. This
> > is especially true for code optimized for small arrays. You can
> > see this in linear algebra libraries, where all the highly performant ones
> > are basically written like C libraries in their lowest-level functions.
> > Might be best to create a new thread for this comment. Could be an "easy"
> > 30% performance boost.
> >
> > I would also like to point out how much faster the manually unrolled image
> > convolution code was than even the vectorized version.
> >
> > Cheers,
> > - Peter
> >
> > --
> > "Now, now my good man, this is no time for making enemies." — Voltaire
> > (1694-1778), on his deathbed in response to a priest asking that he
> > renounce Satan.
>
>
>
> --
> "Now, now my good man, this is no time for making enemies." — Voltaire (1694-1778), on his deathbed in response to a priest asking that he renounce Satan.