[vector] Some thoughts on alternative vector loop shapes
Monica Beckwith
Monica.Beckwith at arm.com
Thu Aug 30 15:09:48 UTC 2018
Hello Vladimir,
Thank you for starting the mask/predicate discussion for Vector API. This is the direction that also suits Arm SVE.
If we take the predicate route - it can help optimize the pre-loop if we just work the predicate to deal with the unaligned elements. Similarly, we can avoid any checks in the main loop by moving the predicate to the last vector and thus also avoid any "tails". These are very simplified ideas. But the main take-away is that we can do a cleaner implementation if we use the predicate wisely. The vector elements in the predicate can decide what we want to optimize.
By extending the above, we can achieve merging and zeroing as well.
I am including Alejandro, who not only helped us have a great discussion on this during JVMLS, but also who has immensely helped my understanding of SVE. Furthermore, Alejandro is also willing to work with us in developing the predicated/active operations as well as normal vector operations.
Thanks once again.
Monica
On 21/08/2018 21:28, Vladimir Ivanov wrote:
> Hi,
>
> So far, there was not much attention paid to the shape of vector loops.
> The examples either worked on properly sized data or having explicit
> tail loop which doesn't look pretty:
>
> int i = 0;
> for (; i + spec.length() < a.length; i += spec.length()) {
> var va = spec.fromArray(a, i);
> var vb = spec.fromArray(b, i);
> var vc = va.mul(vb);
> vc.intoArray(c, i);
> }
> for (; i < a.length; i++) {
> c[i] = a[i] * b[i];
> }
>
> It may look like a crux introduced by x86-centric considerations (where
> vector sizes are hard-coded in vector instructions), but actually it was
> not.
>
> Comparing to what C2 auto-vectorizer does, starting jdk9, it produces 4
> loops from a single scalar loop:
> (1) pre-loop (scalar)
> (2) main loop (vectorized)
> (3) post-loop (vectorized)
> (4) post-loop (scalar)
>
> And only the last one (#4) is required for correctness, 2 others (#1 and
> #3) are introduced for performance reasons:
> * pre-loop achieves alignment of consequent vector accesses;
> * vectorized post-loop reduces the overhead of scalar version after
> heavily unrolled main loop.
>
> I believe that irrespective of the way instructions are encoded,
> different microarchitectural considerations may justify auxiliary loops
> to improve performance. It makes ability to automatically derive
> customized loop versions an important tool to improve performance
> irrespective of actual hardware.
>
> There have been attempts to come up with an API to describe kernels in
> canonical way and there have been different approaches discussed in the
> past (e.g., Expression Language by Ian Graves and lambda cracking [1] by
> John Rose), but due to different reasons they haven't made their way
> into the Vector API yet.
>
> While waiting for a better solution on language level, is there a better
> way to represent a vector loop and get it well optimized without
> repeating the kernel over and over again?
>
> I'll try to explore an approach based on partial vectors and predicated
> operations. I'll focus on AVX512 and pre-AVX512 on x86, but I believe it
> should generalize to SVE as well.
>
> Let's try to get rid of scalar post-loop and fuse the tail processing
> into last iteration of vector loop using masks:
>
> for (int i = 0; i < a.length; i += spec.length()) {
> Mask<Float,S> m = spec.index(a.length - i);
> var va = spec.fromArray(a, i, m);
> var vb = spec.fromArray(b, i, m);
> var vc = va.mul(vb, m);
> vc.intoArray(c, i, m);
> }
>
> Vector.Species::index(int) computes the set of active elements on each
> iteration.
>
> Mask m has very regular shape: it is always all-1s except on the very
> last iteration when it contains consecutive 1s for the remaining elements.
>
> After loop unrolling, JIT should be able prove that all masks in main
> loop are "all 1s" since (a.length - i) is always larger than vector
> size. Adding an intrinsic for index(...) may help to reliably
> communicate that invariant to JIT:
>
> int i = 0;
> for (; i < a.length - 4*spec.length(); i += 4*spec.length()) {
> var m1 = spec.index(a.length - (i + 0*spec.length()));/ / all 1s
> ...
> var m2 = spec.index(a.length - (i + 1*spec.length())); // all 1s
> ...
> var m3 = spec.index(a.length - (i + 2*spec.length())); // all 1s
> ...
> var m4 = spec.index(a.length - (i + 3*spec.length())); // all 1s
> ...
> }
> for (; i < a.length; i += spec.length()) { // post-loop
> var m = spec.index(a.length - i);
> var va = spec.fromArray(a, i, m);
> var vb = spec.fromArray(b, i, m);
> var vc = va.mul(vb, m);
> vc.intoArray(c, i, m);
> }
>
> ... and all masked operations in main loop can be strength-reduced to
> non-masked ones:
>
> for (; i < a.length - 4*spec.length(); i += 4*spec.length()) {
> int i0 = i + 0*spec.length();
> var va0 = spec.fromArray(a, i0);
> var vb0 = spec.fromArray(b, i0);
> var vc0 = va0.mul(vb0);
> vc0.intoArray(c, i0);
> ...
> int i3 = i + 3*spec.length();
> var va3 = spec.fromArray(a, i3);
> var vb3 = spec.fromArray(b, i3);
> var vc3 = va3.mul(vb3);
> vc3.intoArray(c, i3);
> }
> for (; i < a.length; i += spec.length()) { // post-loop
> var m = spec.index(a.length - i);
> var va = spec.fromArray(a, i, m);
> var vb = spec.fromArray(b, i, m);
> var vc = va.mul(vb, m);
> vc.intoArray(c, i, m);
> }
>
> Similar trick can be applied to get pre-loop automatically generated.
>
> After loop peeling, the first iteration can serve as pre-loop:
>
> int align = align(a); // align array accesses on cache line boundaries
> for (int i = 0; i < a.length; i += (i == 0 ? align : spec.length()) {
> Mask<Float,S> m = spec.index(i == 0 ? align : a.length - i);
> var va = spec.fromArray(a, i, m);
> var vb = spec.fromArray(b, i, m);
> var vc = va.mul(vb, m);
> vc.intoArray(c, i, m);
> }
>
> ... turns into:
>
> Mask<Float,S256Bit> m = spec.index(align);
> var va = spec.fromArray(a, i, m);
> var vb = spec.fromArray(b, i, m);
> var vc = va.mul(vb, m);
> vc.intoArray(c, i, m);
>
> for (int i = align; i < a.length; i += spec.length()) {
> Mask<Float,S> m = spec.index(a.length - i);
> var va = spec.fromArray(a, i, m);
> var vb = spec.fromArray(b, i, m);
> var vc = va.mul(vb, m);
> vc.intoArray(c, i, m);
> }
>
> Such strategy allows to automatically derive 2 auxiliary loops (pre- and
> post-) during JIT compilation. The only thing which is left uncovered is
> scalar post loop. It makes it hard for JIT to generate most optimal code
> for post loop if there's no hardware support of predicated instructions
> (e.g., pre-AVX512 on x86), but main loop (where most of the work should
> happen) becomes predicate-free.
>
>
> The ugly part on API level is that all operations have to be explicitly
> masked. It would be nice to have a way to define a context mask to all
> operations. For example, have vector shapes which encapsulates both
> vector and a mask:
>
> for (int i = align; i < a.length; i += spec.length()) {
> Mask<Float,S> m = spec.index(a.length - i);
> FloatSpecies<S> mspec = spec.masked(m); // set context mask m
> var va = mspec.fromArray(a, i); // fromArray(a, i, m)
> var vb = mspec.fromArray(b, i); // fromArray(b, i, m)
> var vc = va.mul(vb); // va.mul(vb, m)
> vc.intoArray(c, i); // vc.intoArray(c, i, m)
> }
>
> Mask becomes part of vector shapes and participates in type checking, so
> not only type and size should match, but also the mask. Otherwise, an
> exception is thrown.
>
> Best regards,
> Vladimir Ivanov
>
> [1] http://cr.openjdk.java.net/~jrose/panama/token-codes.html
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
More information about the panama-dev
mailing list