[vector] Some thoughts on alternative vector loop shapes

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Mon Aug 27 12:20:33 UTC 2018


I touched on partial/sub vectors, but there's another use case left.

Richard Startin regularly resorts to manual unrolling of vector loops 
during his experiments with the Vector API (e.g., [1] [2] [3]).

It's not because C2 doesn't perform enough unrolling, but because manual 
unrolling breaks dependencies between iterations, which the JIT can't do 
on its own.

For example, Richard reported [2] a 2x improvement (on small input sizes) 
after manually unrolling the loop 4 times:

   var sum = FSPEC.zero();
   for (int i = 0; i < size; i += FSPEC.length()) {
     var l = FSPEC.fromArray(left, i);
     var r = FSPEC.fromArray(right, i);
     sum = l.fma(r, sum);
   }
   return sum.addAll();

vs

   var sum1 = FSPEC.zero();
   ...
   var sum4 = FSPEC.zero();

   int width = FSPEC.length();
   for (int i = 0; i < size; i += width * 4) {
     var l1 = FSPEC.fromArray(left,  i + width * 0);
     var r1 = FSPEC.fromArray(right, i + width * 0);
     sum1 = l1.fma(r1, sum1);
     ...
     var l4 = FSPEC.fromArray(left,  i + width * 3);
     var r4 = FSPEC.fromArray(right, i + width * 3);
     sum4 = l4.fma(r4, sum4);
   }
   return sum1.addAll() + ... + sum4.addAll();

What makes the difference is that on every iteration 4 independent vector 
accumulators are used.

The loop unrolling problem is similar to the loop customization problem 
explored earlier, and the same approaches should cover it in the longer term.

Here I'll try to explore how it can be addressed with the current API 
and introduce a concept of "super-vectors". A super-vector is a vector 
which encapsulates a sequence of vectors.

It can represent vectors of a larger size than the hardware supports:

   class Int1024Vector {
     final Int512Vector lo;
     final Int512Vector hi;
   }

or provide an alternative (more optimal) representation of a vector 
which perfectly fits into a vector register:

   class Int256Vector {
     final Int128Vector lo;
     final Int128Vector hi;
   }

We considered that technique to improve on the graceful performance 
degradation goal: if a user explicitly asks for a 512-bit vector on 
hardware which doesn't have proper support, under the hood it can be 
represented by a pair of 256-bit vectors (or a quad of 128-bit vectors).

It seems such a strategy nicely generalizes to cover the cases where manual 
unrolling is beneficial right now.

Assume that S1024F represents a shape of 4 256-bit float vectors. Then the 
original version of the dot product on 1024-bit vectors:

   var sum = S1024F.zero();
   for (int i = 0; i < size; i += S1024F.length()) {
     var l = S1024F.fromArray(left, i);
     var r = S1024F.fromArray(right, i);
     sum = l.fma(r, sum);
   }
   return sum.addAll();

under the hood is represented as a loop on 256-bit vectors unrolled 4 times:

S1024F.length() == 4 * S256F.length()

sum = S1024F.zero();           ==> sum_0 = S256F.zero();
                                   sum_1 = S256F.zero();
                                   sum_2 = S256F.zero();
                                   sum_3 = S256F.zero();

l = S1024F.fromArray(left, i)  ==> W = S256F.length()
                                   l_0 = S256F.fromArray(left,  i + 0*W)
                                   l_1 = S256F.fromArray(left,  i + 1*W)
                                   l_2 = S256F.fromArray(left,  i + 2*W)
                                   l_3 = S256F.fromArray(left,  i + 3*W)

r = S1024F.fromArray(right, i) ==> r_0 = S256F.fromArray(right, i + 0*W)
                                   r_1 = S256F.fromArray(right, i + 1*W)
                                   r_2 = S256F.fromArray(right, i + 2*W)
                                   r_3 = S256F.fromArray(right, i + 3*W)

sum = l.fma(r, sum)            ==> sum_0 = l_0.fma(r_0, sum_0)
                                   sum_1 = l_1.fma(r_1, sum_1)
                                   sum_2 = l_2.fma(r_2, sum_2)
                                   sum_3 = l_3.fma(r_3, sum_3)

sum.addAll()                   ==> sum_0.addAll() + ... + sum_3.addAll()
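
To make the decomposition concrete, here is a sketch of how such a 
super-vector class could delegate to its constituent vectors. The class 
layout and the Float256Vector/S256F names are assumptions written in the 
style of the Int1024Vector example above, not actual API:

   class Float1024Vector {
     final Float256Vector v0, v1, v2, v3;  // four 256-bit lane groups

     Float1024Vector(Float256Vector v0, Float256Vector v1,
                     Float256Vector v2, Float256Vector v3) {
       this.v0 = v0; this.v1 = v1; this.v2 = v2; this.v3 = v3;
     }

     // A load maps to 4 consecutive 256-bit loads.
     static Float1024Vector fromArray(float[] a, int i) {
       int w = S256F.length();
       return new Float1024Vector(S256F.fromArray(a, i + 0*w),
                                  S256F.fromArray(a, i + 1*w),
                                  S256F.fromArray(a, i + 2*w),
                                  S256F.fromArray(a, i + 3*w));
     }

     // A lane-wise op maps to 4 independent 256-bit ops, which is exactly
     // what breaks the dependency chain between iterations.
     Float1024Vector fma(Float1024Vector r, Float1024Vector s) {
       return new Float1024Vector(v0.fma(r.v0, s.v0), v1.fma(r.v1, s.v1),
                                  v2.fma(r.v2, s.v2), v3.fma(r.v3, s.v3));
     }

     // The reduction combines the 4 partial sums only at the very end.
     float addAll() {
       return v0.addAll() + v1.addAll() + v2.addAll() + v3.addAll();
     }
   }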


Best regards,
Vladimir Ivanov

[1] http://richardstartin.uk/vectorised-algorithms-in-java/
[2] http://richardstartin.uk/vector-api-dot-product/
[3] http://richardstartin.uk/vectorised-polynomial-hash-codes/

On 21/08/2018 21:28, Vladimir Ivanov wrote:
> Hi,
> 
> So far, there was not much attention paid to the shape of vector loops.
> The examples either worked on properly sized data or had an explicit 
> tail loop, which doesn't look pretty:
> 
>      int i = 0;
>      for (; i + spec.length() < a.length; i += spec.length()) {
>          var va = spec.fromArray(a, i);
>          var vb = spec.fromArray(b, i);
>          var vc = va.mul(vb);
>          vc.intoArray(c, i);
>      }
>      for (; i < a.length; i++) {
>          c[i] = a[i] * b[i];
>      }
> 
> It may look like a quirk introduced by x86-centric considerations (where 
> vector sizes are hard-coded in vector instructions), but actually it is 
> not.
> 
> For comparison, starting with JDK 9 the C2 auto-vectorizer produces 4 
> loops from a single scalar loop:
>    (1) pre-loop  (scalar)
>    (2) main loop (vectorized)
>    (3) post-loop (vectorized)
>    (4) post-loop (scalar)
> 
> Only the last one (#4) is required for correctness; the other two (#1 and 
> #3) are introduced for performance reasons:
>    * the pre-loop achieves alignment of subsequent vector accesses;
>    * the vectorized post-loop reduces the overhead of the scalar version 
> after a heavily unrolled main loop.
> 
> I believe that irrespective of the way instructions are encoded, 
> different microarchitectural considerations may justify auxiliary loops 
> to improve performance. That makes the ability to automatically derive 
> customized loop versions an important tool for improving performance 
> irrespective of the actual hardware.
> 
> There have been attempts to come up with an API to describe kernels in a 
> canonical way, and different approaches have been discussed in the 
> past (e.g., the Expression Language by Ian Graves and lambda cracking [1] 
> by John Rose), but for various reasons they haven't made their way 
> into the Vector API yet.
> 
> While waiting for a better solution at the language level, is there a 
> better way to represent a vector loop and get it well optimized without 
> repeating the kernel over and over again?
> 
> I'll try to explore an approach based on partial vectors and predicated 
> operations. I'll focus on AVX512 and pre-AVX512 on x86, but I believe it 
> should generalize to SVE as well.
> 
> Let's try to get rid of the scalar post-loop and fuse the tail processing 
> into the last iteration of the vector loop using masks:
> 
>    for (int i = 0; i < a.length; i += spec.length()) {
>        Mask<Float,S> m = spec.index(a.length - i);
>        var va = spec.fromArray(a, i, m);
>        var vb = spec.fromArray(b, i, m);
>        var vc = va.mul(vb,           m);
>        vc.intoArray(c, i,            m);
>    }
> 
> Vector.Species::index(int) computes the set of active elements on each 
> iteration.
> 
> Mask m has a very regular shape: it is always all-1s except on the very 
> last iteration, when it contains consecutive 1s for the remaining elements.
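> 
> As a minimal sketch of that semantics (the maskFromArray helper used below 
> is an assumption for illustration, not necessarily the actual API), lane j 
> of index(n) is set iff j < n, so the mask is all-1s whenever n >= length():
> 
>    Mask<Float,S> index(int n) {
>        boolean[] bits = new boolean[spec.length()];
>        for (int j = 0; j < bits.length; j++) {
>            bits[j] = j < n;   // all-1s while a full vector's worth remains
>        }
>        return spec.maskFromArray(bits, 0); // hypothetical mask constructor
>    }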
> 
> After loop unrolling, the JIT should be able to prove that all masks in the 
> main loop are "all 1s", since (a.length - i) is always larger than the 
> vector size. Adding an intrinsic for index(...) may help to reliably 
> communicate that invariant to the JIT:
> 
>    int i = 0;
>    for (; i < a.length - 4*spec.length(); i += 4*spec.length()) {
>        var m1 = spec.index(a.length - (i + 0*spec.length())); // all 1s
>        ...
>        var m2 = spec.index(a.length - (i + 1*spec.length())); // all 1s
>        ...
>        var m3 = spec.index(a.length - (i + 2*spec.length())); // all 1s
>        ...
>        var m4 = spec.index(a.length - (i + 3*spec.length())); // all 1s
>        ...
>    }
>    for (; i < a.length; i += spec.length()) { // post-loop
>        var m  = spec.index(a.length - i);
>        var va = spec.fromArray(a, i, m);
>        var vb = spec.fromArray(b, i, m);
>        var vc = va.mul(vb,            m);
>        vc.intoArray(c, i,             m);
>    }
> 
> ... and all masked operations in the main loop can be strength-reduced to 
> non-masked ones:
> 
>    for (; i < a.length - 4*spec.length(); i += 4*spec.length()) {
>        int i0 = i + 0*spec.length();
>        var va0 = spec.fromArray(a, i0);
>        var vb0 = spec.fromArray(b, i0);
>        var vc0 = va0.mul(vb0);
>        vc0.intoArray(c, i0);
> ...
>        int i3 = i + 3*spec.length();
>        var va3 = spec.fromArray(a, i3);
>        var vb3 = spec.fromArray(b, i3);
>        var vc3 = va3.mul(vb3);
>        vc3.intoArray(c, i3);
>    }
>    for (; i < a.length; i += spec.length()) { // post-loop
>        var m  = spec.index(a.length - i);
>        var va = spec.fromArray(a, i, m);
>        var vb = spec.fromArray(b, i, m);
>        var vc = va.mul(vb,           m);
>        vc.intoArray(c, i,            m);
>    }
> 
> A similar trick can be applied to get the pre-loop generated automatically.
> 
> After loop peeling, the first iteration can serve as pre-loop:
> 
>    int align = align(a); // align array accesses on cache line boundaries
>    for (int i = 0; i < a.length; i += (i == 0 ? align : spec.length())) {
>        Mask<Float,S> m = spec.index(i == 0 ? align : a.length - i);
>        var va = spec.fromArray(a, i, m);
>        var vb = spec.fromArray(b, i, m);
>        var vc = va.mul(vb,           m);
>        vc.intoArray(c, i,            m);
>    }
> 
> ... turns into:
> 
>    Mask<Float,S> m = spec.index(align);
>    var va = spec.fromArray(a, 0, m);
>    var vb = spec.fromArray(b, 0, m);
>    var vc = va.mul(vb,           m);
>    vc.intoArray(c, 0,            m);
> 
>    for (int i = align; i < a.length; i += spec.length()) {
>        Mask<Float,S> m = spec.index(a.length - i);
>        var va = spec.fromArray(a, i, m);
>        var vb = spec.fromArray(b, i, m);
>        var vc = va.mul(vb,           m);
>        vc.intoArray(c, i,            m);
>    }
> 
> Such a strategy allows 2 auxiliary loops (pre- and post-) to be derived 
> automatically during JIT compilation. The only thing left uncovered is 
> the scalar post-loop. That makes it hard for the JIT to generate optimal 
> code for the post-loop if there's no hardware support for predicated 
> instructions (e.g., pre-AVX512 on x86), but the main loop (where most of 
> the work should happen) becomes predicate-free.
> 
> 
> The ugly part at the API level is that all operations have to be explicitly 
> masked. It would be nice to have a way to apply a context mask to all 
> operations. For example, have vector shapes which encapsulate both a 
> vector and a mask:
> 
>    for (int i = align; i < a.length; i += spec.length()) {
>        Mask<Float,S> m = spec.index(a.length - i);
>        FloatSpecies<S> mspec = spec.masked(m); // set context mask m
>        var va = mspec.fromArray(a, i);         // fromArray(a, i, m)
>        var vb = mspec.fromArray(b, i);         // fromArray(b, i, m)
>        var vc = va.mul(vb);                    // va.mul(vb, m)
>        vc.intoArray(c, i);                     // vc.intoArray(c, i, m)
>    }
> 
> The mask becomes part of the vector shape and participates in type 
> checking, so not only the type and size have to match, but also the mask. 
> Otherwise, an exception is thrown.
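> 
> A minimal sketch of what such a mask-carrying species could look like (the 
> wrapper class below is an illustrative assumption; only fromArray is shown, 
> other operations delegate the same way):
> 
>    class MaskedFloatSpecies<S extends Vector.Shape> {
>        final FloatSpecies<S> spec;
>        final Mask<Float,S> m;    // context mask applied to every operation
> 
>        MaskedFloatSpecies(FloatSpecies<S> spec, Mask<Float,S> m) {
>            this.spec = spec;
>            this.m = m;
>        }
> 
>        FloatVector<S> fromArray(float[] a, int i) {
>            return spec.fromArray(a, i, m);   // context mask applied implicitly
>        }
>    }
> 
> Since mspec in the example above is typed as FloatSpecies<S>, a real 
> implementation would presumably be a FloatSpecies subclass rather than a 
> separate wrapper, with the mask checked whenever vectors produced under 
> differently masked species meet.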
> 
> Best regards,
> Vladimir Ivanov
> 
> [1] http://cr.openjdk.java.net/~jrose/panama/token-codes.html

