[vector] Some thoughts on alternative vector loop shapes

Tue Aug 21 18:28:58 UTC 2018

Hi,

So far, there was not much attention paid to the shape of vector loops.
The examples either worked on properly sized data or having explicit 
tail loop which doesn't look pretty:

     int i = 0;
     for (; i + spec.length() < a.length; i += spec.length()) {
         var va = spec.fromArray(a, i);
         var vb = spec.fromArray(b, i);
         var vc = va.mul(vb);
         vc.intoArray(c, i);
     }
     for (; i < a.length; i++) {
         c[i] = a[i] * b[i];
     }

It may look like a crux introduced by x86-centric considerations (where 
vector sizes are hard-coded in vector instructions), but actually it was 
not.

Comparing to what C2 auto-vectorizer does, starting jdk9, it produces 4 
loops from a single scalar loop:
   (1) pre-loop  (scalar)
   (2) main loop (vectorized)
   (3) post-loop (vectorized)
   (4) post-loop (scalar)

And only the last one (#4) is required for correctness, 2 others (#1 and 
#3) are introduced for performance reasons:
   * pre-loop achieves alignment of consequent vector accesses;
   * vectorized post-loop reduces the overhead of scalar version after 
heavily unrolled main loop.

I believe that irrespective of the way instructions are encoded, 
different microarchitectural considerations may justify auxiliary loops 
to improve performance. It makes ability to automatically derive 
customized loop versions an important tool to improve performance 
irrespective of actual hardware.

There have been attempts to come up with an API to describe kernels in 
canonical way and there have been different approaches discussed in the 
past (e.g., Expression Language by Ian Graves and lambda cracking [1] by 
John Rose), but due to different reasons they haven't made their way 
into the Vector API yet.

While waiting for a better solution on language level, is there a better 
way to represent a vector loop and get it well optimized without 
repeating the kernel over and over again?

I'll try to explore an approach based on partial vectors and predicated 
operations. I'll focus on AVX512 and pre-AVX512 on x86, but I believe it 
should generalize to SVE as well.

Let's try to get rid of scalar post-loop and fuse the tail processing 
into last iteration of vector loop using masks:

   for (int i = 0; i < a.length; i += spec.length()) {
       Mask<Float,S> m = spec.index(a.length - i);
       var va = spec.fromArray(a, i, m);
       var vb = spec.fromArray(b, i, m);
       var vc = va.mul(vb,           m);
       vc.intoArray(c, i,            m);
   }

Vector.Species::index(int) computes the set of active elements on each 
iteration.

Mask m has very regular shape: it is always all-1s except on the very 
last iteration when it contains consecutive 1s for the remaining elements.

After loop unrolling, JIT should be able prove that all masks in main 
loop are "all 1s" since (a.length - i) is always larger than vector 
size. Adding an intrinsic for index(...) may help to reliably 
communicate that invariant to JIT:

   int i = 0;
   for (; i < a.length - 4*spec.length(); i += 4*spec.length()) {
       var m1 = spec.index(a.length - (i + 0*spec.length()));/ / all 1s
       ...
       var m2 = spec.index(a.length - (i + 1*spec.length())); // all 1s
       ...	
       var m3 = spec.index(a.length - (i + 2*spec.length())); // all 1s
       ...
       var m4 = spec.index(a.length - (i + 3*spec.length())); // all 1s
       ...
   }
   for (; i < a.length; i += spec.length()) { // post-loop
       var m  = spec.index(a.length - i);
       var va = spec.fromArray(a, i, m);
       var vb = spec.fromArray(b, i, m);
       var vc = va.mul(vb,            m);
       vc.intoArray(c, i,             m);
   }

... and all masked operations in main loop can be strength-reduced to 
non-masked ones:

   for (; i < a.length - 4*spec.length(); i += 4*spec.length()) {
       int i0 = i + 0*spec.length();
       var va0 = spec.fromArray(a, i0);
       var vb0 = spec.fromArray(b, i0);
       var vc0 = va0.mul(vb0);
       vc0.intoArray(c, i0);
...
       int i3 = i + 3*spec.length();
       var va3 = spec.fromArray(a, i3);
       var vb3 = spec.fromArray(b, i3);
       var vc3 = va3.mul(vb3);
       vc3.intoArray(c, i3);
   }
   for (; i < a.length; i += spec.length()) { // post-loop
       var m  = spec.index(a.length - i);
       var va = spec.fromArray(a, i, m);
       var vb = spec.fromArray(b, i, m);
       var vc = va.mul(vb,           m);
       vc.intoArray(c, i,            m);
   }

Similar trick can be applied to get pre-loop automatically generated.

After loop peeling, the first iteration can serve as pre-loop:

   int align = align(a); // align array accesses on cache line boundaries
   for (int i = 0; i < a.length; i += (i == 0 ? align : spec.length()) {
       Mask<Float,S> m = spec.index(i == 0 ? align : a.length - i);
       var va = spec.fromArray(a, i, m);
       var vb = spec.fromArray(b, i, m);
       var vc = va.mul(vb,           m);
       vc.intoArray(c, i,            m);
   }

... turns into:

   Mask<Float,S256Bit> m = spec.index(align);
   var va = spec.fromArray(a, i, m);
   var vb = spec.fromArray(b, i, m);
   var vc = va.mul(vb,           m);
   vc.intoArray(c, i,            m);

   for (int i = align; i < a.length; i += spec.length()) {
       Mask<Float,S> m = spec.index(a.length - i);
       var va = spec.fromArray(a, i, m);
       var vb = spec.fromArray(b, i, m);
       var vc = va.mul(vb,           m);
       vc.intoArray(c, i,            m);
   }

Such strategy allows to automatically derive 2 auxiliary loops (pre- and 
post-) during JIT compilation. The only thing which is left uncovered is 
scalar post loop. It makes it hard for JIT to generate most optimal code 
for post loop if there's no hardware support of predicated instructions 
(e.g., pre-AVX512 on x86), but main loop (where most of the work should 
happen) becomes predicate-free.

The ugly part on API level is that all operations have to be explicitly 
masked. It would be nice to have a way to define a context mask to all 
operations. For example, have vector shapes which encapsulates both 
vector and a mask:

   for (int i = align; i < a.length; i += spec.length()) {
       Mask<Float,S> m = spec.index(a.length - i);
       FloatSpecies<S> mspec = spec.masked(m); // set context mask m
       var va = mspec.fromArray(a, i);         // fromArray(a, i, m)
       var vb = mspec.fromArray(b, i);         // fromArray(b, i, m)
       var vc = va.mul(vb);                    // va.mul(vb, m)
       vc.intoArray(c, i);                     // vc.intoArray(c, i, m)
   }

Mask becomes part of vector shapes and participates in type checking, so 
not only type and size should match, but also the mask. Otherwise, an 
exception is thrown.

Best regards,
Vladimir Ivanov

[1] http://cr.openjdk.java.net/~jrose/panama/token-codes.html