[vector] Some thoughts on alternative vector loop shapes
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Tue Aug 21 18:28:58 UTC 2018
Hi,
So far, there was not much attention paid to the shape of vector loops.
The examples either worked on properly sized data or having explicit
tail loop which doesn't look pretty:
int i = 0;
for (; i + spec.length() < a.length; i += spec.length()) {
var va = spec.fromArray(a, i);
var vb = spec.fromArray(b, i);
var vc = va.mul(vb);
vc.intoArray(c, i);
}
for (; i < a.length; i++) {
c[i] = a[i] * b[i];
}
It may look like a crux introduced by x86-centric considerations (where
vector sizes are hard-coded in vector instructions), but actually it was
not.
Comparing to what C2 auto-vectorizer does, starting jdk9, it produces 4
loops from a single scalar loop:
(1) pre-loop (scalar)
(2) main loop (vectorized)
(3) post-loop (vectorized)
(4) post-loop (scalar)
And only the last one (#4) is required for correctness, 2 others (#1 and
#3) are introduced for performance reasons:
* pre-loop achieves alignment of consequent vector accesses;
* vectorized post-loop reduces the overhead of scalar version after
heavily unrolled main loop.
I believe that irrespective of the way instructions are encoded,
different microarchitectural considerations may justify auxiliary loops
to improve performance. It makes ability to automatically derive
customized loop versions an important tool to improve performance
irrespective of actual hardware.
There have been attempts to come up with an API to describe kernels in
canonical way and there have been different approaches discussed in the
past (e.g., Expression Language by Ian Graves and lambda cracking [1] by
John Rose), but due to different reasons they haven't made their way
into the Vector API yet.
While waiting for a better solution on language level, is there a better
way to represent a vector loop and get it well optimized without
repeating the kernel over and over again?
I'll try to explore an approach based on partial vectors and predicated
operations. I'll focus on AVX512 and pre-AVX512 on x86, but I believe it
should generalize to SVE as well.
Let's try to get rid of scalar post-loop and fuse the tail processing
into last iteration of vector loop using masks:
for (int i = 0; i < a.length; i += spec.length()) {
Mask<Float,S> m = spec.index(a.length - i);
var va = spec.fromArray(a, i, m);
var vb = spec.fromArray(b, i, m);
var vc = va.mul(vb, m);
vc.intoArray(c, i, m);
}
Vector.Species::index(int) computes the set of active elements on each
iteration.
Mask m has very regular shape: it is always all-1s except on the very
last iteration when it contains consecutive 1s for the remaining elements.
After loop unrolling, JIT should be able prove that all masks in main
loop are "all 1s" since (a.length - i) is always larger than vector
size. Adding an intrinsic for index(...) may help to reliably
communicate that invariant to JIT:
int i = 0;
for (; i < a.length - 4*spec.length(); i += 4*spec.length()) {
var m1 = spec.index(a.length - (i + 0*spec.length()));/ / all 1s
...
var m2 = spec.index(a.length - (i + 1*spec.length())); // all 1s
...
var m3 = spec.index(a.length - (i + 2*spec.length())); // all 1s
...
var m4 = spec.index(a.length - (i + 3*spec.length())); // all 1s
...
}
for (; i < a.length; i += spec.length()) { // post-loop
var m = spec.index(a.length - i);
var va = spec.fromArray(a, i, m);
var vb = spec.fromArray(b, i, m);
var vc = va.mul(vb, m);
vc.intoArray(c, i, m);
}
... and all masked operations in main loop can be strength-reduced to
non-masked ones:
for (; i < a.length - 4*spec.length(); i += 4*spec.length()) {
int i0 = i + 0*spec.length();
var va0 = spec.fromArray(a, i0);
var vb0 = spec.fromArray(b, i0);
var vc0 = va0.mul(vb0);
vc0.intoArray(c, i0);
...
int i3 = i + 3*spec.length();
var va3 = spec.fromArray(a, i3);
var vb3 = spec.fromArray(b, i3);
var vc3 = va3.mul(vb3);
vc3.intoArray(c, i3);
}
for (; i < a.length; i += spec.length()) { // post-loop
var m = spec.index(a.length - i);
var va = spec.fromArray(a, i, m);
var vb = spec.fromArray(b, i, m);
var vc = va.mul(vb, m);
vc.intoArray(c, i, m);
}
Similar trick can be applied to get pre-loop automatically generated.
After loop peeling, the first iteration can serve as pre-loop:
int align = align(a); // align array accesses on cache line boundaries
for (int i = 0; i < a.length; i += (i == 0 ? align : spec.length()) {
Mask<Float,S> m = spec.index(i == 0 ? align : a.length - i);
var va = spec.fromArray(a, i, m);
var vb = spec.fromArray(b, i, m);
var vc = va.mul(vb, m);
vc.intoArray(c, i, m);
}
... turns into:
Mask<Float,S256Bit> m = spec.index(align);
var va = spec.fromArray(a, i, m);
var vb = spec.fromArray(b, i, m);
var vc = va.mul(vb, m);
vc.intoArray(c, i, m);
for (int i = align; i < a.length; i += spec.length()) {
Mask<Float,S> m = spec.index(a.length - i);
var va = spec.fromArray(a, i, m);
var vb = spec.fromArray(b, i, m);
var vc = va.mul(vb, m);
vc.intoArray(c, i, m);
}
Such strategy allows to automatically derive 2 auxiliary loops (pre- and
post-) during JIT compilation. The only thing which is left uncovered is
scalar post loop. It makes it hard for JIT to generate most optimal code
for post loop if there's no hardware support of predicated instructions
(e.g., pre-AVX512 on x86), but main loop (where most of the work should
happen) becomes predicate-free.
The ugly part on API level is that all operations have to be explicitly
masked. It would be nice to have a way to define a context mask to all
operations. For example, have vector shapes which encapsulates both
vector and a mask:
for (int i = align; i < a.length; i += spec.length()) {
Mask<Float,S> m = spec.index(a.length - i);
FloatSpecies<S> mspec = spec.masked(m); // set context mask m
var va = mspec.fromArray(a, i); // fromArray(a, i, m)
var vb = mspec.fromArray(b, i); // fromArray(b, i, m)
var vc = va.mul(vb); // va.mul(vb, m)
vc.intoArray(c, i); // vc.intoArray(c, i, m)
}
Mask becomes part of vector shapes and participates in type checking, so
not only type and size should match, but also the mask. Otherwise, an
exception is thrown.
Best regards,
Vladimir Ivanov
[1] http://cr.openjdk.java.net/~jrose/panama/token-codes.html
More information about the panama-dev
mailing list