[vector] Some thoughts on alternative vector loop shapes
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Mon Aug 27 12:20:33 UTC 2018
I touched on partial/sub vectors, but there's another use case left.
Richard Startin regularly resorts to manual unrolling of vector loops
during his experiments with the Vector API (e.g., [1] [2] [3]).
It's not because C2 doesn't perform enough unrolling, but to break the
dependency between iterations (a single accumulator forms a loop-carried
chain), which the JIT can't do on its own.
For example, Richard reported [2] a 2x improvement (on small input sizes)
after manually unrolling the loop 4 times:
var sum = FSPEC.zero();
for (int i = 0; i < size; i += FSPEC.length()) {
var l = FSPEC.fromArray(left, i);
var r = FSPEC.fromArray(right, i);
sum = l.fma(r, sum);
}
return sum.addAll();
vs
var sum1 = FSPEC.zero();
...
var sum4 = FSPEC.zero();
int width = FSPEC.length();
for (int i = 0; i < size; i += width * 4) {
var l1 = FSPEC.fromArray(left, i + width * 0);
var r1 = FSPEC.fromArray(right, i + width * 0);
sum1 = l1.fma(r1, sum1);
...
var l4 = FSPEC.fromArray(left, i + width * 3);
var r4 = FSPEC.fromArray(right, i + width * 3);
sum4 = l4.fma(r4, sum4);
}
return sum1.addAll() + ... + sum4.addAll();
What makes the difference is that on every iteration there are 4
independent vector accumulators in use: each fma waits only for its own
accumulator, so several fmas can be in flight at once instead of the whole
loop running at the latency of a single fma chain.
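For completeness, here is the unrolled kernel spelled out in full, in the
same style as the snippets above (the method name is mine, FSPEC is assumed
to be a FloatVector species as in Richard's examples, and the loop assumes
size is a multiple of 4 * FSPEC.length()):

static float dotUnrolled4(float[] left, float[] right, int size) {
    int width = FSPEC.length();
    var sum1 = FSPEC.zero();
    var sum2 = FSPEC.zero();
    var sum3 = FSPEC.zero();
    var sum4 = FSPEC.zero();
    for (int i = 0; i < size; i += width * 4) {
        // 4 independent accumulators: the fma chains don't wait on each other
        sum1 = FSPEC.fromArray(left, i + width * 0).fma(FSPEC.fromArray(right, i + width * 0), sum1);
        sum2 = FSPEC.fromArray(left, i + width * 1).fma(FSPEC.fromArray(right, i + width * 1), sum2);
        sum3 = FSPEC.fromArray(left, i + width * 2).fma(FSPEC.fromArray(right, i + width * 2), sum3);
        sum4 = FSPEC.fromArray(left, i + width * 3).fma(FSPEC.fromArray(right, i + width * 3), sum4);
    }
    return sum1.addAll() + sum2.addAll() + sum3.addAll() + sum4.addAll();
}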
The loop unrolling problem is similar to the loop customization problem
explored earlier, and the same approaches should cover it in the longer term.
Here I'll try to explore how it can be addressed with the current API and
introduce the concept of "super-vectors". A super-vector is a vector
which encapsulates a sequence of vectors.
It can represent vectors of a larger size than the hardware supports:
class Int1024Vector {
final Int512Vector lo;
final Int512Vector hi;
}
or provide an alternative (potentially more efficient) representation of a
vector which perfectly fits into a vector register:
class Int256Vector {
final Int128Vector lo;
final Int128Vector hi;
}
We considered that technique to improve on the graceful performance
degradation goal: if a user explicitly asks for a 512-bit vector on
hardware which doesn't have proper support, under the hood it can be
represented by a pair of 256-bit vectors (or a quad of 128-bit vectors).
It seems such a strategy nicely generalizes to cover the cases where manual
unrolling is beneficial right now.
Assume that S1024F represents a shape of 4 256-bit vectors. Then the
original version of the dot product on 1024-bit vectors:
var sum = S1024F.zero();
for (int i = 0; i < size; i += S1024F.length()) {
var l = S1024F.fromArray(left, i);
var r = S1024F.fromArray(right, i);
sum = l.fma(r, sum);
}
return sum.addAll();
under the hood is represented as a loop on 256-bit vectors unrolled 4 times:
S1024F.length() == 4 * S256F.length()

sum = S1024F.zero();            ==>  sum_0 = S256F.zero();
                                     sum_1 = S256F.zero();
                                     sum_2 = S256F.zero();
                                     sum_3 = S256F.zero();

l = S1024F.fromArray(left, i)   ==>  W = S256F.length()
                                     l_0 = S256F.fromArray(left, i + 0*W)
                                     l_1 = S256F.fromArray(left, i + 1*W)
                                     l_2 = S256F.fromArray(left, i + 2*W)
                                     l_3 = S256F.fromArray(left, i + 3*W)

r = S1024F.fromArray(right, i)  ==>  r_0 = S256F.fromArray(right, i + 0*W)
                                     r_1 = S256F.fromArray(right, i + 1*W)
                                     r_2 = S256F.fromArray(right, i + 2*W)
                                     r_3 = S256F.fromArray(right, i + 3*W)

sum = l.fma(r, sum)             ==>  sum_0 = l_0.fma(r_0, sum_0)
                                     sum_1 = l_1.fma(r_1, sum_1)
                                     sum_2 = l_2.fma(r_2, sum_2)
                                     sum_3 = l_3.fma(r_3, sum_3)

sum.addAll()                    ==>  sum_0.addAll() + ... + sum_3.addAll()
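To make that a bit more concrete, here is a rough sketch of what such a
composite implementation class could look like. The class name, the fixed
4 x 256-bit layout and the method set are illustrative only (written in the
same style as the snippets above), not a proposal for the actual
implementation:

// Hypothetical "super-vector": a 1024-bit float vector backed by four
// 256-bit parts. Every operation delegates to the parts, which is exactly
// the 4x-unrolled shape the JIT sees after inlining.
class Float1024Vector {
    final FloatVector v0, v1, v2, v3;   // four S256F vectors

    Float1024Vector(FloatVector v0, FloatVector v1,
                    FloatVector v2, FloatVector v3) {
        this.v0 = v0; this.v1 = v1; this.v2 = v2; this.v3 = v3;
    }

    static Float1024Vector zero() {
        return new Float1024Vector(S256F.zero(), S256F.zero(),
                                   S256F.zero(), S256F.zero());
    }

    static Float1024Vector fromArray(float[] a, int i) {
        int w = S256F.length();
        return new Float1024Vector(S256F.fromArray(a, i + 0*w),
                                   S256F.fromArray(a, i + 1*w),
                                   S256F.fromArray(a, i + 2*w),
                                   S256F.fromArray(a, i + 3*w));
    }

    Float1024Vector fma(Float1024Vector b, Float1024Vector c) {
        return new Float1024Vector(v0.fma(b.v0, c.v0), v1.fma(b.v1, c.v1),
                                   v2.fma(b.v2, c.v2), v3.fma(b.v3, c.v3));
    }

    float addAll() {
        return v0.addAll() + v1.addAll() + v2.addAll() + v3.addAll();
    }
}

With such a class inlined, the S1024F dot product above collapses into
exactly the 4-accumulator loop from the manually unrolled version.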
Best regards,
Vladimir Ivanov
[1] http://richardstartin.uk/vectorised-algorithms-in-java/
[2] http://richardstartin.uk/vector-api-dot-product/
[3] http://richardstartin.uk/vectorised-polynomial-hash-codes/
On 21/08/2018 21:28, Vladimir Ivanov wrote:
> Hi,
>
> So far, not much attention has been paid to the shape of vector loops.
> The examples either worked on properly sized data or had an explicit
> tail loop, which doesn't look pretty:
>
> int i = 0;
> for (; i + spec.length() < a.length; i += spec.length()) {
> var va = spec.fromArray(a, i);
> var vb = spec.fromArray(b, i);
> var vc = va.mul(vb);
> vc.intoArray(c, i);
> }
> for (; i < a.length; i++) {
> c[i] = a[i] * b[i];
> }
>
> It may look like an artifact of x86-centric considerations (where
> vector sizes are hard-coded in vector instructions), but actually it is
> not.
>
> Compare with what the C2 auto-vectorizer does: starting with JDK 9, it
> produces 4 loops from a single scalar loop:
> (1) pre-loop (scalar)
> (2) main loop (vectorized)
> (3) post-loop (vectorized)
> (4) post-loop (scalar)
>
> Only the last one (#4) is required for correctness; the other two
> auxiliary loops (#1 and #3) are introduced for performance reasons
> (see the sketch after the list):
> * the pre-loop achieves alignment of subsequent vector accesses;
> * the vectorized post-loop reduces the overhead of the scalar tail after
> the heavily unrolled main loop.
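>
> A rough sketch of those four loops for c[i] = a[i] * b[i], in the same
> style as the examples below; align and mainBound stand for bounds that C2
> computes itself, and the real main loop is additionally unrolled:
>
> int i = 0;
> for (; i < align; i++) {                    // (1) pre-loop (scalar): peel
>     c[i] = a[i] * b[i];                     //     until accesses are aligned
> }
> for (; i < mainBound; i += spec.length()) { // (2) main loop (vectorized,
>     spec.fromArray(a, i).mul(spec.fromArray(b, i)).intoArray(c, i); // unrolled)
> }
> for (; i + spec.length() <= a.length; i += spec.length()) { // (3) post-loop (vectorized)
>     spec.fromArray(a, i).mul(spec.fromArray(b, i)).intoArray(c, i);
> }
> for (; i < a.length; i++) {                 // (4) post-loop (scalar): the only
>     c[i] = a[i] * b[i];                     //     one required for correctness
> }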
>
> I believe that irrespective of the way instructions are encoded,
> different microarchitectural considerations may justify auxiliary loops
> to improve performance. That makes the ability to automatically derive
> customized loop versions an important tool for improving performance
> irrespective of the actual hardware.
>
> There have been attempts to come up with an API to describe kernels in a
> canonical way, and different approaches have been discussed in the
> past (e.g., Expression Language by Ian Graves and lambda cracking [1] by
> John Rose), but for various reasons they haven't made their way
> into the Vector API yet.
>
> While waiting for a better solution at the language level, is there a
> better way to represent a vector loop and get it well optimized without
> repeating the kernel over and over again?
>
> I'll try to explore an approach based on partial vectors and predicated
> operations. I'll focus on AVX512 and pre-AVX512 on x86, but I believe it
> should generalize to SVE as well.
>
> Let's try to get rid of the scalar post-loop and fuse the tail processing
> into the last iteration of the vector loop using masks:
>
> for (int i = 0; i < a.length; i += spec.length()) {
> Mask<Float,S> m = spec.index(a.length - i);
> var va = spec.fromArray(a, i, m);
> var vb = spec.fromArray(b, i, m);
> var vc = va.mul(vb, m);
> vc.intoArray(c, i, m);
> }
>
> Vector.Species::index(int) computes the set of active elements on each
> iteration.
>
> Mask m has a very regular shape: it is always all-1s, except on the very
> last iteration, when it contains consecutive 1s for the remaining elements.
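> For example, with an 8-lane species and a.length == 19, the loop runs
> three times and spec.index(...) produces (assuming index(n) sets the first
> min(n, length) lanes):
>
> i = 0:  spec.index(19) -> [1 1 1 1 1 1 1 1]   // all 1s
> i = 8:  spec.index(11) -> [1 1 1 1 1 1 1 1]   // all 1s
> i = 16: spec.index(3)  -> [1 1 1 0 0 0 0 0]   // only the tail is predicated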
>
> After loop unrolling, the JIT should be able to prove that all masks in
> the main loop are "all 1s", since (a.length - i) is always larger than the
> vector size. Adding an intrinsic for index(...) may help to reliably
> communicate that invariant to the JIT:
>
> int i = 0;
> for (; i < a.length - 4*spec.length(); i += 4*spec.length()) {
> var m1 = spec.index(a.length - (i + 0*spec.length())); // all 1s
> ...
> var m2 = spec.index(a.length - (i + 1*spec.length())); // all 1s
> ...
> var m3 = spec.index(a.length - (i + 2*spec.length())); // all 1s
> ...
> var m4 = spec.index(a.length - (i + 3*spec.length())); // all 1s
> ...
> }
> for (; i < a.length; i += spec.length()) { // post-loop
> var m = spec.index(a.length - i);
> var va = spec.fromArray(a, i, m);
> var vb = spec.fromArray(b, i, m);
> var vc = va.mul(vb, m);
> vc.intoArray(c, i, m);
> }
>
> ... and all masked operations in the main loop can be strength-reduced to
> non-masked ones:
>
> for (; i < a.length - 4*spec.length(); i += 4*spec.length()) {
> int i0 = i + 0*spec.length();
> var va0 = spec.fromArray(a, i0);
> var vb0 = spec.fromArray(b, i0);
> var vc0 = va0.mul(vb0);
> vc0.intoArray(c, i0);
> ...
> int i3 = i + 3*spec.length();
> var va3 = spec.fromArray(a, i3);
> var vb3 = spec.fromArray(b, i3);
> var vc3 = va3.mul(vb3);
> vc3.intoArray(c, i3);
> }
> for (; i < a.length; i += spec.length()) { // post-loop
> var m = spec.index(a.length - i);
> var va = spec.fromArray(a, i, m);
> var vb = spec.fromArray(b, i, m);
> var vc = va.mul(vb, m);
> vc.intoArray(c, i, m);
> }
>
> A similar trick can be applied to get the pre-loop generated automatically.
>
> After loop peeling, the first iteration can serve as the pre-loop:
>
> int align = align(a); // align array accesses on cache line boundaries
> for (int i = 0; i < a.length; i += (i == 0 ? align : spec.length())) {
> Mask<Float,S> m = spec.index(i == 0 ? align : a.length - i);
> var va = spec.fromArray(a, i, m);
> var vb = spec.fromArray(b, i, m);
> var vc = va.mul(vb, m);
> vc.intoArray(c, i, m);
> }
>
> ... turns into:
>
> Mask<Float,S> m = spec.index(align);
> var va = spec.fromArray(a, 0, m);
> var vb = spec.fromArray(b, 0, m);
> var vc = va.mul(vb, m);
> vc.intoArray(c, 0, m);
>
> for (int i = align; i < a.length; i += spec.length()) {
> Mask<Float,S> m = spec.index(a.length - i);
> var va = spec.fromArray(a, i, m);
> var vb = spec.fromArray(b, i, m);
> var vc = va.mul(vb, m);
> vc.intoArray(c, i, m);
> }
>
> Such a strategy allows the JIT to automatically derive the 2 auxiliary
> loops (pre- and post-) during compilation. The only thing left uncovered
> is the scalar post-loop. That makes it hard for the JIT to generate
> optimal code for the post-loop if there's no hardware support for
> predicated instructions (e.g., pre-AVX512 on x86), but the main loop
> (where most of the work should happen) becomes predicate-free.
>
>
> The ugly part at the API level is that all operations have to be
> explicitly masked. It would be nice to have a way to apply a context mask
> to all operations. For example, have vector shapes which encapsulate both
> a vector and a mask:
>
> for (int i = align; i < a.length; i += spec.length()) {
> Mask<Float,S> m = spec.index(a.length - i);
> FloatSpecies<S> mspec = spec.masked(m); // set context mask m
> var va = mspec.fromArray(a, i); // fromArray(a, i, m)
> var vb = mspec.fromArray(b, i); // fromArray(b, i, m)
> var vc = va.mul(vb); // va.mul(vb, m)
> vc.intoArray(c, i); // vc.intoArray(c, i, m)
> }
>
> The mask becomes part of the vector shape and participates in type
> checking, so not only the type and size have to match, but also the mask.
> Otherwise, an exception is thrown.
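>
> One trivial way to prototype the convenience part on top of the current
> API would be a thin wrapper that just re-applies a stored mask (purely
> illustrative; it doesn't give the shape/type-checking behavior described
> above):
>
> class MaskedFloatSpecies<S extends Vector.Shape> {
>     final FloatSpecies<S> spec;
>     final Mask<Float, S> m;              // context mask
>
>     MaskedFloatSpecies(FloatSpecies<S> spec, Mask<Float, S> m) {
>         this.spec = spec;
>         this.m = m;
>     }
>
>     FloatVector<S> fromArray(float[] a, int i) {
>         return spec.fromArray(a, i, m);  // implicitly masked load
>     }
>     // ... similar forwarding for other loads, stores and lane-wise ops
> }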
>
> Best regards,
> Vladimir Ivanov
>
> [1] http://cr.openjdk.java.net/~jrose/panama/token-codes.html