[vector] ARM SVE (panama-dev Digest, Vol 40, Issue 28)

Wed May 16 12:49:08 UTC 2018

Hi Andrew

On 14/05/2018 18:51, Andrew Haley wrote:

Sorry for the slow response.

On 03/01/2018 08:43 PM, John Rose wrote:

One vector shape type I very much want to see prototyped soon is the
"loose end" shape, which is derived from a system preferred shape,
but has an odd smaller size.  Basically, it is a system-appropriate
vector which is derived from a standard vector, but with a suitable
mask or count that encodes the odd bit left over after all the full
vectors have been processed.  The vector might be either a full
vector plus count, or else a lgN sized collection of successively
half-sized sub-vectors, plus a final scalar.  Depends on the
platform, but the API is simple: It finishes your loops for you.  A
similar type (or the same in some cases) will handle the warm-up of
loops where alignment to a multi-lane block is desirable.  Clearly
SVE has its own take on how to do this.
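
For illustration, the "successively half-sized sub-vectors plus a final
scalar" variant can be sketched in portable C. This is only a scalar model
under stated assumptions: LANES (a power-of-two full-vector width),
copy_block and copy_with_loose_end are hypothetical names, and each
copy_block call stands in for one vector operation of that width.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical full-vector width in 32-bit lanes; assumed to be a power of two. */
enum { LANES = 8 };

/* Copy exactly `width` lanes; stands in for one vector operation of that width. */
static void copy_block(int32_t *d, const int32_t *s, size_t width)
{
    for (size_t k = 0; k < width; k++)
        d[k] = s[k];
}

/* Copy n elements: full blocks first, then the "loose end" as sub-blocks
   of LANES/2, LANES/4, ..., 1 lanes.  Returns the number of block
   operations issued. */
size_t copy_with_loose_end(int32_t *d, const int32_t *s, size_t n)
{
    size_t i = 0, ops = 0;

    /* Main loop: whole vectors only. */
    for (; n - i >= LANES; i += LANES, ops++)
        copy_block(d + i, s + i, LANES);

    /* Fewer than LANES elements remain.  Each set bit in the remainder
       selects one half-sized sub-block, ending with a single scalar. */
    for (size_t width = LANES / 2; width >= 1; width /= 2) {
        if (n - i >= width) {
            copy_block(d + i, s + i, width);
            i += width;
            ops++;
        }
    }
    return ops;
}
```

With LANES = 8 and n = 13, this issues one full block, then a 4-lane block
and a 1-lane block, so the tail costs at most lg(LANES) extra operations.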

SVE says you don't have to do this at all: it can often automagically
handle the case where a vector is an odd size, and will (also
automagically) create the mask for the tail as required.

Please forgive me, but I find it really hard to talk about this in a
purely abstract way.  I think I need an example to explain my point.
A really common case such as

  for (size_t i = 0; i < n; i++)
    *d++ = *s++;

requires no handling of heads, tails or even subvectors:

        cbz     x2, .L1
        mov     x3, 0               ; int i = 0
        mov     x4, x2              ; int tmp = n
        whilelo p0.s, xzr, x2       ; Set each element k of p0 to TRUE
                                    ; while 0 + k < n

        uqdecw  x4                  ; Decrement tmp by the number of 32-bit elements
                                    ; per vector (saturating at 0)
        ptrue   p1.s, all           ; Set all predicate elements in p1 to TRUE

.L3:
        ld1w    z0.s, p0/z, [x0, x3, lsl 2]  ; Load the active elements of z0 from s[i] onwards
        st1w    z0.s, p0, [x1, x3, lsl 2]    ; Store the active elements of z0 to d[i] onwards
        whilelo p0.s, x3, x4                 ; Set each element k of p0 to TRUE
                                             ; while i + k < tmp

        incw    x3                  ; increment i (x3) by the number of elements per vector
        ptest   p1, p0.b            ; Test p0, setting flags
        bne     .L3
.L1:

I suspect that you've cut and pasted output from a (less than optimal) compiler. It can be done more optimally:

1) The final "ptest" is unnecessary since the "while" instructions already set the condition flags based on an "all-true" governing predicate, hence the initial "ptrue" is unnecessary.

2) You can use "uqincw" inside the loop instead of "incw" to prevent unsigned overflow of "i" if "n" is close to SIZE_MAX, which removes the need for the initial "uqdecw".

3) You could choose to rely on predication to handle the n == 0 case, at the cost of performing one iteration with an empty predicate.
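
To make point 2 concrete, here is a small C model of the two increments.
It is a sketch rather than anything a compiler emits: LANES is an assumed
element count per vector, and inc_wrapping / inc_saturating are
hypothetical helpers standing in for "incw" and "uqincw" on a 64-bit
register.

```c
#include <stdint.h>

/* Hypothetical number of 32-bit elements per vector. */
enum { LANES = 8 };

/* Plain increment, modelling "incw": wraps modulo 2^64. */
uint64_t inc_wrapping(uint64_t i)
{
    return i + LANES;
}

/* Saturating increment, modelling "uqincw": clamps at UINT64_MAX. */
uint64_t inc_saturating(uint64_t i)
{
    return (i > UINT64_MAX - LANES) ? UINT64_MAX : i + LANES;
}
```

If i is within one vector of the top of the unsigned 64-bit range, the
wrapping form returns to a small value that again compares lower than n,
so the loop would start over. The saturating form stays at the maximum,
the following "whilelo" produces an empty predicate, and the loop exits.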

Therefore our "fantasy" SVE compiler might do this:

        mov     x3, 0               ; int i = 0
        whilelo p0.s, xzr, x2       ; Set each element k of p0 to TRUE
                                    ; while 0 + k < n
.L3:    ld1w    z0.s, p0/z, [x0, x3, lsl 2]  ; Load the active elements of z0 from s[i] onwards
        st1w    z0.s, p0, [x1, x3, lsl 2]    ; Store the active elements of z0 to d[i] onwards
        uqincw  x3                  ; uns saturating increment i (x3) by # of 32b "words" in vector
        whilelo p0.s, x3, x2        ; Set each element k of p0 to TRUE while i + k < n
        b.first .L3                 ; continue while first predicate element is active
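
The control flow of the loop above can be modelled in scalar C as a
cross-check. This is a sketch under assumptions, not SVE code: LANES is a
made-up fixed vector length (real SVE discovers it at run time), and
whilelo, uqincw and predicated_copy are hypothetical C helpers named
after the instructions they imitate.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vector length in 32-bit lanes. */
enum { LANES = 8 };

/* Model of "whilelo p.s, base, limit": lane k is active iff base + k < limit,
   evaluated without wraparound.  Returns whether lane 0 is active, which is
   the condition that "b.first" tests. */
static bool whilelo(bool p[LANES], uint64_t base, uint64_t limit)
{
    for (uint64_t k = 0; k < LANES; k++)
        p[k] = base < limit && k < limit - base;
    return p[0];
}

/* Model of "uqincw": unsigned saturating increment by the lane count. */
static uint64_t uqincw(uint64_t x)
{
    return (x > UINT64_MAX - LANES) ? UINT64_MAX : x + LANES;
}

/* Copy n elements from s to d with no separate head or tail code.
   Returns the number of loop iterations executed. */
size_t predicated_copy(int32_t *d, const int32_t *s, uint64_t n)
{
    bool p[LANES];
    uint64_t i = 0;
    size_t iterations = 0;

    whilelo(p, 0, n);                   /* initial predicate */
    do {
        for (uint64_t k = 0; k < LANES; k++)
            if (p[k])                   /* predicated ld1w + st1w */
                d[i + k] = s[i + k];
        i = uqincw(i);
        iterations++;
    } while (whilelo(p, i, n));         /* whilelo + b.first */
    return iterations;
}
```

With LANES = 8, n = 13 takes two iterations (eight active lanes, then
five), and n = 0 takes one iteration with an empty predicate, which is the
cost mentioned in point 3.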

The Vector API seems to be tolerant of multiple levels of
abstraction, so we can play games like that.

Mmm.  The ideal model from SVE's point of view is simply an IntVector
or somesuch.  And it seems to me that from the point of view of ease
of use, maintainability, and so on, that's an easier model for
programmers to think about.

It may even be possible to build mega-Vectors, in the same API or a
variant, which have the VCODE-like property of large data dependent
sizes (and masks).  (And permutations.  At full-problem sizes, a
shuffle turns into a routing problem, with potential reductions at
collision points.  A very rich parallel computing paradigm.)  Such a
mega-vector has a close correspondence to today's streams,

Yes, exactly.  I'd love a flexible vector type, much like a stream, at the
Java level so that our programmers don't have to worry about heads and
tails but can just use operations on vectors.

As an accident of history, our Intel friends have helped us to do
much of our portability work just on x64, by providing several
different vector architectures to port to, all in one convenient
place.

I think that perhaps ARM needs to have some skin in this game.  They
have developed techniques to generate efficient code for SVE, and it
would be very useful to have some input from them.  I'm not absolutely
convinced that something like SVE is necessarily the way to go, but it
looks attractive to me, and coding vectors at a higher (stream-like)
level sounds like something I'd enjoy a lot more than messing with
heads, tails, and alignment.
