[vector] Hash code and working with a single shape
John Rose
john.r.rose at oracle.com
Fri Mar 16 03:37:06 UTC 2018
On Mar 15, 2018, at 5:06 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
> Hi,
>
> When thinking about coding to a fixed shape, perhaps defined generically, the hash code of a byte[] is a useful example. Load as many bytes into the byte vector of shape S then operate over it four times in the integer domain for an int vector of shape S.
>
> The challenge is to efficiently transform from the byte vector to the int vector.
So you are pulling up 64 bytes at a time, and then working on 16 words at a time,
for a total of 4 passes. Each word is derived from a byte, using a cast operation,
which pulls the low subset of 16 (out of 64) byte lanes up across the whole vector.
There are two ways to proceed from here, in natural order or transposed order.
By natural order I mean that you process the words from the first 4 bytes, then
the next 4, and so on. That's what you have written below, and it is the most
useful building block.
In a linear problem like hashCode you could also go in transposed order, processing
the words from bytes 0/4/8/…/60, then 1/5/9/…/61, and so on. Doing it this way
does not require a cast but rather a rebracketing followed by an in-line sign extension
(or masking) operation, from byte-in-word to word. From the HW point of view,
there is no cross-lane communication, so it may be faster.
Just as the natural order requires a cross-lane shift (shiftEL) in the 4-way loop,
the transposed order requires what looks like an in-lane shift at the word level,
which can also be viewed as a short cross-lane shift at the byte level.
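In scalar terms (plain arrays, not the Vector API; the names here are mine), the reason both orders work is that the hash is a linear combination with coefficients 31^(63-i), so the terms can be summed in any order, including the strided, transposed one:

```java
// Scalar model of natural vs. transposed accumulation order over one
// 64-byte block of h = h0*31^64 + sum(a[i] * 31^(63-i)).
public class HashOrders {
    static final int N = 64;

    // Reference: the scalar recurrence h = 31*h + a[i] over one block.
    static int reference(byte[] a, int h0) {
        int h = h0;
        for (int i = 0; i < N; i++) h = 31 * h + a[i];
        return h;
    }

    // c[i] = 31^(N-1-i), computed mod 2^32 via ordinary int overflow,
    // which matches the wraparound in the hashCode recurrence itself.
    static int[] coeffs() {
        int[] c = new int[N];
        c[N - 1] = 1;
        for (int i = N - 2; i >= 0; i--) c[i] = 31 * c[i + 1];
        return c;
    }

    static int pow31(int n) {
        int p = 1;
        for (int i = 0; i < n; i++) p *= 31;
        return p;
    }

    // Natural order: lanes 0..15, then 16..31, ... (contiguous groups).
    static int natural(byte[] a, int h0, int[] c) {
        int h = h0 * pow31(N);
        for (int i = 0; i < N; i++) h += c[i] * a[i];
        return h;
    }

    // Transposed order: lanes 0/4/8/.../60, then 1/5/.../61, and so on.
    // Same coefficients, visited in strided order; the sum is unchanged
    // because int addition is commutative, even with overflow.
    static int transposed(byte[] a, int h0, int[] c) {
        int h = h0 * pow31(N);
        for (int j = 0; j < 4; j++)
            for (int i = j; i < N; i += 4) h += c[i] * a[i];
        return h;
    }
}
```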
A third way is to relax the requirement that you work with a constant vector
size and instead work with a constant lane count: an S128Bit species for the
bytes and an S512Bit species for the ints. That avoids the question of natural vs. transposed order
but it also probably makes you very x86-specific, even after generifying,
since it assumes there are two vector sizes to work with. However, it may
be that short vectors are a reasonable thing to simulate. Suppose the
HW supports only S512Bit natively but has nice masking operations.
Then a Java-level S128Bit could be simulated with a masked 512-bit
vector. All memory operations on the S128Bit type would suppress
the high 3/4 of the lane transfers, and the same with other lanewise
operations.
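A scalar model of that simulation (lane counts as array indices; names are illustrative, not API):

```java
// Hardware with a native 64-lane (S512Bit) byte register simulates an
// S128Bit load by enabling only the low 16 lanes under a mask; the
// suppressed lanes never touch memory and read as zero.
public class MaskedSim {
    static byte[] maskedLoad(byte[] mem, int offset, int nativeLanes, int simLanes) {
        byte[] v = new byte[nativeLanes];          // one native register
        for (int lane = 0; lane < nativeLanes; lane++)
            if (lane < simLanes)                   // mask = low simLanes lanes only
                v[lane] = mem[offset + lane];      // other transfers suppressed
        return v;                                  // high lanes stay zero
    }
}
```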
So the choices for this hashCode are:
 1. natural order, 4-way sub-loop, big cross-lane shifts
 2. transposed order, 4-way sub-loop, small shifts
 3. natural order, no sub-loop, 1/4-size byte-vector transactions
(The third one was our starting point.)
Would any hardware experts out there care to comment
on which choice is efficient where?
>
> Here’s one non-generic-in-shape solution:
>
> static int hashCodeVector512Shift(byte[] a) {
> int h = 1;
> int i = 0;
> for (; i < (a.length & ~(BYTE_512_SPECIES.length() - 1)); i += BYTE_512_SPECIES.length()) {
> ByteVector<Shapes.S512Bit> b = BYTE_512_SPECIES.fromArray(a, i);
>
> for (int j = 0; j < BYTE_512_SPECIES.length() / INT_512_SPECIES.length(); j++) {
> IntVector<Shapes.S512Bit> x = (IntVector<Shapes.S512Bit>) b.cast(INT_512_SPECIES);
> h = h * COEFF_31_TO_16 + x.mul(H_COEFF_16).addAll();
>
> b = (ByteVector<Shapes.S512Bit>) b.shiftEL(INT_512_SPECIES.length());
> }
> }
>
> for (; i < a.length; i++) {
> h = 31 * h + a[i];
> }
> return h;
> }
>
> I doubt this is optimal right now, vectorization of the shiftEL (perhaps a swizzle is a better approach), and also vectorization of the cast.
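As a scalar cross-check of the kernel above (assuming COEFF_31_TO_16 = 31^16 and H_COEFF_16 = the lane vector 31^15, …, 31^1, 31^0; names as in the quoted code, modeling otherwise in plain arrays):

```java
// One step of the 4-way sub-loop folds 16 int lanes into h as
// h = h * 31^16 + sum(x[k] * 31^(15-k)), which is algebraically the
// same as 16 rounds of the scalar recurrence h = 31*h + x[k].
public class BlockedHash {
    static int pow31(int n) {
        int p = 1;
        for (int i = 0; i < n; i++) p *= 31;
        return p;
    }

    static int blockedStep(int h, int[] x) {        // x has 16 lanes
        int acc = h * pow31(16);                    // h * COEFF_31_TO_16
        for (int k = 0; k < 16; k++)
            acc += x[k] * pow31(15 - k);            // x.mul(H_COEFF_16).addAll()
        return acc;
    }

    static int scalarStep(int h, int[] x) {
        for (int k = 0; k < 16; k++) h = 31 * h + x[k];
        return h;
    }
}
```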
The two heavy operations here are the cast (wide rightward cross-lane movement,
plus sign extension) and the shiftEL (wide leftward cross-lane movement).
Maybe it would be better to use a set of 4 swizzles, to pluck each byte out of
the place it initially appears in the loaded vector, and place it into a byte-sized
sublane of your int vector. Then do an in-lane sign extension (the JVM's b2i,
or shiftL(24).shiftR(24)). That would work for both of the natural-order techniques
mentioned above.
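On scalars, that in-lane sign extension is just a shift pair (Java's `>>` is arithmetic, so this is exactly the JVM's b2i applied to the low byte of the lane):

```java
public class SignExt {
    // Equivalent of b2i on the low 8 bits of an int lane:
    // shift the byte up to the sign position, then arithmetic-shift back.
    static int signExtendLowByte(int lane) {
        return (lane << 24) >> 24;   // sign bit of the byte propagates down
    }
}
```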
(Note to self: Remember that a swizzle is a quasi-permutation, while a shuffle
is a quasi-permutation of a stack of two vectors, returning the top half. A
quasi-permutation, really just an index set, is allowed to duplicate some
input lanes and drop others. In our API all quasi-permutations are represented
by Shuffle objects.)
For the transposed-order technique, no swizzle is needed, but you need a
shiftEL to move each byte down into its position within the int-lane it will
be expanded into. Maybe the transposed-order version will be fastest
for that reason. Note that you would have to transpose your coefficient
matrix also!
> Alternatively masks might be the longer term preferred approach to gather bytes from the byte vector into the int vector. A mask would initially be defined with the first N bits set, the cast accepts the mask, then the mask is shifted so only the next N bits of set. I don’t know how optimal this would be regarding vector instructions.
Using a bit mask to implicitly determine cross-lane motion is clever. It's a
version of the compress and expand operations found in APL and other vector languages.
FTR here's an explanation, plus a bit-routing pattern called "sheep and goats":
http://programming.sirrida.de/bit_perm.html#c_e
http://programming.sirrida.de/bit_perm.html#sag
On scalars, x86 calls compress "PEXT" and expand "PDEP". I don't think
there are vector versions of this. You are better off shifting or swizzling, I think.
For *constant* masks, it would make sense to derive a Shuffle from a Mask,
such that the Shuffle puts everything selected by the Mask, and nothing else,
at one end of the result, preserving relative order. What happens on the
other end? Well the SaG operation puts everything *not* selected at the
other end; the compress operation puts zeroes there. The x86 instructions
support both about equally well, and that's probably true everywhere.
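For concreteness, here are scalar sketches of compress (PEXT-style) and the sheep-and-goats routing described at the links above, written out bit by bit (recent JDKs expose the former as Long.compress, but the loop makes the semantics explicit):

```java
public class BitRouting {
    // Compress: move the mask-selected bits of x to the low end,
    // preserving their relative order; unselected positions get zero.
    static long compress(long x, long mask) {
        long out = 0;
        int k = 0;                                   // next output bit position
        for (int i = 0; i < 64; i++)
            if (((mask >>> i) & 1) != 0)
                out |= ((x >>> i) & 1) << k++;
        return out;
    }

    // Sheep-and-goats: selected bits at the low end, unselected bits at
    // the high end, relative order preserved within each group.
    static long sheepAndGoats(long x, long mask) {
        return compress(x, mask)
             | (compress(x, ~mask) << Long.bitCount(mask));
    }
}
```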
You know, the cast operation is not a bad thing to concentrate on.
It does a surprising amount of useful work. It could be vectorized
in an ad hoc manner by boiling it down to various kinds of shuffles
and shifts (for ints, more stuff for floats). Then we could look at those
ad hoc patterns and try to generalize them. For example, if a cast
naturally expands its result to a larger shape, then it would make
sense to accompany it with an indication of which part of the larger
shape to return (if the large output was not desirable). Likewise,
if a cast naturally contracts its result to a smaller shape, then
the cast could be accompanied by an indication of where to deposit
that smaller data within a larger shape (such as the original
shape). And this could interact in interesting ways with masks.
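A scalar sketch of that "cast with a part indication" idea (the method name and shape are hypothetical, not proposed API): a byte-to-int cast of a 64-lane byte vector naturally produces four int vectors' worth of widened data, and `part` selects which quarter to return.

```java
public class CastPart {
    // Widen one quarter of a 64-lane byte vector to a 16-lane int vector;
    // `part` (0..3) says which quarter of the expanded result to return.
    static int[] castBytePartToInt(byte[] lanes, int part) {
        int n = lanes.length / 4;               // 64 byte lanes -> 16 int lanes
        int[] v = new int[n];
        for (int k = 0; k < n; k++)
            v[k] = lanes[part * n + k];         // widening b2i sign-extends
        return v;
    }
}
```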
— John
More information about the panama-dev
mailing list