[vector] ByteVector swizzle

Thu May 31 00:44:38 UTC 2018

Hi Richard

Regarding swizzle for 256 ByteVectors, with current implementation of swizzle APIs, I have added code generation for swizzle for 8 byte and 16 byte ByteVectors by using pshufb. Would need to use permd/q for from 256 bits and beyond for pre avx512.
With avx512bw, vpermb can be used for swizzle of ByteVectors from 256 bits and onwards.
As per Paul's email, Once the Swizzle API gets stabilized, I am planning to add the intrinsics for ByteVector swizzles for 256 bits  and beyond.

Regards,
Vivek

From: Richard Startin [mailto:richard at openkappa.co.uk]
Sent: Wednesday, May 30, 2018 2:55 PM
To: Paul Sandoz <paul.sandoz at oracle.com>; Deshpande, Vivek R <vivek.r.deshpande at intel.com>
Cc: panama-dev at openjdk.java.net
Subject: Re: [vector] ByteVector swizzle

Bit count is a useful primitive in data analytics, and currently a bitset can only be reduced to its bit count in Java via a scalar loop. It is often composed with bitset intersections, unions, etc. which can be autovectorised so it ends up being the limiting factor in computing e.g. the Jaccard index of two sets.

Regarding vector bit counts, an algorithm to map a vector of four longs to a vector of four bit counts is implemented in clang: https://reviews.llvm.org/rL238636. The resultant vector can be reduced horizontally to a single bit count, or, if computing the bit count of an array, reduced vertically as part of the "Harley-Seal" reduction. This approach has been shown to be twice as fast as a scalar reduction with popcntq written in C, and is almost possible to implement currently with the vector API. The majority of Harley-Seal is easy to write with the current API, but the vector bit count is currently impossible to write. Ignoring all other constraints, there are two ways to make it possible for users to implement this:

1) Provide a 256 bit count on LongVector<Shapes.S256Bit> such as in clang.

2) Provide the building blocks for a 256 bit count

2 i) Expose _mm256_shuffle_epi8 somehow on ByteVector. This has many applications beyond bit count.

2 ii) Expose _mm256_srli_epi16 on ByteVector. I was surprised it wasn't possible to shift bytes right. Is there a fundamental reason not to?

My attempt at building my own looks like this, which doesn't compile because the contents of ByteVector can't be shifted right (see shiftR), and because the ByteVector lookups ("lookupPos" and "lookupNeg") can't be shuffled by a ByteVector control masks ("high" and "low") (see YMM_BYTE.shuffleFromVector).

private LongVector<Shapes.S256Bit> popcount256(LongVector<Shapes.S256Bit> vector) {
  var rebracketed = (ByteVector<Shapes.S256Bit>)vector.rebracket(YMM_BYTE);
  var lookupPos = YMM_BYTE.fromArray(LOOKUP_POS, 0);
  var lookupNeg = YMM_BYTE.fromArray(LOOKUP_NEG, 0);
  var lowMask = YMM_BYTE.broadcast((byte)0x0f);
  var low = rebracketed.and(lowMask);
  var high = rebracketed.shiftR(4).and(lowMask);
  return (LongVector<Shapes.S256Bit>)lookupPos.swizzle(YMM_BYTE.shuffleFromVector(low))
          .add(lookupNeg.swizzle(YMM_BYTE.shuffleFromVector(high))).rebracket(YMM_LONG);
}

private static byte[] LOOKUP_POS = new byte[] {
        /* 0 */ 4 + 0, /* 1 */ 4 + 1, /* 2 */ 4 + 1, /* 3 */ 4 + 2,
        /* 4 */ 4 + 1, /* 5 */ 4 + 2, /* 6 */ 4 + 2, /* 7 */ 4 + 3,
        /* 8 */ 4 + 1, /* 9 */ 4 + 2, /* a */ 4 + 2, /* b */ 4 + 3,
        /* c */ 4 + 2, /* d */ 4 + 3, /* e */ 4 + 3, /* f */ 4 + 4,
        /* 0 */ 4 + 0, /* 1 */ 4 + 1, /* 2 */ 4 + 1, /* 3 */ 4 + 2,
        /* 4 */ 4 + 1, /* 5 */ 4 + 2, /* 6 */ 4 + 2, /* 7 */ 4 + 3,
        /* 8 */ 4 + 1, /* 9 */ 4 + 2, /* a */ 4 + 2, /* b */ 4 + 3,
        /* c */ 4 + 2, /* d */ 4 + 3, /* e */ 4 + 3, /* f */ 4 + 4
};

private static byte[] LOOKUP_NEG = new byte[] {
        /* 0 */ 4 - 0, /* 1 */ 4 - 1, /* 2 */ 4 - 1, /* 3 */ 4 - 2,
        /* 4 */ 4 - 1, /* 5 */ 4 - 2, /* 6 */ 4 - 2, /* 7 */ 4 - 3,
        /* 8 */ 4 - 1, /* 9 */ 4 - 2, /* a */ 4 - 2, /* b */ 4 - 3,
        /* c */ 4 - 2, /* d */ 4 - 3, /* e */ 4 - 3, /* f */ 4 - 4,
        /* 0 */ 4 - 0, /* 1 */ 4 - 1, /* 2 */ 4 - 1, /* 3 */ 4 - 2,
        /* 4 */ 4 - 1, /* 5 */ 4 - 2, /* 6 */ 4 - 2, /* 7 */ 4 - 3,
        /* 8 */ 4 - 1, /* 9 */ 4 - 2, /* a */ 4 - 2, /* b */ 4 - 3,
        /* c */ 4 - 2, /* d */ 4 - 3, /* e */ 4 - 3, /* f */ 4 - 4
};

There is a reference implementation in C in CRoaring: https://github.com/RoaringBitmap/CRoaring/blob/35162bcb4e959b22af298cdbb73b8c654a1ff70f/include/roaring/bitset_util.h#L265

Thanks,

Richard

________________________________
From: Paul Sandoz <paul.sandoz at oracle.com<mailto:paul.sandoz at oracle.com>>
Sent: 30 May 2018 20:05:56
To: Richard Startin; Deshpande, Vivek R
Cc: panama-dev at openjdk.java.net<mailto:panama-dev at openjdk.java.net>
Subject: Re: [vector] ByteVector swizzle

Hi Richard,

Your email is timely. I was just proposing to change this part of the API to more generally convert a vector to a shuffle and vice versa. The current approach is too limiting, see:

  http://cr.openjdk.java.net/~psandoz/panama/shuffle-shuffle/webrev/

Vivek is working on the swizzle intrinsic, see:

  http://cr.openjdk.java.net/~vdeshpande/VectorAPI_swizzle/webrev.00/

he can talk more about the code generation. I suspect we may need a few iterations to get more optimal (including optimal runtime representations of Shuffle itself). Any help/analysis would be greatly appreciated.

I want to reimplement the cross-lane operations rotateEL/rotateER/shiftEL/shiftER using swizzle. The current plan is to focus on a set of *Vector methods that are intrinsic and can be developed with our current engineering resources, and in addition include useful methods composed from those intrinsics that generate good code.

I am not familiar with the algorithms for vector bit counts but if we can compose vector bit count from existing intrinsics then it's a strong candidate to add to the API now (otherwise we will need to evaluate the work required for a specific intrinsic and consider deferring to a later revision, and possibly add the non-optimal method to the helper classes).

Thanks,
Paul.

> On May 29, 2018, at 8:46 AM, Richard Startin <richard at openkappa.co.uk<mailto:richard at openkappa.co.uk>> wrote:
>
> I would like to permute the elements of a ByteVector<Shapes.S256Bit>. I would expect the semantics to be similar to _mm256_shuffle_epi8/VPSHUFB, which would entail the use of a ByteVector<Shapes.S256Bit> to specify the permutation. Currently Species.shuffleFromVector only supports Vector<Integer, Shape> as an argument. Is there a plan to expand this?
>
> Will swizzle in its current state use VPSHUFD/_mm256_shiffle_epi32?
>
> For what it's worth my aim is to implement a bit count for a 256 bit integral vector, where for a DIY implementation a ByteVector.shiftLeft would be ideal, but without access to VPSHUFB it's impossible. There are well known algorithms for vector bitcounts, perhaps this would be a useful addition to the API?
>
> Kind regards,
> Richard Startin