[vector] ByteVector swizzle

Thu May 31 01:13:46 UTC 2018

Hi Richard,

Thanks, a very nice use-case. FWIW some massaging when translating the CRoaring example would be required since the swizzle semantics of the Vector API differs from _mm256_shuffle_epi8 (shuffling within 128 bit lanes).

Regarding swizzle. My hope is that we can provide a solution soon, but as i said it might require a few rounds to get right.

Regarding logical shifts. We recently made a decision to remove the bit shifting on byte/short vectors so we can focus our engineering efforts on other areas. After some analysis we concluded we needed more time to evaluate/implement:

  http://mail.openjdk.java.net/pipermail/panama-dev/2018-May/001730.html

I hope we can revisit these operations after we have completed our first round of optimized operations. It may be possible we could support more quickly shifting on short vectors with deviation from the Java semantics. With additional rebracketing to a ShortVector i believe it should support the popcount use-case. 

(BTW i have been lobbying for a bit count on Mask :-) but it’s down in the priority list as well.)

Thanks,
Paul.

> On May 30, 2018, at 2:54 PM, Richard Startin <richard at openkappa.co.uk> wrote:
> 
> Bit count is a useful primitive in data analytics, and currently a bitset can only be reduced to its bit count in Java via a scalar loop. It is often composed with bitset intersections, unions, etc. which can be autovectorised so it ends up being the limiting factor in computing e.g. the Jaccard index of two sets. 
> 
> Regarding vector bit counts, an algorithm to map a vector of four longs to a vector of four bit counts is implemented in clang:https://reviews.llvm.org/rL238636. The resultant vector can be reduced horizontally to a single bit count, or, if computing the bit count of an array, reduced vertically as part of the "Harley-Seal" reduction. This approach has been shown to be twice as fast as a scalar reduction with popcntq written in C, and is almost possible to implement currently with the vector API. The majority of Harley-Seal is easy to write with the current API, but the vector bit count is currently impossible to write. Ignoring all other constraints, there are two ways to make it possible for users to implement this: 
> 
> 1) Provide a 256 bit count on LongVector<Shapes.S256Bit> such as in clang. 
> 2) Provide the building blocks for a 256 bit count
> 2 i) Expose _mm256_shuffle_epi8 somehow on ByteVector. This has many applications beyond bit count.
> 2 ii) Expose _mm256_srli_epi16 on ByteVector. I was surprised it wasn't possible to shift bytes right. Is there a fundamental reason not to?
> 
> My attempt at building my own looks like this, which doesn't compile because the contents of ByteVector can't be shifted right (see shiftR), and because the ByteVector lookups ("lookupPos" and "lookupNeg") can't be shuffled by a ByteVector control masks ("high" and "low") (see YMM_BYTE.shuffleFromVector). 
> 
> private LongVector<Shapes.S256Bit> popcount256(LongVector<Shapes.S256Bit> vector) {
>   var rebracketed = (ByteVector<Shapes.S256Bit>)vector.rebracket(YMM_BYTE);
>   var lookupPos = YMM_BYTE.fromArray(LOOKUP_POS, 0);
>   var lookupNeg = YMM_BYTE.fromArray(LOOKUP_NEG, 0);
>   var lowMask = YMM_BYTE.broadcast((byte)0x0f);
>   var low = rebracketed.and(lowMask);
>   var high = rebracketed.shiftR(4).and(lowMask);
>   return (LongVector<Shapes.S256Bit>)lookupPos.swizzle(YMM_BYTE.shuffleFromVector(low))
>           .add(lookupNeg.swizzle(YMM_BYTE.shuffleFromVector(high))).rebracket(YMM_LONG);
> }
> 
> private static byte[] LOOKUP_POS = new byte[] {
>         /* 0 */ 4 + 0, /* 1 */ 4 + 1, /* 2 */ 4 + 1, /* 3 */ 4 + 2,
>         /* 4 */ 4 + 1, /* 5 */ 4 + 2, /* 6 */ 4 + 2, /* 7 */ 4 + 3,
>         /* 8 */ 4 + 1, /* 9 */ 4 + 2, /* a */ 4 + 2, /* b */ 4 + 3,
>         /* c */ 4 + 2, /* d */ 4 + 3, /* e */ 4 + 3, /* f */ 4 + 4,
>         /* 0 */ 4 + 0, /* 1 */ 4 + 1, /* 2 */ 4 + 1, /* 3 */ 4 + 2,
>         /* 4 */ 4 + 1, /* 5 */ 4 + 2, /* 6 */ 4 + 2, /* 7 */ 4 + 3,
>         /* 8 */ 4 + 1, /* 9 */ 4 + 2, /* a */ 4 + 2, /* b */ 4 + 3,
>         /* c */ 4 + 2, /* d */ 4 + 3, /* e */ 4 + 3, /* f */ 4 + 4
> };
> 
> private static byte[] LOOKUP_NEG = new byte[] {
>         /* 0 */ 4 - 0, /* 1 */ 4 - 1, /* 2 */ 4 - 1, /* 3 */ 4 - 2,
>         /* 4 */ 4 - 1, /* 5 */ 4 - 2, /* 6 */ 4 - 2, /* 7 */ 4 - 3,
>         /* 8 */ 4 - 1, /* 9 */ 4 - 2, /* a */ 4 - 2, /* b */ 4 - 3,
>         /* c */ 4 - 2, /* d */ 4 - 3, /* e */ 4 - 3, /* f */ 4 - 4,
>         /* 0 */ 4 - 0, /* 1 */ 4 - 1, /* 2 */ 4 - 1, /* 3 */ 4 - 2,
>         /* 4 */ 4 - 1, /* 5 */ 4 - 2, /* 6 */ 4 - 2, /* 7 */ 4 - 3,
>         /* 8 */ 4 - 1, /* 9 */ 4 - 2, /* a */ 4 - 2, /* b */ 4 - 3,
>         /* c */ 4 - 2, /* d */ 4 - 3, /* e */ 4 - 3, /* f */ 4 - 4
> };
> 
> There is a reference implementation in C in CRoaring:https://github.com/RoaringBitmap/CRoaring/blob/35162bcb4e959b22af298cdbb73b8c654a1ff70f/include/roaring/bitset_util.h#L265
> 
> Thanks,
> Richard
> From: Paul Sandoz <paul.sandoz at oracle.com>
> Sent: 30 May 2018 20:05:56
> To: Richard Startin; Deshpande, Vivek R
> Cc: panama-dev at openjdk.java.net
> Subject: Re: [vector] ByteVector swizzle
>  
> Hi Richard,
> 
> Your email is timely. I was just proposing to change this part of the API to more generally convert a vector to a shuffle and vice versa. The current approach is too limiting, see:
> 
>   http://cr.openjdk.java.net/~psandoz/panama/shuffle-shuffle/webrev/
> 
> 
> Vivek is working on the swizzle intrinsic, see:
> 
>   http://cr.openjdk.java.net/~vdeshpande/VectorAPI_swizzle/webrev.00/
> 
> he can talk more about the code generation. I suspect we may need a few iterations to get more optimal (including optimal runtime representations of Shuffle itself). Any help/analysis would be greatly appreciated.
> 
> 
> I want to reimplement the cross-lane operations rotateEL/rotateER/shiftEL/shiftER using swizzle. The current plan is to focus on a set of *Vector methods that are intrinsic and can be developed with our current engineering resources, and in addition include useful methods composed from those intrinsics that generate good code.
> 
> I am not familiar with the algorithms for vector bit counts but if we can compose vector bit count from existing intrinsics then it's a strong candidate to add to the API now (otherwise we will need to evaluate the work required for a specific intrinsic and consider deferring to a later revision, and possibly add the non-optimal method to the helper classes).
> 
> Thanks,
> Paul. 
> 
> > On May 29, 2018, at 8:46 AM, Richard Startin <richard at openkappa.co.uk> wrote:
> > 
> > I would like to permute the elements of a ByteVector<Shapes.S256Bit>. I would expect the semantics to be similar to _mm256_shuffle_epi8/VPSHUFB, which would entail the use of a ByteVector<Shapes.S256Bit> to specify the permutation. Currently Species.shuffleFromVector only supports Vector<Integer, Shape> as an argument. Is there a plan to expand this? 
> > 
> > Will swizzle in its current state use VPSHUFD/_mm256_shiffle_epi32?
> > 
> > For what it’s worth my aim is to implement a bit count for a 256 bit integral vector, where for a DIY implementation a ByteVector.shiftLeft would be ideal, but without access to VPSHUFB it’s impossible. There are well known algorithms for vector bitcounts, perhaps this would be a useful addition to the API?
> > 
> > Kind regards,
> > Richard Startin