[vector] Partial permutations

Fri Feb 1 23:42:52 UTC 2019

> I looked at your population count study - it looks very promising - is 
> there any chance that this will end up available on the 
> IntVector|LongVector|ShortVector|ByteVector interfaces? That way the 
> user could benefit from the AVX-512 instruction on the right hardware.

Thanks.

Though the goal of the exercise was different, I believe that bitCount() 
is a good example of a functionality which deserves to be put on Vector: 
generic primitive which benefits a lot from hardware support.

The only obstacle on x86 I see is providing SSE/AVX/AVX2 
implementations, though it can be covered by default implementation at 
first.

Best regards,
Vladimir Ivanov

> ------------------------------------------------------------------------
> *From:* Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> *Sent:* 30 January 2019 22:29:20
> *To:* Richard Startin; John Rose
> *Cc:* panama-dev at openjdk.java.net
> *Subject:* Re: [vector] Partial permutations
> 
> 
>> The point of sharing this example is not just that I really hope it will one day be possible to use the API to count
>> 
>> bits quickly (in which case it would probably be even better implemented within the API, using the AVX-512 vpopcnt
>> 
>> extension when available!), but to draw attention to an existing idiom that I suspect may not be served well by the
>> 
>> semantics of Vector.rearrange, even if it is currently the most appropriate approximation in the current API. I leave
>> 
>> it to the experts to think about whether or how to make this possible, but I hope that it is helpful to draw attention
>> 
>> to the kinds of things that would occur to a potential user to implement with the API. It wouldn't matter much to me as a
>> 
>> potential user whether this would be expressed as a gather or as a permutation, or even if there were a secret incantation
>> 
>> expressible in terms of masks: I just anticipate the need to do parallel lookups without undue overhead.
> 
> FTR I've been doing a similar exercise recently and ended up with the
> following examples: PopulationCount [1], SumOfUnsignedBytes [2],
> SortVector [3], Merge [4].
> 
> I ended up with a number of idioms which would be good to optimize on
> the backend side until there's a way to express them more directly.
> 
> 
>> Regarding the SWAR hacks, there's a reason for them: lack of unsigned arithmetic. I don't think it is
>> 
>> possible to implement the horizontal reduction in terms of signed byte arithmetic. I think that lack
>> 
>> of unsigned arithmetic is another obstacle to lifting and shifting existing vector algorithms to the vector API.
>> 
>> I have previously tried and failed to implement StreamVByte compression (which incidentally also requires
>> 
>> similar permutations and species changes) with the vector API, for lack of unsigned 8 bit arithmetic.
> 
> I had similar experience, plus saturated mode & SAD (sum of absolute
> differences).
> 
> Regarding your original question:
>   > It would be nice to be able to hint somehow that only half of the
> result need be permuted, allowing the permutation to "short circuit" -
> potentially using faster code.
> 
> IMO the most promising approach is to split vector in parts, permute
> some of them, and then merge them back. I believe that's the best way to
> communicate the semantics of the operation to the compiler.
> 
> Best regards,
> Vladimir Ivanov
> 
> [1]
> http://hg.openjdk.java.net/panama/dev/raw-file/0447b2b7d707/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/jdk/incubator/vector/PopulationCount.java
> 
> [2]
> http://hg.openjdk.java.net/panama/dev/raw-file/0447b2b7d707/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/jdk/incubator/vector/SumOfUnsignedBytes.java
> 
> [3]
> http://hg.openjdk.java.net/panama/dev/raw-file/0447b2b7d707/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/jdk/incubator/vector/SortVector.java
> 
> [4]
> http://hg.openjdk.java.net/panama/dev/raw-file/0447b2b7d707/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/jdk/incubator/vector/Merge.java
> 
>> ________________________________
>> From: John Rose <john.r.rose at oracle.com>
>> Sent: 30 January 2019 01:34:53
>> To: Richard Startin
>> Cc: panama-dev at openjdk.java.net
>> Subject: Re: [vector] Partial permutations
>> 
>> On Jan 29, 2019, at 2:44 PM, Richard Startin <richard at openkappa.co.uk<mailto:richard at openkappa.co.uk>> wrote:
>> 
>>         lo = lo.add(counts.rearrange(v1.and(0x0F0F0F0F).rebracket(YMM_BYTE).toShuffle()).rebracket(YMM_INT));
>> 
>> Breaking it down:
>> 
>> var v1 = …;  // u4[8]
>> var nibblePerByte = v1.and(0x0F0F0F0F).rebracket(YMM_BYTE);  // u1[32]
>> var bitCountPerByte = counts.rearrange(nibblePerByte.toShuffle());  // u1[32]
>> var swarLanes = bitCountPerByte.rebracket(YMM_INT);  // u4[8]
>> lo = lo.add(expandedCounts);  // danger: carry-out across bytes?
>> 
>> The second line zeroes out half of the bit positions and breaks out the
>> 8 int lanes into 32 byte lanes.
>> 
>> The third line replaces each byte (whose value is in 0..15) with another
>> byte that gives the bit-count (value in 0..4).
>> 
>> The fourth line undoes the action of the second line.  Now we have
>> four lanes each of which is an integer containing 4 "SIMD across a
>> register" sub-lanes.
>> 
>> It seems like the top half of your NIBBLE_COUNTS table could be any
>> old garbage, not a copy of the bottom half.
>> 
>> (Did I get it right?  This kind of code is tricky!)
>> 
>> The final line adds 8x4 lanes and their sub-lanes.  I think you have
>> a bug in your block size, since if all the input bits are set, then
>> each nibble count will always be 4, and so each sub-lane count
>> will accumulate that value, 256/4 times, which will cause a carry
>> to propagate across your SWAR sub-lanes.
>> 
>> The idiom counts.rearrange(nibblePerByte.toShuffle()) would read better,
>> in this algorithm, as nibblePerByte.selectFrom(counts).  It's really akin to
>> a gather operation in this use case, and the similarity could be made
>> more clear by using a gather-like API (which we are still working on).
>> 
>> To respond to your question:  If I follow you, you note that since the
>> top half of the NIBBLE_COUNTS array is never used, then any instructions
>> which would do the work of loading from that half of the array are useless.
>> 
>> To me, this calls the question of how to do strength-reduction optimizations
>> on shuffles.
>> 
>> I don't think we want explicit API points for simplified shuffles, any more
>> than we want library routines or operators for division when both operands
>> are positive, even though there are optimizations relevant in both cases.
>> 
>> The best way to get improved code for simplified inputs is to work hard
>> to detect those inputs and adjust the instruction selection (or IR
>> generation) accordingly.  In this case, the shuffle vector is the logical
>> "and" of 0x0F and some random input, which means that, ideally the
>> optimizer should be able to prove that the selected lanes of "counts"
>> are in the range 0..15.  That's probably enough to simplify a complex
>> shuffle idiom in the back end.
>> 
>> Failing lanewise type analysis (but why should we fail that?), a pattern
>> match in the C2 backend might also detect that the shuffle was immediately
>> derived from a masking operation.  In either case, the optimization might
>> be confounded by the SWAR hacks going on.  Perhaps the right answer
>> is to rebracket to YMM_BYTE before the logical "and" (of a byte 0x0F).
>> This is sadly delicate, but can be made more robust by an improvement
>> to the C2 type system to track individual bit positions, or adoption of Graal,
>> which already has that feature.
>> 
>> The overall point here is that some of our good code will come from a
>> strength reduction of expensive operations, gated by type and pattern
>> analysis.  The design problem is to put enough primitives into the user
>> API but not too many, and enough optimization in the backend.
>> 
>> — John
>>