[vector] Partial permutations

Fri Feb 1 22:19:59 UTC 2019

I looked at your population count study - it looks very promising - is there any chance that this will end up available on the IntVector|LongVector|ShortVector|ByteVector interfaces? That way the user could benefit from the AVX-512 instruction on the right hardware.

________________________________
From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
Sent: 30 January 2019 22:29:20
To: Richard Startin; John Rose
Cc: panama-dev at openjdk.java.net
Subject: Re: [vector] Partial permutations

> The point of sharing this example is not just that I really hope it will one day be possible to use the API to count
>
> bits quickly (in which case it would probably be even better implemented within the API, using the AVX-512 vpopcnt
>
> extension when available!), but to draw attention to an existing idiom that I suspect may not be served well by the
>
> semantics of Vector.rearrange, even if it is currently the most appropriate approximation in the current API. I leave
>
> it to the experts to think about whether or how to make this possible, but I hope that it is helpful to draw attention
>
> to the kinds of things that would occur to a potential user to implement with the API. It wouldn't matter much to me as a
>
> potential user whether this would be expressed as a gather or as a permutation, or even if there were a secret incantation
>
> expressible in terms of masks: I just anticipate the need to do parallel lookups without undue overhead.

FTR I've been doing a similar exercise recently and ended up with the
following examples: PopulationCount [1], SumOfUnsignedBytes [2],
SortVector [3], Merge [4].

I ended up with a number of idioms which would be good to optimize on
the backend side until there's a way to express them more directly.

> Regarding the SWAR hacks, there's a reason for them: lack of unsigned arithmetic. I don't think it is
>
> possible to implement the horizontal reduction in terms of signed byte arithmetic. I think that lack
>
> of unsigned arithmetic is another obstacle to lifting and shifting existing vector algorithms to the vector API.
>
> I have previously tried and failed to implement StreamVByte compression (which incidentally also requires
>
> similar permutations and species changes) with the vector API, for lack of unsigned 8 bit arithmetic.

I had similar experience, plus saturated mode & SAD (sum of absolute
differences).

Regarding your original question:
 > It would be nice to be able to hint somehow that only half of the
result need be permuted, allowing the permutation to "short circuit" -
potentially using faster code.

IMO the most promising approach is to split vector in parts, permute
some of them, and then merge them back. I believe that's the best way to
communicate the semantics of the operation to the compiler.

Best regards,
Vladimir Ivanov

[1]
http://hg.openjdk.java.net/panama/dev/raw-file/0447b2b7d707/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/jdk/incubator/vector/PopulationCount.java

[2]
http://hg.openjdk.java.net/panama/dev/raw-file/0447b2b7d707/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/jdk/incubator/vector/SumOfUnsignedBytes.java

[3]
http://hg.openjdk.java.net/panama/dev/raw-file/0447b2b7d707/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/jdk/incubator/vector/SortVector.java

[4]
http://hg.openjdk.java.net/panama/dev/raw-file/0447b2b7d707/test/jdk/jdk/incubator/vector/benchmark/src/main/java/benchmark/jdk/incubator/vector/Merge.java

> ________________________________
> From: John Rose <john.r.rose at oracle.com>
> Sent: 30 January 2019 01:34:53
> To: Richard Startin
> Cc: panama-dev at openjdk.java.net
> Subject: Re: [vector] Partial permutations
>
> On Jan 29, 2019, at 2:44 PM, Richard Startin <richard at openkappa.co.uk<mailto:richard at openkappa.co.uk>> wrote:
>
>         lo = lo.add(counts.rearrange(v1.and(0x0F0F0F0F).rebracket(YMM_BYTE).toShuffle()).rebracket(YMM_INT));
>
> Breaking it down:
>
> var v1 = …;  // u4[8]
> var nibblePerByte = v1.and(0x0F0F0F0F).rebracket(YMM_BYTE);  // u1[32]
> var bitCountPerByte = counts.rearrange(nibblePerByte.toShuffle());  // u1[32]
> var swarLanes = bitCountPerByte.rebracket(YMM_INT);  // u4[8]
> lo = lo.add(expandedCounts);  // danger: carry-out across bytes?
>
> The second line zeroes out half of the bit positions and breaks out the
> 8 int lanes into 32 byte lanes.
>
> The third line replaces each byte (whose value is in 0..15) with another
> byte that gives the bit-count (value in 0..4).
>
> The fourth line undoes the action of the second line.  Now we have
> four lanes each of which is an integer containing 4 "SIMD across a
> register" sub-lanes.
>
> It seems like the top half of your NIBBLE_COUNTS table could be any
> old garbage, not a copy of the bottom half.
>
> (Did I get it right?  This kind of code is tricky!)
>
> The final line adds 8x4 lanes and their sub-lanes.  I think you have
> a bug in your block size, since if all the input bits are set, then
> each nibble count will always be 4, and so each sub-lane count
> will accumulate that value, 256/4 times, which will cause a carry
> to propagate across your SWAR sub-lanes.
>
> The idiom counts.rearrange(nibblePerByte.toShuffle()) would read better,
> in this algorithm, as nibblePerByte.selectFrom(counts).  It's really akin to
> a gather operation in this use case, and the similarity could be made
> more clear by using a gather-like API (which we are still working on).
>
> To respond to your question:  If I follow you, you note that since the
> top half of the NIBBLE_COUNTS array is never used, then any instructions
> which would do the work of loading from that half of the array are useless.
>
> To me, this calls the question of how to do strength-reduction optimizations
> on shuffles.
>
> I don't think we want explicit API points for simplified shuffles, any more
> than we want library routines or operators for division when both operands
> are positive, even though there are optimizations relevant in both cases.
>
> The best way to get improved code for simplified inputs is to work hard
> to detect those inputs and adjust the instruction selection (or IR
> generation) accordingly.  In this case, the shuffle vector is the logical
> "and" of 0x0F and some random input, which means that, ideally the
> optimizer should be able to prove that the selected lanes of "counts"
> are in the range 0..15.  That's probably enough to simplify a complex
> shuffle idiom in the back end.
>
> Failing lanewise type analysis (but why should we fail that?), a pattern
> match in the C2 backend might also detect that the shuffle was immediately
> derived from a masking operation.  In either case, the optimization might
> be confounded by the SWAR hacks going on.  Perhaps the right answer
> is to rebracket to YMM_BYTE before the logical "and" (of a byte 0x0F).
> This is sadly delicate, but can be made more robust by an improvement
> to the C2 type system to track individual bit positions, or adoption of Graal,
> which already has that feature.
>
> The overall point here is that some of our good code will come from a
> strength reduction of expensive operations, gated by type and pattern
> analysis.  The design problem is to put enough primitives into the user
> API but not too many, and enough optimization in the backend.
>
> — John
>