[vector] Partial permutations

Richard Startin richard at openkappa.co.uk
Wed Jan 30 20:47:28 UTC 2019


Yes, that's the gist of it. The aim is to avoid horizontal reductions and
do as much accumulation in parallel as possible.

You are also correct that the upper 16 bytes in counts are superfluous, but
C/C++ programmers seem to do this a lot because vpshufb is so cheap. You are
also right that there is a bug when all bits are set; there are two ways
round it: change the block size to 128, or unroll the inner loop so there
are two low accumulators and two high accumulators, either of which avoids
the overflow. I didn't get round to deciding which way to go before I
realised that ByteVector.rearrange wasn't compiling to anything exciting yet
(for AVX2 at least). I look forward to revisiting this calculation in the
future.
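
For what it's worth, a plain scalar sketch of the blocking idea follows; it
is not the vector API code, and the class and method names are illustrative
only. Each byte-wide accumulator lane gains at most 4 per word, so flushing
to a wide total every 63 words keeps every lane at or below 252 and no carry
can cross a lane boundary:

    // Scalar illustration of the blocking fix: byte-wide SWAR lanes are
    // flushed before any lane can exceed 255, so plain int addition never
    // carries across lane boundaries. Names are illustrative only.
    public final class NibblePopcount {

        private static final byte[] NIBBLE_COUNTS = {
            0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
        };

        static long countBits(int[] data) {
            final int BLOCK = 63; // 4 * 63 = 252 <= 255, so no lane overflows
            long total = 0;
            for (int base = 0; base < data.length; base += BLOCK) {
                int lo = 0, hi = 0; // four byte-wide sub-lanes each
                int end = Math.min(base + BLOCK, data.length);
                for (int i = base; i < end; i++) {
                    int word = data[i];
                    lo += lookup(word & 0x0F0F0F0F);         // low nibble of each byte
                    hi += lookup((word >>> 4) & 0x0F0F0F0F); // high nibble of each byte
                }
                total += horizontalSum(lo) + horizontalSum(hi);
            }
            return total;
        }

        // Replaces each byte-sized nibble value (0..15) with its bit count (0..4).
        private static int lookup(int nibbles) {
            return  NIBBLE_COUNTS[nibbles & 0xFF]
                 | (NIBBLE_COUNTS[(nibbles >>> 8)  & 0xFF] << 8)
                 | (NIBBLE_COUNTS[(nibbles >>> 16) & 0xFF] << 16)
                 | (NIBBLE_COUNTS[(nibbles >>> 24) & 0xFF] << 24);
        }

        // Sums the four byte-wide sub-lanes of one SWAR accumulator.
        private static int horizontalSum(int swar) {
            return (swar & 0xFF) + ((swar >>> 8) & 0xFF)
                 + ((swar >>> 16) & 0xFF) + ((swar >>> 24) & 0xFF);
        }
    }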


The point of sharing this example is not just that I really hope it will one
day be possible to use the API to count bits quickly (in which case it would
probably be even better implemented within the API, using the AVX-512 vpopcnt
extension when available!), but to draw attention to an existing idiom that I
suspect may not be served well by the semantics of Vector.rearrange, even if
that is the most appropriate approximation available in the current API. I
leave it to the experts to think about whether or how to make this possible,
but I hope it is helpful to point out the kinds of things a potential user
would think of implementing with the API. It wouldn't matter much to me as a
potential user whether this were expressed as a gather or as a permutation,
or even as some secret incantation in terms of masks: I just anticipate the
need to do parallel lookups without undue overhead.


Regarding the SWAR hacks, there's a reason for them: the lack of unsigned
arithmetic. I don't think it is possible to implement the horizontal
reduction in terms of signed byte arithmetic. I think that lack of unsigned
arithmetic is another obstacle to lifting and shifting existing vector
algorithms to the vector API. I have previously tried and failed to
implement StreamVByte compression (which incidentally also requires similar
permutations and species changes) with the vector API, for lack of unsigned
8-bit arithmetic.
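
As a toy illustration of the problem (my own example, not code from the
algorithm above): a byte lane that has legitimately accumulated a count
above 127 reads back negative under Java's signed bytes, so any reduction
has to mask or widen each lane first, which is exactly the kind of step the
SWAR tricks are working around.

    // Toy example of the signed-byte pitfall; not from the original code.
    class SignedByteLane {
        public static void main(String[] args) {
            byte lane = (byte) 200;           // conceptually an unsigned count
            System.out.println(lane);         // prints -56: sign extension corrupts a naive sum
            System.out.println(lane & 0xFF);  // prints 200: widen-and-mask recovers the value
        }
    }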


________________________________
From: John Rose <john.r.rose at oracle.com>
Sent: 30 January 2019 01:34:53
To: Richard Startin
Cc: panama-dev at openjdk.java.net
Subject: Re: [vector] Partial permutations

On Jan 29, 2019, at 2:44 PM, Richard Startin <richard at openkappa.co.uk<mailto:richard at openkappa.co.uk>> wrote:

       lo = lo.add(counts.rearrange(v1.and(0x0F0F0F0F).rebracket(YMM_BYTE).toShuffle()).rebracket(YMM_INT));

Breaking it down:

var v1 = …;  // u4[8]
var nibblePerByte = v1.and(0x0F0F0F0F).rebracket(YMM_BYTE);  // u1[32]
var bitCountPerByte = counts.rearrange(nibblePerByte.toShuffle());  // u1[32]
var swarLanes = bitCountPerByte.rebracket(YMM_INT);  // u4[8]
lo = lo.add(swarLanes);  // danger: carry-out across bytes?

The second line zeroes out half of the bit positions and breaks out the
8 int lanes into 32 byte lanes.

The third line replaces each byte (whose value is in 0..15) with another
byte that gives the bit-count (value in 0..4).

The fourth line undoes the action of the second line.  Now we have
eight lanes, each of which is an integer containing 4 "SIMD within a
register" sub-lanes.

It seems like the top half of your NIBBLE_COUNTS table could be any
old garbage, not a copy of the bottom half.

(Did I get it right?  This kind of code is tricky!)

The final line adds the 8 int lanes, each carrying 4 byte-wide sub-lanes.
I think you have a bug in your block size, since if all the input bits are
set, then each nibble count will always be 4, and so each sub-lane will
accumulate that value 256/4 = 64 times, reaching 256, which overflows a
byte and causes a carry to propagate across your SWAR sub-lanes.

The idiom counts.rearrange(nibblePerByte.toShuffle()) would read better,
in this algorithm, as nibblePerByte.selectFrom(counts).  It's really akin to
a gather operation in this use case, and the similarity could be made
more clear by using a gather-like API (which we are still working on).
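
Concretely, the third line of the breakdown above would then read:

    var bitCountPerByte = nibblePerByte.selectFrom(counts);  // u1[32]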

To respond to your question:  If I follow you, you note that since the
top half of the NIBBLE_COUNTS array is never used, then any instructions
which would do the work of loading from that half of the array are useless.

To me, this calls the question of how to do strength-reduction optimizations
on shuffles.

I don't think we want explicit API points for simplified shuffles, any more
than we want library routines or operators for division when both operands
are positive, even though there are optimizations relevant in both cases.

The best way to get improved code for simplified inputs is to work hard
to detect those inputs and adjust the instruction selection (or IR
generation) accordingly.  In this case, the shuffle vector is the logical
"and" of 0x0F and some random input, which means that, ideally the
optimizer should be able to prove that the selected lanes of "counts"
are in the range 0..15.  That's probably enough to simplify a complex
shuffle idiom in the back end.

Failing lanewise type analysis (but why should we fail that?), a pattern
match in the C2 backend might also detect that the shuffle was immediately
derived from a masking operation.  In either case, the optimization might
be confounded by the SWAR hacks going on.  Perhaps the right answer
is to rebracket to YMM_BYTE before the logical "and" (of a byte 0x0F).
This is sadly delicate, but can be made more robust by an improvement
to the C2 type system to track individual bit positions, or adoption of Graal,
which already has that feature.
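
In code, the rewrite suggested here might look something like the following
(assuming the byte species supports an "and" with a broadcast scalar, the
way the int version in the original snippet does):

    // mask after rebracketing, so the byte-typed "and" exposes the 0..15
    // lane range of the shuffle source to the optimizer
    var nibblePerByte = v1.rebracket(YMM_BYTE).and((byte) 0x0F);  // u1[32]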

The overall point here is that some of our good code will come from a
strength reduction of expensive operations, gated by type and pattern
analysis.  The design problem is to put enough primitives into the user
API but not too many, and enough optimization in the backend.

— John

