[vector] Partial permutations
Richard Startin
richard at openkappa.co.uk
Tue Jan 29 22:44:47 UTC 2019
Quite a lot of vector code uses shuffle instructions, not with the intent of rearranging a vector, but of doing a parallel lookup as fast as possible, even if it means wasting some of the result. For instance, ignoring one of vpshufb's parallel permutations it can be used to look up 16 values by nibble-valued keys in a single instruction. This technique is used in existing implementations of base 64 encoding, and various compression algorithms.
As an example of how this would look in the vector API, below is a routine to count the bits in a bitmap which actually works even if it isn't very fast yet. The ByteVector.rearrange looks up the precomputed bit counts of the 16 nibbles in each vector, it doesn't seem to have been implemented yet, but if the rearrange method must do a full 32 byte permutation (requiring several instructions) I suspect it might be doing too much work for the routine below to be viable (compared to a loop of Long.bitCount calls).
It would be nice to be able to hint somehow that only half of the result need be permuted, allowing the permutation to "short circuit" - potentially using faster code.
private static byte[] NIBBLE_COUNTS = new byte[] {
0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};
private long[] data;
@Benchmark
public int vectorBitCount() {
int bitCount = 0;
int block = 256;
for (int i = 0; i < data.length; i += block) {
var lo = YMM_INT.zero();
var hi = YMM_INT.zero();
var counts = YMM_BYTE.fromArray(NIBBLE_COUNTS, 0);
for (int j = 0; j < block; j += 4) {
var v1 = (IntVector)YMM_LONG.fromArray(data, i + j).rebracket(YMM_INT);
lo = lo.add(counts.rearrange(v1.and(0x0F0F0F0F).rebracket(YMM_BYTE).toShuffle()).rebracket(YMM_INT));
hi = hi.add(counts.rearrange(v1.shiftR(4).and(0x0F0F0F0F).rebracket(YMM_BYTE).toShuffle()).rebracket(YMM_INT));
}
bitCount += unsignedSum(lo);
bitCount += unsignedSum(hi);
}
return bitCount;
}
private int unsignedSum(IntVector bv) {
// convert to LongVector because Vector.get is slow
var lv = (LongVector) bv.rebracket(YMM_LONG);
return sumBytes(lv.get(0))
+ sumBytes(lv.get(1))
+ sumBytes(lv.get(2))
+ sumBytes(lv.get(3));
}
private int sumBytes(long w) {
return ((int)w & 0xFF)
+ (((int)(w >>> 8)) & 0xFF)
+ (((int)(w >>> 16)) & 0xFF)
+ (((int)(w >>> 24)) & 0xFF)
+ (((int)(w >>> 32)) & 0xFF)
+ (((int)(w >>> 40)) & 0xFF)
+ (((int)(w >>> 48)) & 0xFF)
+ (((int)(w >>> 56)) & 0xFF);
}
More information about the panama-dev
mailing list