[vector api] VarInt Compression - Am I doing something wrong?
Cleber Muramoto
cleber.muramoto at gmail.com
Sun Mar 10 16:40:12 UTC 2019
Hello guys, recently I've been playing with the Vector API in an attempt to
implement variable-length integer encoding.
Basically, what I'm doing is testing whether 8 contiguous integers simultaneously
fall into any of the 4 ranges [0,0x80), [0x80,0x4000), [0x4000,0x200000)
or [0x200000,0x10000000), so that (parts of) the following scalar code can be
vectorized:
public static int compressScalar(int[] src, byte[] dst, int soff, int doff, int slim) {
    for (int i = soff; i < slim; i++) {
        int v = src[i];
        if (v >>> 7 == 0) {
            dst[doff++] = (byte) v;
        } else if (v >>> 14 == 0) {
            dst[doff++] = (byte) (v & 0x7F | 0x80);
            dst[doff++] = (byte) (v >>> 7);
        } else if (v >>> 21 == 0) {
            dst[doff++] = (byte) (v & 0x7F | 0x80);
            dst[doff++] = (byte) (v >>> 7 | 0x80);
            dst[doff++] = (byte) (v >>> 14);
        } else if (v >>> 28 == 0) {
            dst[doff++] = (byte) (v & 0x7F | 0x80);
            dst[doff++] = (byte) (v >>> 7 | 0x80);
            dst[doff++] = (byte) (v >>> 14 | 0x80);
            dst[doff++] = (byte) (v >>> 21);
        } else {
            dst[doff++] = (byte) (v & 0x7F | 0x80);
            dst[doff++] = (byte) (v >>> 7 | 0x80);
            dst[doff++] = (byte) (v >>> 14 | 0x80);
            dst[doff++] = (byte) (v >>> 21 | 0x80);
            dst[doff++] = (byte) (v >>> 28);
        }
    }
    return doff;
}
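For reference, the smallest value in the 4-byte range encodes like this (just a
quick sanity check, not part of the benchmark):

// 0x200000 is the first value in [0x200000, 0x10000000), so the v >>> 28 == 0 branch fires:
// byte 0: 0x200000 & 0x7F | 0x80                -> 0x80
// byte 1: low 8 bits of (0x200000 >>> 7) | 0x80 -> 0x80
// byte 2: low 8 bits of (0x200000 >>> 14) | 0x80 -> 0x80
// byte 3: 0x200000 >>> 21                       -> 0x01
byte[] out = new byte[4];
int written = compressScalar(new int[] { 0x200000 }, out, 0, 0, 1);
// written == 4, out == { (byte) 0x80, (byte) 0x80, (byte) 0x80, 0x01 }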
The corresponding vectorized code is an attempt to port a C implementation (
https://github.com/cmuramoto/panama-playground/blob/master/c/compressor.c)
and looks like this:
public static int compress(int[] src, byte[] dst, int soff, int doff) {
    int slim = (src.length - src.length % 8) - 8;
    int tail = (src.length - soff) % 8;
    while (soff <= slim) {
        var chunk = _mm256i.fromArray(src, soff, allTrue);
        if (chunk.lessThan(byte1_test).allTrue()) {
            // all 8 values < 0x80: encode and write 8 bytes
        } else if (chunk.lessThan(byte2_test).allTrue()) {
            // all 8 values < 0x4000: encode and write 16 bytes
        } else if (chunk.lessThan(byte3_test).allTrue()) {
            // all 8 values < 0x200000: encode and write 24 bytes
        } else if (chunk.lessThan(byte4_test).allTrue()) {
            // all 8 values < 0x10000000: each int becomes exactly 4 bytes.
            // qN holds the N-th output byte of each value; shiftER(N) moves it
            // to byte offset N within its group so the ORs interleave them.
            var q0 = (ByteVector) chunk.and(0x7F).or(0x80).and(0xFF).reinterpret(_mm256i_byte);
            var q1 = (ByteVector) chunk.shiftR(7).or(0x80).and(0xFF).reinterpret(_mm256i_byte);
            q1 = q1.shiftER(1);
            var q2 = (ByteVector) chunk.shiftR(14).or(0x80).and(0xFF).reinterpret(_mm256i_byte);
            q2 = q2.shiftER(2);
            var q3 = (ByteVector) chunk.shiftR(21).and(0xFF).reinterpret(_mm256i_byte);
            q3 = q3.shiftER(3);
            var store = q0.or(q1.or(q2.or(q3)));
            store.intoArray(dst, doff);
            doff += 32;
        } else {
            // mixed categories: fall back to the scalar implementation
            doff = compressScalar(src, dst, soff, doff, soff + 8);
        }
        soff += 8;
    }
    if (tail != 0) {
        doff = compressScalar(src, dst, soff, doff, soff + tail);
    }
    return doff;
}
where byte1_test = IntVector.species(Shape.S_256_BIT).broadcast(0x80) and
so on.
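For completeness, the rest of the setup looks roughly like this (a sketch only;
the mask factory name in particular is from memory and may differ in your
incubator build):

// constants used by compress(); thresholds are the lower bounds of the next size category
var _mm256i      = IntVector.species(Shape.S_256_BIT);   // 8 int lanes
var _mm256i_byte = ByteVector.species(Shape.S_256_BIT);  // 32 byte lanes
var allTrue      = _mm256i.maskAllTrue();                // assumption: all-lanes-set mask factory
var byte1_test   = _mm256i.broadcast(0x80);              // values below this fit in 1 byte
var byte2_test   = _mm256i.broadcast(0x4000);            // 2 bytes
var byte3_test   = _mm256i.broadcast(0x200000);          // 3 bytes
var byte4_test   = _mm256i.broadcast(0x10000000);        // 4 bytes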
In my tests, compressing, N times, 32 fixed integers in the 4-byte category
[0x200000, 0x200000 + 32), I got the following results (openjdk
version "13-internal" 2019-09-17, VM args:
--add-modules=jdk.incubator.vector):
Scalar:
N=100,000 (C: 6 ms, Java: 7 ms)
N=1,000,000 (C: 58 ms, Java: 65 ms)
N=10,000,000 (C: 577 ms, Java: 647 ms)
Vector:
N=100,000 (C: 1 ms, Java: 38 ms)
N=1,000,000 (C: 12 ms, Java: 319 ms)
N=10,000,000 (C: 115 ms, Java: 2790 ms)
(Roughly 5x improvement in C and 4x deterioration in Java)
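The numbers above come from a driver loop of roughly this shape (a minimal
sketch of what is being timed; buffer sizes and constants are illustrative):

int[] src = new int[32];
for (int i = 0; i < 32; i++) {
    src[i] = 0x200000 + i;              // all values in the 4-byte category
}
byte[] dst = new byte[32 * 4];          // every value expands to exactly 4 bytes
long start = System.currentTimeMillis();
for (int n = 0; n < N; n++) {
    compress(src, dst, 0, 0);           // or compressScalar(src, dst, 0, 0, 32)
}
System.out.println((System.currentTimeMillis() - start) + " ms");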
Since I couldn't figure out where the bottleneck lies, I wrote a variant
where the loads and tests are vector-based, but the actual encoding of the
frames uses the scalar implementation. In this case, the numbers I got were:
Test Vector/Encode Scalar:
N=100,000 (Java: 16 ms)
N=1,000,000 (Java: 157 ms)
N=10,000,000 (Java: 1483 ms)
which is roughly 2x faster than the fully vectorized version, meaning the
fromArray/lessThan/allTrue operations seem to be 'half the culprit'.
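That variant keeps the same classification chain but always delegates the
write to compressScalar, roughly like this (sketch; the real branch structure
mirrors the vectorized loop above):

while (soff <= slim) {
    var chunk = _mm256i.fromArray(src, soff, allTrue);
    if (chunk.lessThan(byte1_test).allTrue()
            || chunk.lessThan(byte2_test).allTrue()
            || chunk.lessThan(byte3_test).allTrue()
            || chunk.lessThan(byte4_test).allTrue()) {
        // classification is still vectorized...
        doff = compressScalar(src, dst, soff, doff, soff + 8);
    } else {
        // ...but either way the frame is written by the scalar routine
        doff = compressScalar(src, dst, soff, doff, soff + 8);
    }
    soff += 8;
}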
Right now, are there any improvements that can be made to the code, or any
-XX:... option I'm missing?
Regards!