[vector api] VarInt Compression - Am I doing something wrong?

Viswanathan, Sandhya sandhya.viswanathan at intel.com
Wed Mar 13 00:21:38 UTC 2019


Hi Cleber,

Thanks a lot for trying out the Vector API, and for your feedback.

Looking at your code, one reason the performance could be low is that we don't yet have optimized implementations for the elemental shifts and rotates (shiftER, shiftEL, rotateER, rotateEL). These currently fall back to boxing/unboxing before/after each call, and hence cause a slowdown. We are in the process of implementing optimized elemental shifts and rotates and plan to have them in place in the next couple of weeks.
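
To illustrate, the unoptimized path for something like q1.shiftER(1) conceptually behaves like the sketch below (a simplified illustration, not the actual JDK fallback code; the species/array method names follow the snapshot API used in your code, and the direction assumes shiftER moves lanes toward higher indices, matching how your 4-byte branch uses it):

    // "unbox": spill all 32 byte lanes to an array
    byte[] lanes = new byte[32];
    q1.intoArray(lanes, 0);
    // move each lane over by one position in scalar code (zero fill)
    byte[] shifted = new byte[32];
    System.arraycopy(lanes, 0, shifted, 1, 31);
    // "box": reload the lanes as a vector
    q1 = (ByteVector) _mm256i_byte.fromArray(shifted, 0);

Your 4-byte branch does this three times per 8-integer block, so the scalar spill/reload can easily dominate the vectorized work.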

Best Regards,
Sandhya


-----Original Message-----
From: panama-dev [mailto:panama-dev-bounces at openjdk.java.net] On Behalf Of Cleber Muramoto
Sent: Sunday, March 10, 2019 9:40 AM
To: panama-dev at openjdk.java.net
Subject: [vector api] VarInt Compression - Am I doing something wrong?

Hello guys, recently I've been playing with the Vector API in an attempt to
implement variable-length integer encoding.

Basically, what I'm doing is testing whether 8 contiguous integers simultaneously
fall into one of the 4 ranges [0, 0x80), [0x80, 0x4000), [0x4000, 0x200000),
or [0x200000, 0x10000000), so that (parts of) the following scalar code can be
vectorized:

public static int compressScalar(int[] src, byte[] dst, int soff, int doff, int slim) {
    for (int i = soff; i < slim; i++) {
        int v = src[i];

        if (v >>> 7 == 0) {
            // [0, 0x80): 1-byte frame
            dst[doff++] = (byte) v;
        } else if (v >>> 14 == 0) {
            // [0x80, 0x4000): 2-byte frame
            dst[doff++] = (byte) (v & 0x7F | 0x80);
            dst[doff++] = (byte) (v >>> 7);
        } else if (v >>> 21 == 0) {
            // [0x4000, 0x200000): 3-byte frame
            dst[doff++] = (byte) (v & 0x7F | 0x80);
            dst[doff++] = (byte) (v >>> 7 | 0x80);
            dst[doff++] = (byte) (v >>> 14);
        } else if (v >>> 28 == 0) {
            // [0x200000, 0x10000000): 4-byte frame
            dst[doff++] = (byte) (v & 0x7F | 0x80);
            dst[doff++] = (byte) (v >>> 7 | 0x80);
            dst[doff++] = (byte) (v >>> 14 | 0x80);
            dst[doff++] = (byte) (v >>> 21);
        } else {
            // everything else: 5-byte frame
            dst[doff++] = (byte) (v & 0x7F | 0x80);
            dst[doff++] = (byte) (v >>> 7 | 0x80);
            dst[doff++] = (byte) (v >>> 14 | 0x80);
            dst[doff++] = (byte) (v >>> 21 | 0x80);
            dst[doff++] = (byte) (v >>> 28);
        }
    }
    return doff;
}
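
As a quick sanity check of the encoding (a throwaway driver, not part of the benchmark): 300 = 0x12C falls in the 2-byte range, so it should come out as 0xAC 0x02:

    int[] src = { 300 };
    byte[] dst = new byte[5];
    int end = compressScalar(src, dst, 0, 0, 1);
    // end == 2; dst[0] == (byte) 0xAC (the low 7 bits, 0x2C, with the
    // continuation bit set), dst[1] == 0x02 (the remaining high bits)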


The corresponding vectorized code is an attempt to port a C implementation (
https://github.com/cmuramoto/panama-playground/blob/master/c/compressor.c)
and looks like this:

public static int compress(int[] src, byte[] dst, int soff, int doff) {
    int slim = (src.length - src.length % 8) - 8;
    int tail = (src.length - soff) % 8;

    while (soff <= slim) {
        var chunk = _mm256i.fromArray(src, soff, allTrue);

        if (chunk.lessThan(byte1_test).allTrue()) {
            // encode and write 8 bytes
        } else if (chunk.lessThan(byte2_test).allTrue()) {
            // encode and write 16 bytes
        } else if (chunk.lessThan(byte3_test).allTrue()) {
            // encode and write 24 bytes
        } else if (chunk.lessThan(byte4_test).allTrue()) {
            // byte 0 of each frame: low 7 bits plus the continuation bit
            var q0 = (ByteVector) chunk.and(0x7F).or(0x80).and(0xFF).reinterpret(_mm256i_byte);

            // byte 1: bits 7..13 plus the continuation bit, moved one lane over
            var q1 = (ByteVector) chunk.shiftR(7).or(0x80).and(0xFF).reinterpret(_mm256i_byte);
            q1 = q1.shiftER(1);

            // byte 2: bits 14..20 plus the continuation bit
            var q2 = (ByteVector) chunk.shiftR(14).or(0x80).and(0xFF).reinterpret(_mm256i_byte);
            q2 = q2.shiftER(2);

            // byte 3: bits 21..27, no continuation bit on the last byte
            var q3 = (ByteVector) chunk.shiftR(21).and(0xFF).reinterpret(_mm256i_byte);
            q3 = q3.shiftER(3);

            var store = q0.or(q1.or(q2.or(q3)));

            store.intoArray(dst, doff);

            doff += 32;
        } else {
            // 5-byte frames fall back to the scalar implementation
            doff = compressScalar(src, dst, soff, doff, soff + 8);
        }

        soff += 8;
    }

    if (tail != 0) {
        doff = compressScalar(src, dst, soff, doff, soff + tail);
    }

    return doff;
}

where byte1_test = IntVector.species(Shape.S_256_BIT).broadcast(0x80) and
so on.
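
For completeness, the full setup would look roughly like this (only byte1_test is spelled out above; the other bounds follow the ranges, and the species/mask declarations are my reconstruction of the snapshot API):

    // assumed setup; shown as locals with var, but declared as static
    // final fields in the real code
    var _mm256i = IntVector.species(Shape.S_256_BIT);
    var _mm256i_byte = ByteVector.species(Shape.S_256_BIT);
    var allTrue = _mm256i.maskAllTrue();

    var byte1_test = _mm256i.broadcast(0x80);        // 1-byte frames: v < 0x80
    var byte2_test = _mm256i.broadcast(0x4000);      // 2-byte frames: v < 0x4000
    var byte3_test = _mm256i.broadcast(0x200000);    // 3-byte frames: v < 0x200000
    var byte4_test = _mm256i.broadcast(0x10000000);  // 4-byte frames: v < 0x10000000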

In my tests, compressing N times the same 32 integers from the 4-byte category
([0x200000, 0x200000 + 32)), I got the following results (with openjdk
version "13-internal" 2019-09-17, vmargs:
--add-modules=jdk.incubator.vector):

Scalar:
N=100,000 (C: 6 ms, Java: 7 ms)
N=1,000,000 (C: 58 ms, Java: 65 ms)
N=10,000,000 (C: 577 ms, Java: 647 ms)

Vector:
N=100,000 (C: 1 ms, Java: 38 ms)
N=1,000,000 (C: 12 ms, Java: 319 ms)
N=10,000,000 (C: 115 ms, Java: 2790 ms)

(Roughly a 5x improvement in C and a 4x regression in Java.)

Since I couldn't figure out where the bottleneck lies, I wrote a variant
where the loads and tests are vector-based but the actual encoding of the
frames uses the scalar implementation. In this case, the numbers I got were:

Test Vector/Encode Scalar:
N=100,000 (Java: 16 ms)
N=1,000,000 (Java: 157 ms)
N=10,000,000 (Java: 1483 ms)

which is roughly 2x faster than the full vector version, meaning the
load/lessThan/allTrue operations seem to account for about half of the slowdown.
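
Roughly, that variant looked like this (a reconstruction from the description above, not the exact code I ran):

    public static int compressHybrid(int[] src, byte[] dst, int soff, int doff) {
        int slim = (src.length - src.length % 8) - 8;
        int tail = (src.length - soff) % 8;

        while (soff <= slim) {
            var chunk = _mm256i.fromArray(src, soff, allTrue);

            // same vector classification cascade as compress() ...
            if (chunk.lessThan(byte1_test).allTrue()
                    || chunk.lessThan(byte2_test).allTrue()
                    || chunk.lessThan(byte3_test).allTrue()
                    || chunk.lessThan(byte4_test).allTrue()) {
                // ... but every encode branch is replaced by the scalar impl
                doff = compressScalar(src, dst, soff, doff, soff + 8);
            } else {
                // 5-byte fallback, scalar as before
                doff = compressScalar(src, dst, soff, doff, soff + 8);
            }

            soff += 8;
        }

        if (tail != 0) {
            doff = compressScalar(src, dst, soff, doff, soff + tail);
        }

        return doff;
    }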

Are there any improvements that can be made to the code right now, or any
-XX:... option I'm missing?

Regards!

