[vector] AVX2 ByteVector.shiftR performance and semantics

Fri Feb 1 01:43:17 UTC 2019

> I noticed that in the current implementation ByteVector.shiftR is much slower than the equivalent logic using IntVector for AVX2. I don't fully understand the compiled code in the ByteVector case but most of the time is spent inside VectorIntrinsics::broadcastInt (see attachment):
> 
>   @Benchmark
>    public int shiftRByte() {
>      return ((ByteVector)YMM_LONG.fromArray(data, 0)
>              .rebracket(YMM_BYTE)).shiftR(4).and((byte)0x0F)
>              .get(0);
>    }
> 
>    @Benchmark
>    public int shiftRInt() {
>      return ((ByteVector)((IntVector)YMM_LONG.fromArray(data, 0)
>              .rebracket(YMM_INT)).shiftR(4).and(0x0F0F0F0F)
>              .rebracket(YMM_BYTE))
>              .get(0);
>    }
> 
> Benchmark                         (size)   Mode  Cnt    Score    Error   Units
> PopCount.shiftRByte         1024  thrpt    5   29.310 ±  0.680  ops/us
> PopCount.shiftRInt             1024  thrpt    5  257.261 ± 17.210  ops/us
> 
> It's not clear whether shiftR is a signed or unsigned shift. If it is an unsigned shift, I think it can be implemented efficiently in terms of ints, which already works very well.

If I fully understands your idea, you suggest to do shift ints and then 
mask away upper bits per byte. It should work well for shifts by a 
constant, but requires a table with masks if the number of positions to 
shift is not known in advance.

Anyway, I believe it should be faster than the current version for bytes.

> Currently, ByteVector.shiftR seems to preserve sign, at least in the interpreter. Printing
> 
> println(((ByteVector)YMM_LONG.fromArray(data, 0).rebracket(YMM_BYTE)));
> println(((ByteVector)YMM_LONG.fromArray(data, 0).rebracket(YMM_BYTE)).shiftR(4));
> println(((ByteVector)((IntVector)YMM_LONG.fromArray(data, 0).rebracket(YMM_INT)).shiftR(4).and(0x0F0F0F0F)
> .rebracket(YMM_BYTE)));
> 
> printed this:
> 
> [9, 15, -41, 21, 85, 116, -18, 41, 30, -8, -40, -96, 119, -75, 119, -85, -48, -28, -55, 84, 118, 7, -46, 15, -103, -37, 48, -52, 14, -32, -105, -84]
> [0, 0, -3, 1, 5, 7, -2, 2, 1, -1, -3, -6, 7, -5, 7, -6, -3, -2, -4, 5, 7, 0, -3, 0, -7, -3, 3, -4, 0, -2, -7, -6]
> [0, 0, 13, 1, 5, 7, 14, 2, 1, 15, 13, 10, 7, 11, 7, 10, 13, 14, 12, 5, 7, 0, 13, 0, 9, 13, 3, 12, 0, 14, 9, 10]
> 
> On the other hand, I saw that the shr instruction was used in the JIT compiled code.

JIT does the right thing.

shiftR() is meant to be unsigned right shift (>>>), but there's a bug in 
default implement for sub-word types (byte & short):

     (byte) (a >>> (i & 7))))

`a` is sign-extended to int, unsigned shift is performed, and narrowing 
cast cuts the upper part:

jshell> byte a = -1
  -1

jshell> a >>> 1
  2147483647

jshell> (byte)(a >>> 1)
  -1

The value should be zero-extended instead:

jshell> (byte)(a & 0xFF >>> 1)
  127

Best regards,
Vladimir Ivanov