[vector] AVX2 ByteVector.shiftR performance and semantics
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Fri Feb 1 23:55:40 UTC 2019
On 01/02/2019 14:15, Richard Startin wrote:
> That's exactly what I mean - apply a shift to the integer interpretation
> of each four bytes and mask out the high bits appropriately. Otherwise I
> think users will go and look at the Intel intrinsics list for the
> platform they deploy their applications on, find there is no native 8
> bit shift and be tempted to do it themselves at the cost of clarity of
> expression.
>
> Incidentally, to bikeshed, I had missed "aShiftR" entirely despite
> having been watching this API for several months, because it is not
> prefixed by "shift" and I use my IDE to explore interfaces by prefix.
> End user vector API code may end up so logically dense that terse method
> names may be a hindrance: "shiftRightLogical" vs "shiftR",
> "shiftElementsRight" vs "shiftER", "shiftRightArithmetic" vs "aShiftR"
> or "shiftAR".
Yes, I find the names misleading as well and fully agree it's worth to
consider alternatives.
(Every time I use those methods I have to refresh my memory about
different terminology - ">>"/">>>", shiftR/aShiftR, signed/unsigned).
Best regards,
Vladimir Ivanov
> ------------------------------------------------------------------------
> *From:* Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> *Sent:* 01 February 2019 01:43
> *To:* Richard Startin; panama-dev at openjdk.java.net
> *Subject:* Re: [vector] AVX2 ByteVector.shiftR performance and semantics
>
>> I noticed that in the current implementation ByteVector.shiftR is much slower than the equivalent logic using IntVector for AVX2. I don't fully understand the compiled code in the ByteVector case but most of the time is spent inside VectorIntrinsics::broadcastInt (see attachment):
>>
>> @Benchmark
>> public int shiftRByte() {
>> return ((ByteVector)YMM_LONG.fromArray(data, 0)
>> .rebracket(YMM_BYTE)).shiftR(4).and((byte)0x0F)
>> .get(0);
>> }
>>
>> @Benchmark
>> public int shiftRInt() {
>> return ((ByteVector)((IntVector)YMM_LONG.fromArray(data, 0)
>> .rebracket(YMM_INT)).shiftR(4).and(0x0F0F0F0F)
>> .rebracket(YMM_BYTE))
>> .get(0);
>> }
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> PopCount.shiftRByte 1024 thrpt 5 29.310 ± 0.680 ops/us
>> PopCount.shiftRInt 1024 thrpt 5 257.261 ± 17.210 ops/us
>>
>> It's not clear whether shiftR is a signed or unsigned shift. If it is an unsigned shift, I think it can be implemented efficiently in terms of ints, which already works very well.
>
> If I fully understands your idea, you suggest to do shift ints and then
> mask away upper bits per byte. It should work well for shifts by a
> constant, but requires a table with masks if the number of positions to
> shift is not known in advance.
>
> Anyway, I believe it should be faster than the current version for bytes.
>
>> Currently, ByteVector.shiftR seems to preserve sign, at least in the interpreter. Printing
>>
>> println(((ByteVector)YMM_LONG.fromArray(data, 0).rebracket(YMM_BYTE)));
>> println(((ByteVector)YMM_LONG.fromArray(data, 0).rebracket(YMM_BYTE)).shiftR(4));
>> println(((ByteVector)((IntVector)YMM_LONG.fromArray(data, 0).rebracket(YMM_INT)).shiftR(4).and(0x0F0F0F0F)
>> .rebracket(YMM_BYTE)));
>>
>> printed this:
>>
>> [9, 15, -41, 21, 85, 116, -18, 41, 30, -8, -40, -96, 119, -75, 119, -85, -48, -28, -55, 84, 118, 7, -46, 15, -103, -37, 48, -52, 14, -32, -105, -84]
>> [0, 0, -3, 1, 5, 7, -2, 2, 1, -1, -3, -6, 7, -5, 7, -6, -3, -2, -4, 5, 7, 0, -3, 0, -7, -3, 3, -4, 0, -2, -7, -6]
>> [0, 0, 13, 1, 5, 7, 14, 2, 1, 15, 13, 10, 7, 11, 7, 10, 13, 14, 12, 5, 7, 0, 13, 0, 9, 13, 3, 12, 0, 14, 9, 10]
>>
>> On the other hand, I saw that the shr instruction was used in the JIT compiled code.
>
> JIT does the right thing.
>
> shiftR() is meant to be unsigned right shift (>>>), but there's a bug in
> default implement for sub-word types (byte & short):
>
> (byte) (a >>> (i & 7))))
>
> `a` is sign-extended to int, unsigned shift is performed, and narrowing
> cast cuts the upper part:
>
> jshell> byte a = -1
> -1
>
> jshell> a >>> 1
> 2147483647
>
> jshell> (byte)(a >>> 1)
> -1
>
> The value should be zero-extended instead:
>
> jshell> (byte)(a & 0xFF >>> 1)
> 127
>
> Best regards,
> Vladimir Ivanov
More information about the panama-dev
mailing list