Vector API Questions

Wed Sep 8 16:30:06 UTC 2021

The approach I often use is JMH with the perfasm or dtraceasm profiling option.

If you are running a debug build you can use -XX:+TraceNewVectors or -XX:+PrintIntrinsics (the latter will show failures to make intrinsic vector operations).

Ideally, developers should not need to ask the question you are asking, assuming hardware support. But it's a common one because we are still working on the implementation and filling in the performance gaps (apart from optimizing vector operations, we need to look at inlining and vector calling conventions).

In the absence of hardware support it would be helpful if we could somehow inform developers more reliably.

Paul.
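
A minimal, JDK-only sketch for sanity-checking output before benchmarking: it runs the byte-wise loop from the quoted thread below next to an index-based variant and compares the results. The class name WidenSketch, the method names, and the reading of "scalarAbsolute" / "scalarRelative" as absolute vs. relative ByteBuffer access are assumptions made for illustration; the actual benchmark source is in the gist linked in the thread.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper class; not taken from the thread or the gist.
class WidenSketch {

    // Relative access: the loop from the thread. Advances both buffer positions.
    static void widenRelative(ByteBuffer src, ByteBuffer dst) {
        while (src.hasRemaining()) {
            dst.putShort((short) (src.get() << 8));
        }
    }

    // Absolute access (assumed meaning of "scalarAbsolute"): indexed get/put,
    // leaving both buffer positions untouched.
    static void widenAbsolute(ByteBuffer src, ByteBuffer dst) {
        int n = src.remaining();
        int s = src.position();
        int d = dst.position();
        for (int i = 0; i < n; i++) {
            dst.putShort(d + 2 * i, (short) (src.get(s + i) << 8));
        }
    }

    public static void main(String[] args) {
        byte[] input = {0x01, 0x7F, (byte) 0x80, (byte) 0xFF};
        ByteBuffer a = ByteBuffer.allocate(input.length * 2);
        ByteBuffer b = ByteBuffer.allocate(input.length * 2);
        widenRelative(ByteBuffer.wrap(input), a);
        widenAbsolute(ByteBuffer.wrap(input), b);
        // Compare the backing arrays so the differing positions do not matter.
        System.out.println(Arrays.equals(a.array(), b.array())); // prints "true"
    }
}
```

A vectorized version can be validated the same way, by comparing its destination buffer with the scalar result before looking at timings.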

> On Sep 7, 2021, at 3:47 PM, Scott Palmer <swpalmer at gmail.com> wrote:
> 
> Thanks for the response.  I just tried my code with the 17-ea+35 build and the world makes sense again.  The vector code is about 10x faster than the simple byte-wise loop.
> 
> Are there any VM flags that will help me know if I’m using an operation that isn’t optimized?  As my algorithms get more complicated a step in the middle may be unoptimized but the faster operations around it may make that harder to detect.
> 
> Regards,
> 
> Scott
> 
> 
>> On Sep 7, 2021, at 3:39 PM, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>> 
>> Hi Scott,
>> 
>> Many of the more “exotic” vector operations (conversions and rearranges) were not optimized in 16, but we made great progress for the soon-to-be-released 17.
>> 
>> I wrote a benchmark based on your code and ran it against a build of the vectorIntrinsics branch (https://github.com/openjdk/panama-vector/tree/vectorIntrinsics):
>> 
>> https://gist.github.com/PaulSandoz/1da0b1fefd177eb4bdcb743b70851059
>> 
>> Benchmark                            (size)  Mode  Cnt     Score    Error  Units
>> BufferByteToShort.scalarAbsolute       1024  avgt   10   676.276 ±  2.455  ns/op
>> BufferByteToShort.scalarRelative       1024  avgt   10  1896.754 ± 40.293  ns/op
>> BufferByteToShort.vectorConvert        1024  avgt   10   108.458 ±  0.622  ns/op
>> BufferByteToShort.vectorShuffleMask    1024  avgt   10   154.176 ±  0.592  ns/op
>> BufferByteToShort.vectorShuffleZero    1024  avgt   10  1572.366 ±  8.717  ns/op
>> 
>> I wrote this very quickly so I hope I got it right.
>> 
>> There is still some work required to optimize shuffles with exceptional indexes, but overall we are making good progress.
>> 
>> Paul.
>> 
>>> On Sep 5, 2021, at 5:20 PM, Scott Palmer <swpalmer at gmail.com> wrote:
>>> 
>>> Yes, a shuffle works for the operation I was looking for.  However, my implementation of the basic 8-bit -> 16-bit conversion loop from my original email is even slower when I use a shuffle instead of converting to shorts and shifting.
>>> 
>>> My basic loop to convert from bytes to shorts by shifting byte values to the high byte of the short is 70 times faster without using the Vector API in that case!
>>> 
>>> I have to wonder if I’m doing something horribly wrong.
>>> 
>>> It should be possible for this simple loop:
>>> 
>>>  void widen(ByteBuffer srcBuff , ByteBuffer dstBuff ) {
>>>        while (srcBuff.hasRemaining()) {
>>>            dstBuff.putShort((short) (srcBuff.get() << 8));
>>>        }
>>>  }
>>> 
>>> to go much faster when using vector instructions that operate on 32 bytes at a time.
>>> 
>>> But this is 70x slower:
>>> 
>>>  int si = 0;
>>>  int di = 0;
>>>  final int loopBound = ByteVector.SPECIES_PREFERRED.loopBound(srcBuff.remaining());
>>>  for (; si < loopBound; si += BYTE_SPECIES_LENGTH) {
>>>      ByteVector srcVec = ByteVector.fromByteBuffer(ByteVector.SPECIES_PREFERRED, srcBuff, si, NATIVE_ORDER);
>>>      srcVec.rearrange(EXPANDING_SHUFFLE_LO, ZERO).intoByteBuffer(dstBuff, di, NATIVE_ORDER);
>>>      di += BYTE_SPECIES_LENGTH;
>>>      srcVec.rearrange(EXPANDING_SHUFFLE_HI, ZERO).intoByteBuffer(dstBuff, di, NATIVE_ORDER);
>>>      di += BYTE_SPECIES_LENGTH;
>>>   }
>>> 
>>> 
>>> Both vector versions I’ve tried are at least an order of magnitude slower than the scalar code.  Am I doing something wrong?
>>> 
>>> Thanks,
>>> 
>>> Scott
>>> 
>>>> On Sep 5, 2021, at 6:27 PM, John Rose <john.r.rose at oracle.com> wrote:
>>>> 
>>>> I think those are hardware specialized shuffles, right? So try a shuffle. We should try to recognize fixed shuffles that can be handled by special instructions. That’s the basic approach. It doesn’t need a new primitive but maybe some convenience functions and better instruction selection. 
>>>> 
>>>>> On Sep 5, 2021, at 1:09 PM, Scott Palmer <swpalmer at gmail.com> wrote:
>>>>> 
>>>>> Is this list appropriate for questions involving the Vector API?  (I scanned the list at https://mail.openjdk.java.net/mailman/listinfo but didn’t see anything)
>>>>> 
>>>>> E.g. questions such as, 
>>>>> 
>>>>> Are there plans to support operation XXX?  
>>>>> 
>>>>> I’m looking for something like: https://www.felixcloutier.com/x86/punpcklbw:punpcklwd:punpckldq:punpcklqdq
>>>>> 
>>>>> Or performance related queries?
>>>>> 
>>>>> I tried to convert this simple operation to use the Vector API (on 2MB source buffers) and the result I got was 15x slower than this byte-wise loop:
>>>>> 
>>>>> void widen(ByteBuffer srcBuff , ByteBuffer dstBuff ) {
>>>>>        while (srcBuff.hasRemaining()) {
>>>>>            dstBuff.putShort((short) (srcBuff.get() << 8));
>>>>>        }
>>>>> }
>>>>> 
>>>>> 
>>>>> My attempt using JDK 16 on macOS (Intel) looked like:
>>>>> 
>>>>> final int BYTE_PREFERRED_SPECIES_LENGTH = ByteVector.SPECIES_PREFERRED.length();
>>>>> final int loopBound = ByteVector.SPECIES_PREFERRED.loopBound(srcBuff.remaining());
>>>>> int si = 0;
>>>>> int di = 0;
>>>>> for (; si < loopBound; si += BYTE_PREFERRED_SPECIES_LENGTH) {
>>>>>            ByteVector srcVec = ByteVector.fromByteBuffer(ByteVector.SPECIES_PREFERRED, srcBuff, si, NATIVE_ORDER);
>>>>>            srcVec.convert(VectorOperators.B2S, 0)
>>>>>                    .lanewise(VectorOperators.LSHL, 8)
>>>>>                    .intoByteBuffer(dstBuff, di, NATIVE_ORDER);
>>>>>            di += BYTE_PREFERRED_SPECIES_LENGTH;
>>>>> 
>>>>>            srcVec.convert(VectorOperators.B2S, 1)
>>>>>                    .lanewise(VectorOperators.LSHL, 8)
>>>>>                    .intoByteBuffer(dstBuff, di, NATIVE_ORDER);
>>>>>            di += BYTE_PREFERRED_SPECIES_LENGTH;
>>>>> }
>>>>> 
>>>>> If this is the wrong place for these kinds of questions, please point me in the right direction.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Scott
>>> 
>> 
>