Vector API Questions

Mon Sep 6 00:20:27 UTC 2021

Yes, a shuffle works for the operation I was looking for.  However, when I reimplemented the basic 8-bit -> 16-bit conversion loop from my original email using a shuffle, it was even slower than the version that converted to shorts and shifted.

In that case, my basic scalar loop, which converts bytes to shorts by shifting each byte value into the high byte of the short, is 70 times faster than the Vector API version!

I have to wonder if I’m doing something horribly wrong.

It should be possible for this simple loop:

    void widen(ByteBuffer srcBuff, ByteBuffer dstBuff) {
        while (srcBuff.hasRemaining()) {
            dstBuff.putShort((short) (srcBuff.get() << 8));
        }
    }

to go much faster when using vector instructions that operate on 32 bytes at a time.
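
As a reference point for checking any vectorized version, here is a minimal, self-contained sketch of the scalar loop run on a small buffer. The class name WidenReference and the helper widenToArray are hypothetical, introduced only for illustration, and the buffers are assumed to use ByteBuffer's default big-endian order:

```java
import java.nio.ByteBuffer;

// Hypothetical wrapper class, for illustration only.
class WidenReference {
    // Same scalar loop as above: each source byte becomes the high byte of a short.
    static void widen(ByteBuffer srcBuff, ByteBuffer dstBuff) {
        while (srcBuff.hasRemaining()) {
            dstBuff.putShort((short) (srcBuff.get() << 8));
        }
    }

    // Hypothetical helper: widens a byte array and returns the resulting shorts.
    static short[] widenToArray(byte[] src) {
        ByteBuffer srcBuff = ByteBuffer.wrap(src);
        ByteBuffer dstBuff = ByteBuffer.allocate(src.length * 2);
        widen(srcBuff, dstBuff);
        dstBuff.flip();                       // switch the destination from writing to reading
        short[] out = new short[src.length];
        dstBuff.asShortBuffer().get(out);     // read back with the same byte order used to write
        return out;
    }
}
```

Comparing the output of a vectorized loop against widenToArray on the same input is one way to confirm the two produce identical results before comparing their speed.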

But this is 70x slower:

    int si = 0;
    int di = 0;
    final int loopBound = ByteVector.SPECIES_PREFERRED.loopBound(srcBuff.remaining());
    for (; si < loopBound; si += BYTE_SPECIES_LENGTH) {
        ByteVector srcVec = ByteVector.fromByteBuffer(ByteVector.SPECIES_PREFERRED, srcBuff, si, NATIVE_ORDER);
        srcVec.rearrange(EXPANDING_SHUFFLE_LO, ZERO).intoByteBuffer(dstBuff, di, NATIVE_ORDER);
        di += BYTE_SPECIES_LENGTH;
        srcVec.rearrange(EXPANDING_SHUFFLE_HI, ZERO).intoByteBuffer(dstBuff, di, NATIVE_ORDER);
        di += BYTE_SPECIES_LENGTH;
     }
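
The shuffle constants EXPANDING_SHUFFLE_LO and EXPANDING_SHUFFLE_HI are not shown above. A rough sketch of how their index tables might be laid out follows, emulated with plain arrays so that it runs without the incubator module. It assumes little-endian native order, and it assumes that a negative (exceptional) index selects from the second vector after adding the vector length, which is my reading of the two-input rearrange documentation in JDK 16. The class and method names are hypothetical:

```java
// Hypothetical sketch: emulates a two-input rearrange with plain arrays.
class ExpandingShuffleSketch {
    // Builds the index table for one half of the source vector.
    // Output lane 2k is a zero byte and output lane 2k+1 is source lane (base + k),
    // which is the little-endian byte layout of (short) (b << 8).
    static int[] expandingIndexes(int vlen, boolean highHalf) {
        int base = highHalf ? vlen / 2 : 0;
        int[] idx = new int[vlen];
        for (int n = 0; n < vlen; n++) {
            idx[n] = (n % 2 == 0) ? -1 : base + n / 2;  // -1: take from the zero vector
        }
        return idx;
    }

    // Array emulation of rearrange(shuffle, zero): a negative index is wrapped
    // by adding the vector length and then used to pick from 'zero'.
    static byte[] rearrange(byte[] src, int[] idx, byte[] zero) {
        byte[] out = new byte[src.length];
        for (int n = 0; n < out.length; n++) {
            int i = idx[n];
            out[n] = (i >= 0) ? src[i] : zero[i + src.length];
        }
        return out;
    }
}
```

With a 32-byte species, each of the two tables would fill one 32-byte output vector, which matches the two stores per iteration in the loop above.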

Both vector versions I’ve tried are at least an order of magnitude slower than the scalar code.  Am I doing something wrong?

Thanks,

Scott

> On Sep 5, 2021, at 6:27 PM, John Rose <john.r.rose at oracle.com> wrote:
> 
> I think those are hardware specialized shuffles, right? So try a shuffle. We should try to recognize fixed shuffles that can be handled by special instructions. That’s the basic approach. It doesn’t need a new primitive but maybe some convenience functions and better instruction selection. 
> 
>> On Sep 5, 2021, at 1:09 PM, Scott Palmer <swpalmer at gmail.com> wrote:
>> 
>> Is this list appropriate for questions involving the Vector API?  (I scanned the list at https://mail.openjdk.java.net/mailman/listinfo but didn’t see anything)
>> 
>> E.g. questions such as, 
>> 
>> Are there plans to support operation XXX?  
>> 
>> I’m looking for something like: https://www.felixcloutier.com/x86/punpcklbw:punpcklwd:punpckldq:punpcklqdq
>> 
>> Or performance related queries?
>> 
>> I tried to convert this simple operation to use the Vector API (on 2MB source buffers) and the result I got was 15x slower than this byte-wise loop:
>> 
>>    void widen(ByteBuffer srcBuff , ByteBuffer dstBuff ) {
>>           while (srcBuff.hasRemaining()) {
>>               dstBuff.putShort((short) (srcBuff.get() << 8));
>>           }
>>   }
>> 
>> 
>> My attempt using JDK 16 on macOS (Intel) looked like:
>> 
>>   final int BYTE_PREFERRED_SPECIES_LENGTH = ByteVector.SPECIES_PREFERRED.length();
>>   final int loopBound = ByteVector.SPECIES_PREFERRED.loopBound(srcBuff.remaining());
>>   int si = 0;
>>   int di = 0;
>>   for (; si < loopBound; si += BYTE_PREFERRED_SPECIES_LENGTH) {
>>               ByteVector srcVec = ByteVector.fromByteBuffer(ByteVector.SPECIES_PREFERRED, srcBuff, si, NATIVE_ORDER);
>>               srcVec.convert(VectorOperators.B2S, 0)
>>                       .lanewise(VectorOperators.LSHL, 8)
>>                       .intoByteBuffer(dstBuff, di, NATIVE_ORDER);
>>               di += BYTE_PREFERRED_SPECIES_LENGTH;
>> 
>>               srcVec.convert(VectorOperators.B2S, 1)
>>                       .lanewise(VectorOperators.LSHL, 8)
>>                       .intoByteBuffer(dstBuff, di, NATIVE_ORDER);
>>               di += BYTE_PREFERRED_SPECIES_LENGTH;
>>   }
>> 
>> If this is the wrong place for these kinds of questions, please point me in the right direction.
>> 
>> Thanks,
>> 
>> Scott