Replicating __mm256_shuffle_epi8 Intrinsic

Paul Sandoz paul.sandoz at oracle.com
Mon Jul 19 21:41:59 UTC 2021


Possibly. We call them “snowflakes” because they are special (the same goes for AES-based instructions). The current focus is on more general operations that apply across all vectors, or across categories of vectors (integral or floating point), and across CPU architectures.
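For readers following the shuffle question in the subject line, here is a scalar sketch of what _mm256_shuffle_epi8 computes, so that results from a general operation such as rearrange() can be checked against it. It is a reference model only, not how the Vector API or HotSpot implements anything. The class and method names are made up for illustration. The input values and the control bytes are the ones quoted later in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper class; none of these names come from the thread.
class ShuffleEpi8Model {

    // Scalar model of _mm256_shuffle_epi8: each output byte is picked from the
    // same 128-bit half of src, using the low 4 bits of the control byte.
    // A control byte with its top bit set produces zero.
    static byte[] shuffleEpi8(byte[] src, byte[] ctrl) {
        byte[] dst = new byte[32];
        for (int i = 0; i < 32; i++) {
            int laneBase = i & 16; // 0 for the low half, 16 for the high half
            dst[i] = (ctrl[i] & 0x80) != 0 ? 0 : src[laneBase + (ctrl[i] & 0x0F)];
        }
        return dst;
    }

    // Reads eight ints from 32 bytes in the given byte order.
    static int[] toInts(byte[] bytes, ByteOrder order) {
        int[] out = new int[8];
        ByteBuffer.wrap(bytes).order(order).asIntBuffer().get(out);
        return out;
    }

    // The eight ints quoted in the thread, laid out little-endian in memory.
    static byte[] threadInput() {
        ByteBuffer buf = ByteBuffer.allocate(32).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < 4; i++) {
            buf.putInt(1684234849).putInt(1886350957);
        }
        return buf.array();
    }

    // The control bytes quoted in the thread, repeated for both 128-bit halves.
    static byte[] threadControl() {
        byte[] half = {12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3};
        byte[] ctrl = new byte[32];
        System.arraycopy(half, 0, ctrl, 0, 16);
        System.arraycopy(half, 0, ctrl, 16, 16);
        return ctrl;
    }

    public static void main(String[] args) {
        byte[] shuffled = shuffleEpi8(threadInput(), threadControl());
        // Prints 1835954032 and 1633837924 alternating, four times each.
        System.out.println(Arrays.toString(toInts(shuffled, ByteOrder.BIG_ENDIAN)));
    }
}
```

One caveat, as far as I understand the API: the intrinsic only selects within each 128-bit half, whereas rearrange() indexes across the whole vector. A 256-bit VectorShuffle that repeats indices 0..15 in its upper half therefore reads from the lower half. That happens to give the same answer for the data in this thread because both halves are identical, but for arbitrary data the upper-half indices would presumably need an offset of 16.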

Paul.

> On Jul 10, 2021, at 8:34 PM, Michael Ennen <mike.ennen at gmail.com> wrote:
> 
> I finally realized that these multi-buffer implementations are used to
> compute 8 different hashes at the same time, not to compute 8 blocks of one
> digest at the same time.
> 
> Thus my original intention for exploring this, to speed up processing a
> single SHA-256 hash, is moot.
> 
> That leads me to my next question:
> 
> Will the vector API expose intrinsics such as the following:
> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=sha256&expand=6206,6204,6205,6206
> 
> Or is that out of scope for the Vector API?
> 
> Thank you.
> 
> On Wed, Jul 7, 2021 at 2:47 PM Michael Ennen <mike.ennen at gmail.com> wrote:
> 
>> So far I have tried to copy the upstream code verbatim until I get it to
>> match the results - however I am interested in what you're suggesting. How
>> would that be done?
>> 
>> On Wed, Jul 7, 2021 at 3:03 AM Radosław Smogura <mail at smogura.eu> wrote:
>> 
>>> Michael,
>>> 
>>> I also wonder whether you considered using shuffles and vector operations
>>> to load the int vector, instead of using bytesToIntLE. Perhaps loading into
>>> two vectors initially and permuting with a shuffle would be better.
>>> 
>>> Kind regards,
>>> Rado
>>> ------------------------------
>>> *From:* Michael Ennen <mike.ennen at gmail.com>
>>> *Sent:* Wednesday, July 7, 2021 08:17
>>> *To:* Radosław Smogura <mail at smogura.eu>
>>> *Cc:* Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>>> panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>> *Subject:* Re: Replicating __mm256_shuffle_epi8 Intrinsic
>>> 
>>> I figured it out. It was a mismatch of little and big endian. I can
>>> reproduce the same shuffle result with:
>>> 
>>> IntVector read8(byte[] chunk, int offset) {
>>>    System.out.println("read8, offset = " + offset);
>>>    IntVector ret = IntVector.fromArray(SPECIES_256, new int[] {
>>>            bytesToIntLE(chunk, 0 + offset),
>>>            bytesToIntLE(chunk, 64 + offset),
>>>            bytesToIntLE(chunk, 128 + offset),
>>>            bytesToIntLE(chunk, 192 + offset),
>>>            bytesToIntLE(chunk, 256 + offset),
>>>            bytesToIntLE(chunk, 320 + offset),
>>>            bytesToIntLE(chunk, 384 + offset),
>>>            bytesToIntLE(chunk, 448 + offset)}, 0);
>>>    System.out.println("read8 in: " + bytesToIntLE(chunk, 0 + offset) +
>>> ", " + bytesToIntLE(chunk, 64 + offset) +
>>>            ", " + bytesToIntLE(chunk, 128 + offset) + ", " +
>>> bytesToIntLE(chunk, 192 + offset) + ", " +
>>>            bytesToIntLE(chunk, 256 + offset) + ", " +
>>> bytesToIntLE(chunk, 320 + offset) + ", " +
>>>            bytesToIntLE(chunk, 384 + offset) + ", " +
>>> bytesToIntLE(chunk, 448 + offset));
>>> 
>>>    var shuffle = VectorShuffle.fromArray(ByteVector.SPECIES_256, new
>>> int[]{
>>>            12,13,14,15,   8, 9,10,11,
>>>            4, 5, 6, 7,    0, 1, 2, 3,
>>>            12,13,14,15,   8, 9,10,11,
>>>            4, 5, 6, 7,    0, 1, 2, 3 }, 0);
>>> 
>>>    ByteVector shuffled = ret.reinterpretAsBytes().rearrange(shuffle,
>>> shuffle.laneIsValid());
>>> 
>>>    System.out.println("read8 after shuffle: " +
>>> IntVector.fromByteArray(SPECIES_256, shuffled.toArray(), 0,
>>> ByteOrder.BIG_ENDIAN));
>>>    return IntVector.fromByteArray(SPECIES_256, shuffled.toArray(), 0,
>>> ByteOrder.BIG_ENDIAN );
>>> }
>>> 
>>> Thanks for all your help.
>>> 
>>> On Tue, Jul 6, 2021 at 11:10 PM Michael Ennen <mike.ennen at gmail.com>
>>> wrote:
>>> 
>>> Oh my gosh how embarrassing! I have been tweaking things so much in this
>>> code I really needed to step back and take a closer look.
>>> 
>>> I still don't get the right result (matching this:
>>> https://github.com/brcolow/bitcoin-sha256/blob/master/src/sha256_avx2.cpp#L70
>>> ).
>>> 
>>> I will keep trying, though.
>>> 
>>> On Tue, Jul 6, 2021 at 2:48 PM Radosław Smogura <mail at smogura.eu> wrote:
>>> 
>>> Hi Michael,
>>> 
>>> intoArray is not only for VectorShuffle, and I think it's the preferred way
>>> to load and store data (as Sandhya used).
>>> 
>>> Maybe this sounds too simple, but I wonder if you are absolutely sure
>>> that this line should look like this, and that it should not print the
>>> shuffled vector? System.out.println("read8 returns: " + ret); :)
>>> 
>>> Kind regards,
>>> Rado
>>> ------------------------------
>>> *From:* Michael Ennen <mike.ennen at gmail.com>
>>> *Sent:* Tuesday, July 6, 2021 21:29
>>> *To:* Radosław Smogura <mail at smogura.eu>
>>> *Cc:* Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>>> panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>> *Subject:* Re: Replicating __mm256_shuffle_epi8 Intrinsic
>>> 
>>> I am a bit confused by your example code. selectFrom returns a `Vector`,
>>> but then you are calling `intoArray` which is a method only for
>>> `VectorShuffle`.
>>> 
>>> In addition to that - do you think you could use variable names from the
>>> example for clarity:
>>> 
>>> 
>>> https://github.com/brcolow/vector-sha256/blob/master/src/main/java/com/brcolow/vectorsha256/VectorSHA256.java#L450
>>> 
>>> Thank you very much.
>>> 
>>> On Tue, Jul 6, 2021 at 8:23 AM Radosław Smogura <mail at smogura.eu> wrote:
>>> 
>>> Hi Michael,
>>> 
>>> Shuffling can be problematic sometimes.
>>> 
>>> I wonder if you tried something like this
>>> 
>>> byteSwap = VectorShuffle.fromArray(BYTE_VECTOR_SPECIES, shuffleArr, 0);
>>> 
>>> final var byteSwapVector = byteSwap.toVector();
>>> 
>>> final var srcVector =  ByteVector.fromArray(BYTE_VECTOR_SPECIES, src, i);
>>> final var dstVector = byteSwapVector.selectFrom(srcVector);
>>> 
>>> dstVector.intoArray(dst, i);
>>> 
>>> Kind regards,
>>> Rado
>>> 
>>> ------------------------------
>>> *From:* panama-dev <panama-dev-retn at openjdk.java.net> on behalf of
>>> Michael Ennen <mike.ennen at gmail.com>
>>> *Sent:* Tuesday, July 6, 2021 06:54
>>> *To:* Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>>> *Cc:* panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>>> *Subject:* Re: Replicating __mm256_shuffle_epi8 Intrinsic
>>> 
>>> I understand that representing the input and output as 32-bit integers is
>>> kind of confusing, but the point is that the shuffle as written isn't
>>> doing anything. I have also tried:
>>> 
>>>            var shuffle = VectorShuffle.fromArray(ByteVector.SPECIES_256,
>>> new int[]{
>>>                    12,13,14,15,   8, 9,10,11,
>>>                    4, 5, 6, 7,    0, 1, 2, 3,
>>>                    12,13,14,15,   8, 9,10,11,
>>>                    4, 5, 6, 7,    0, 1, 2, 3 }, 0);
>>> 
>>> But still the returned vector is the same.
>>> 
>>> On Sun, Jul 4, 2021 at 10:14 PM Michael Ennen <mike.ennen at gmail.com>
>>> wrote:
>>> 
>>>> Thanks for your assistance. I am having trouble replicating the same
>>>> results from the first shuffle done in Bitcoin's SHA-AVX2:
>>>> 
>>>> The following 8 integers are read in:
>>>> 
>>>> Read8: 1684234849, 1886350957, 1684234849, 1886350957, 1684234849,
>>>> 1886350957, 1684234849, 1886350957
>>>> 
>>>> These 8 integers are shuffled with _mm256_shuffle_epi8 and the result is:
>>>> 
>>>> 1835954032, 1633837924, 1835954032, 1633837924, 1835954032, 1633837924,
>>>> 1835954032, 1633837924
>>>> 
>>>> But using your suggested code:
>>>> 
>>>> var shuffle = VectorShuffle.fromOp(ByteVector.SPECIES_256, (i ->
>>>> ((8+i)%16)));
>>>> ByteVector shuffled = ret.reinterpretAsBytes().rearrange(shuffle,
>>>> shuffle.laneIsValid());
>>>> return IntVector.fromByteArray(SPECIES_256, shuffled.toArray(), 0,
>>>> ByteOrder.LITTLE_ENDIAN);
>>>> 
>>>> I get:
>>>> 
>>>> 1684234849, 1886350957, 1684234849, 1886350957, 1684234849, 1886350957,
>>>> 1684234849, 1886350957
>>>> 
>>>> That is, the numbers don't seem to be changed.
>>>> 
>>>> Thanks for your help.
>>>> 
>>>> On Thu, Jul 1, 2021 at 11:18 AM Viswanathan, Sandhya <
>>>> sandhya.viswanathan at intel.com> wrote:
>>>> 
>>>>> Hi Michael,
>>>>> 
>>>>> The rearrange() api should generate pshufb.
>>>>> 
>>>>> e.g. for the following Java code:
>>>>> 
>>>>>   static final int SIZE = 1024;
>>>>>   static byte[] a = new byte[SIZE];
>>>>>   static byte[] r = new byte[SIZE];
>>>>> 
>>>>>   static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;
>>>>>   static final VectorShuffle<Byte> HIGHTOLOW =
>>>>> VectorShuffle.fromOp(SPECIES, (i -> ((8+i)%16)));
>>>>> 
>>>>>   static void workload() {
>>>>>       VectorShuffle<Byte> vshuf = HIGHTOLOW;
>>>>> 
>>>>>       for (int i = 0; i <= a.length - SPECIES.length(); i +=
>>>>> SPECIES.length()) {
>>>>>           var av = ByteVector.fromArray(SPECIES, a, i);
>>>>>           var bv = av.rearrange(vshuf);
>>>>>           bv.intoArray(r, i);
>>>>>       }
>>>>>   }
>>>>> 
>>>>> We generate the following code for the loop:
>>>>> 0x00007fc388fa3180:   vmovdqu 0x10(%rsi),%xmm1
>>>>> 0x00007fc388fa3185:   vmovdqu 0x10(%r14),%xmm2
>>>>> 0x00007fc388fa318b:   movslq %eax,%r10
>>>>> 0x00007fc388fa318e:   vmovdqu 0x10(%rbp,%r10,1),%xmm3
>>>>> 0x00007fc388fa3195:   vpcmpgtb %xmm2,%xmm1,%xmm1
>>>>> 0x00007fc388fa3199:   vptest %xmm0,%xmm1
>>>>> 0x00007fc388fa319e:   setne  %r13b
>>>>> 0x00007fc388fa31a2:   movzbl %r13b,%r13d
>>>>> 0x00007fc388fa31a6:   test   %r13d,%r13d
>>>>> 0x00007fc388fa31a9:   jne    0x00007fc388fa31e2
>>>>> 0x00007fc388fa31ab:   vpshufb %xmm2,%xmm3,%xmm3
>>>>> 0x00007fc388fa31b0:   vmovdqu %xmm3,0x10(%r8,%r10,1)
>>>>> 0x00007fc388fa31b7:   add    $0x10,%eax
>>>>> 0x00007fc388fa31ba:   cmp    %ebx,%eax
>>>>> 0x00007fc388fa31bc:   jl     0x00007fc388fa3180
>>>>> 
>>>>> Best Regards,
>>>>> Sandhya
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: panama-dev <panama-dev-retn at openjdk.java.net> On Behalf Of Michael Ennen
>>>>> Sent: Tuesday, June 29, 2021 11:20 PM
>>>>> To: panama-dev at openjdk.java.net
>>>>> Subject: Replicating __mm256_shuffle_epi8 Intrinsic
>>>>> 
>>>>> I am trying to implement SHA-256 using the new Java Vector API.
>>>>> 
>>>>> I have read the API docs but crossing the large mental gap of SIMD
>>>>> instructions to the API for someone who knows very little SIMD has been
>>>>> insurmountable for me.
>>>>> 
>>>>> My question has been asked on Stack Overflow:
>>>>> 
>>>>> 
>>>>> 
>>>>> https://stackoverflow.com/questions/68135596/replicating-mm256-shuffle-epi8-intrinsic-with-java-vector-api-shuffle
>>>>> 
>>>>> It is quite a simple (to ask anyway) question, which is, how to replicate
>>>>> the _mm256_shuffle_epi8 intrinsic with the Java Vector API?
>>>>> 
>>>>> Thanks very much.
>>>>> 
>>>>> --
>>>>> Michael Ennen
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Michael Ennen
>>>> 
>>> 
>>> 
>>> --
>>> Michael Ennen
>>> 
>> 
>> 
>> --
>> Michael Ennen
>> 
> 
> 
> -- 
> Michael Ennen


