Replicating __mm256_shuffle_epi8 Intrinsic
Michael Ennen
mike.ennen at gmail.com
Sun Jul 11 03:34:57 UTC 2021
I finally realized that these multi-buffer implementations are used to
compute 8 different hashes at the same time, not to compute 8 blocks of one
digest at the same time.
So my original motivation for exploring this, speeding up the computation of
a single SHA-256 hash, is moot.
That leads me to my next question:
Will the Vector API expose intrinsics such as the following:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=sha256&expand=6206,6204,6205,6206
Or is that out of scope for the Vector API?
Thank you.
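
A minimal sketch of the multi-buffer layout described at the top of this message, written in plain Java rather than the Vector API so that it runs without the incubator module. It assumes eight independent 64-byte blocks stored back to back, which matches the 0, 64, ..., 448 offsets used by read8 further down the thread. The names MultiBufferLayout, wordBE and gatherWord are made up for illustration and do not come from the thread or from any library.

```java
// Illustrative sketch only: names and layout are assumptions, not code from the thread.
final class MultiBufferLayout {
    static final int LANES = 8;        // eight independent messages
    static final int BLOCK_BYTES = 64; // one SHA-256 block per message

    // Reads a 32-bit big-endian word, the byte order SHA-256 uses for message words.
    static int wordBE(byte[] buf, int pos) {
        return ((buf[pos] & 0xff) << 24)
             | ((buf[pos + 1] & 0xff) << 16)
             | ((buf[pos + 2] & 0xff) << 8)
             |  (buf[pos + 3] & 0xff);
    }

    // Gathers message word `wordIndex` (0..15) of every block into one 8-lane group.
    // Lane k only ever holds data from message k, so the lanes never mix.
    static int[] gatherWord(byte[] eightBlocks, int wordIndex) {
        int[] lanes = new int[LANES];
        for (int lane = 0; lane < LANES; lane++) {
            lanes[lane] = wordBE(eightBlocks, lane * BLOCK_BYTES + 4 * wordIndex);
        }
        return lanes;
    }

    public static void main(String[] args) {
        byte[] blocks = new byte[LANES * BLOCK_BYTES];
        for (int i = 0; i < blocks.length; i++) {
            blocks[i] = (byte) i; // dummy data so each lane is recognisable
        }
        for (int w : gatherWord(blocks, 1)) {
            System.out.println(Integer.toHexString(w));
        }
    }
}
```

Each block of a single message needs the chaining state produced by the previous block, so the eight lanes cannot hold eight consecutive blocks of one digest. They have to come from eight separate messages.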
On Wed, Jul 7, 2021 at 2:47 PM Michael Ennen <mike.ennen at gmail.com> wrote:
> So far I have tried to copy the upstream code verbatim until I get it to
> match the results - however I am interested in what you're suggesting. How
> would that be done?
>
> On Wed, Jul 7, 2021 at 3:03 AM Radosław Smogura <mail at smogura.eu> wrote:
>
>> Michael,
>>
>> I also wonder whether you considered using shuffles and vector operations
>> to load the int vector instead of using bytesToIntLE. Loading into two
>> vectors initially and permuting them with a shuffle might be better.
>>
>> Kind regards,
>> Rado
>> ------------------------------
>> *From:* Michael Ennen <mike.ennen at gmail.com>
>> *Sent:* Wednesday, July 7, 2021 08:17
>> *To:* Radosław Smogura <mail at smogura.eu>
>> *Cc:* Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>> panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>> *Subject:* Re: Replicating __mm256_shuffle_epi8 Intrinsic
>>
>> I figured it out. It was a mismatch of little and big endian. I can
>> reproduce the same shuffle result with:
>>
>> IntVector read8(byte[] chunk, int offset) {
>>     System.out.println("read8, offset = " + offset);
>>     IntVector ret = IntVector.fromArray(SPECIES_256, new int[] {
>>         bytesToIntLE(chunk, 0 + offset),
>>         bytesToIntLE(chunk, 64 + offset),
>>         bytesToIntLE(chunk, 128 + offset),
>>         bytesToIntLE(chunk, 192 + offset),
>>         bytesToIntLE(chunk, 256 + offset),
>>         bytesToIntLE(chunk, 320 + offset),
>>         bytesToIntLE(chunk, 384 + offset),
>>         bytesToIntLE(chunk, 448 + offset)}, 0);
>>     System.out.println("read8 in: " + bytesToIntLE(chunk, 0 + offset) +
>>         ", " + bytesToIntLE(chunk, 64 + offset) +
>>         ", " + bytesToIntLE(chunk, 128 + offset) +
>>         ", " + bytesToIntLE(chunk, 192 + offset) +
>>         ", " + bytesToIntLE(chunk, 256 + offset) +
>>         ", " + bytesToIntLE(chunk, 320 + offset) +
>>         ", " + bytesToIntLE(chunk, 384 + offset) +
>>         ", " + bytesToIntLE(chunk, 448 + offset));
>>
>>     var shuffle = VectorShuffle.fromArray(ByteVector.SPECIES_256, new int[] {
>>         12,13,14,15, 8, 9,10,11,
>>          4, 5, 6, 7, 0, 1, 2, 3,
>>         12,13,14,15, 8, 9,10,11,
>>          4, 5, 6, 7, 0, 1, 2, 3 }, 0);
>>
>>     ByteVector shuffled = ret.reinterpretAsBytes().rearrange(shuffle,
>>         shuffle.laneIsValid());
>>
>>     System.out.println("read8 after shuffle: " +
>>         IntVector.fromByteArray(SPECIES_256, shuffled.toArray(), 0,
>>             ByteOrder.BIG_ENDIAN));
>>     return IntVector.fromByteArray(SPECIES_256, shuffled.toArray(), 0,
>>         ByteOrder.BIG_ENDIAN);
>> }
>>
>> Thanks for all your help.
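
The little/big endian mismatch mentioned above can be checked without the Vector API. The sketch below assumes that the values quoted earlier in the thread are the ASCII bytes "abcd" and "mnop" read as 32-bit words, which is what the numbers work out to. EndianCheck, readLE and readBE are hypothetical names used only for this illustration.

```java
// Illustrative check only: EndianCheck and its helpers are hypothetical names.
final class EndianCheck {
    // Reads four bytes as a little-endian 32-bit word (first byte is least significant).
    static int readLE(byte[] b, int pos) {
        return  (b[pos] & 0xff)
             | ((b[pos + 1] & 0xff) << 8)
             | ((b[pos + 2] & 0xff) << 16)
             | ((b[pos + 3] & 0xff) << 24);
    }

    // Reads four bytes as a big-endian 32-bit word (first byte is most significant).
    static int readBE(byte[] b, int pos) {
        return ((b[pos] & 0xff) << 24)
             | ((b[pos + 1] & 0xff) << 16)
             | ((b[pos + 2] & 0xff) << 8)
             |  (b[pos + 3] & 0xff);
    }

    public static void main(String[] args) {
        byte[] abcd = {'a', 'b', 'c', 'd'};
        byte[] mnop = {'m', 'n', 'o', 'p'};
        System.out.println(readLE(abcd, 0)); // 1684234849, a value read in by Read8
        System.out.println(readLE(mnop, 0)); // 1886350957, a value read in by Read8
        System.out.println(readBE(abcd, 0)); // 1633837924, a value after the shuffle
        System.out.println(readBE(mnop, 0)); // 1835954032, a value after the shuffle
        // Swapping the byte order of a word is exactly Integer.reverseBytes.
        System.out.println(Integer.reverseBytes(readLE(abcd, 0)) == readBE(abcd, 0)); // true
    }
}
```

Seen this way, the expected shuffle output is the byte-reversed form of the input words, which is consistent with reading the same bytes in the other byte order.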
>>
>> On Tue, Jul 6, 2021 at 11:10 PM Michael Ennen <mike.ennen at gmail.com>
>> wrote:
>>
>> Oh my gosh how embarrassing! I have been tweaking things so much in this
>> code I really needed to step back and take a closer look.
>>
>> I still don't get the right result (matching this:
>> https://github.com/brcolow/bitcoin-sha256/blob/master/src/sha256_avx2.cpp#L70
>> ).
>>
>> I will keep trying, though.
>>
>> On Tue, Jul 6, 2021 at 2:48 PM Radosław Smogura <mail at smogura.eu> wrote:
>>
>> Hi Michael,
>>
>> intoArray is not only for VectorShuffle, and I think it's the preferred way
>> to load and store data (as Sandhya did).
>>
>> Maybe this sounds too simple, but are you absolutely sure that this line
>> should look like this, and that it should not print the shuffled vector?
>> System.out.println("read8 returns: " + ret); :)
>>
>> Kind regards,
>> Rado
>> ------------------------------
>> *From:* Michael Ennen <mike.ennen at gmail.com>
>> *Sent:* Tuesday, July 6, 2021 21:29
>> *To:* Radosław Smogura <mail at smogura.eu>
>> *Cc:* Viswanathan, Sandhya <sandhya.viswanathan at intel.com>;
>> panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>> *Subject:* Re: Replicating __mm256_shuffle_epi8 Intrinsic
>>
>> I am a bit confused by your example code. selectFrom returns a `Vector`,
>> but then you are calling `intoArray` which is a method only for
>> `VectorShuffle`.
>>
>> In addition to that - do you think you could use variable names from the
>> example for clarity:
>>
>>
>> https://github.com/brcolow/vector-sha256/blob/master/src/main/java/com/brcolow/vectorsha256/VectorSHA256.java#L450
>>
>> Thank you very much.
>>
>> On Tue, Jul 6, 2021 at 8:23 AM Radosław Smogura <mail at smogura.eu> wrote:
>>
>> Hi Michael,
>>
>> Shuffling can be problematic sometimes.
>>
>> I wonder if you tried something like this
>>
>> byteSwap = VectorShuffle.fromArray(BYTE_VECTOR_SPECIES, shuffleArr, 0);
>>
>> final var byteSwapVector = byteSwap.toVector();
>>
>> final var srcVector = ByteVector.fromArray(BYTE_VECTOR_SPECIES, src, i);
>> final var dstVector = byteSwapVector.selectFrom(srcVector);
>>
>> dstVector.intoArray(dst, i);
>>
>> Kind regards,
>> Rado
>>
>> ------------------------------
>> *From:* panama-dev <panama-dev-retn at openjdk.java.net> on behalf of
>> Michael Ennen <mike.ennen at gmail.com>
>> *Sent:* Tuesday, July 6, 2021 06:54
>> *To:* Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
>> *Cc:* panama-dev at openjdk.java.net <panama-dev at openjdk.java.net>
>> *Subject:* Re: Replicating __mm256_shuffle_epi8 Intrinsic
>>
>> I understand that representing the input and output as 32-bit integers is
>> kind of confusing, but the point is that the shuffle as written isn't
>> doing anything. I have also tried:
>>
>> var shuffle = VectorShuffle.fromArray(ByteVector.SPECIES_256,
>>     new int[] {
>>         12,13,14,15, 8, 9,10,11,
>>          4, 5, 6, 7, 0, 1, 2, 3,
>>         12,13,14,15, 8, 9,10,11,
>>          4, 5, 6, 7, 0, 1, 2, 3 }, 0);
>>
>> But still the returned vector is the same.
>>
>> On Sun, Jul 4, 2021 at 10:14 PM Michael Ennen <mike.ennen at gmail.com>
>> wrote:
>>
>> > Thanks for your assistance. I am having trouble replicating the same
>> > results from the first shuffle done in Bitcoin's SHA-AVX2:
>> >
>> > The following 8 integers are read in:
>> >
>> > Read8: 1684234849, 1886350957, 1684234849, 1886350957, 1684234849,
>> > 1886350957, 1684234849, 1886350957
>> >
>> > These 8 integers are shuffled with _mm256_shuffle_epi8 and the result is:
>> >
>> > 1835954032, 1633837924, 1835954032, 1633837924, 1835954032, 1633837924,
>> > 1835954032, 1633837924
>> >
>> > But using your suggested code:
>> >
>> > var shuffle = VectorShuffle.fromOp(ByteVector.SPECIES_256, (i -> ((8+i)%16)));
>> > ByteVector shuffled = ret.reinterpretAsBytes().rearrange(shuffle,
>> >     shuffle.laneIsValid());
>> > return IntVector.fromByteArray(SPECIES_256, shuffled.toArray(), 0,
>> >     ByteOrder.LITTLE_ENDIAN);
>> >
>> > I get:
>> >
>> > 1684234849, 1886350957, 1684234849, 1886350957, 1684234849, 1886350957,
>> > 1684234849, 1886350957
>> >
>> > That is, the numbers don't seem to be changed.
>> >
>> > Thanks for your help.
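
One possible explanation for the unchanged numbers above is that the test data repeats every 8 bytes, so rotating each group of 16 bytes by 8 maps every byte onto an identical byte. The scalar model below is a sketch under that assumption. RearrangeModel and its methods are hypothetical names, not part of jdk.incubator.vector, and the model only covers in-range source indexes and a little-endian byte layout.

```java
// Scalar model only; RearrangeModel is a hypothetical name, not a Vector API class.
final class RearrangeModel {
    // Models rearrange for in-range indexes: output lane i takes input lane src[i].
    static byte[] rearrange(byte[] in, int[] src) {
        byte[] out = new byte[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[src[i]];
        }
        return out;
    }

    // Source indexes that the lambda i -> (8 + i) % 16 yields for `lanes` lanes.
    static int[] rotateBy8(int lanes) {
        int[] src = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            src[i] = (8 + i) % 16;
        }
        return src;
    }

    // Lays out 32-bit words as little-endian bytes (an assumption of this sketch).
    static byte[] toBytesLE(int[] words) {
        byte[] out = new byte[words.length * 4];
        for (int w = 0; w < words.length; w++) {
            for (int b = 0; b < 4; b++) {
                out[4 * w + b] = (byte) (words[w] >>> (8 * b));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] words = {1684234849, 1886350957, 1684234849, 1886350957,
                       1684234849, 1886350957, 1684234849, 1886350957};
        byte[] in = toBytesLE(words);
        byte[] out = rearrange(in, rotateBy8(in.length));
        // The data has a period of 8 bytes, so the rotation is invisible.
        System.out.println(java.util.Arrays.equals(in, out)); // true
    }
}
```

Feeding the same model 32 distinct byte values does change the output, which suggests the shuffle itself runs and only this particular input hides its effect.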
>> >
>> > On Thu, Jul 1, 2021 at 11:18 AM Viswanathan, Sandhya <
>> > sandhya.viswanathan at intel.com> wrote:
>> >
>> >> Hi Michael,
>> >>
>> >> The rearrange() api should generate pshufb.
>> >>
>> >> e.g. for the following Java code:
>> >>
>> >> static final int SIZE = 1024;
>> >> static byte[] a = new byte[SIZE];
>> >> static byte[] r = new byte[SIZE];
>> >>
>> >> static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;
>> >> static final VectorShuffle<Byte> HIGHTOLOW =
>> >>     VectorShuffle.fromOp(SPECIES, (i -> ((8+i)%16)));
>> >>
>> >> static void workload() {
>> >>     VectorShuffle<Byte> vshuf = HIGHTOLOW;
>> >>
>> >>     for (int i = 0; i <= a.length - SPECIES.length(); i += SPECIES.length()) {
>> >>         var av = ByteVector.fromArray(SPECIES, a, i);
>> >>         var bv = av.rearrange(vshuf);
>> >>         bv.intoArray(r, i);
>> >>     }
>> >> }
>> >>
>> >> We generate the following code for the loop:
>> >> 0x00007fc388fa3180: vmovdqu 0x10(%rsi),%xmm1
>> >> 0x00007fc388fa3185: vmovdqu 0x10(%r14),%xmm2
>> >> 0x00007fc388fa318b: movslq %eax,%r10
>> >> 0x00007fc388fa318e: vmovdqu 0x10(%rbp,%r10,1),%xmm3
>> >> 0x00007fc388fa3195: vpcmpgtb %xmm2,%xmm1,%xmm1
>> >> 0x00007fc388fa3199: vptest %xmm0,%xmm1
>> >> 0x00007fc388fa319e: setne %r13b
>> >> 0x00007fc388fa31a2: movzbl %r13b,%r13d
>> >> 0x00007fc388fa31a6: test %r13d,%r13d
>> >> 0x00007fc388fa31a9: jne 0x00007fc388fa31e2
>> >> 0x00007fc388fa31ab: vpshufb %xmm2,%xmm3,%xmm3
>> >> 0x00007fc388fa31b0: vmovdqu %xmm3,0x10(%r8,%r10,1)
>> >> 0x00007fc388fa31b7: add $0x10,%eax
>> >> 0x00007fc388fa31ba: cmp %ebx,%eax
>> >> 0x00007fc388fa31bc: jl 0x00007fc388fa3180
>> >>
>> >> Best Regards,
>> >> Sandhya
>> >>
>> >>
>> >> -----Original Message-----
>> >> From: panama-dev <panama-dev-retn at openjdk.java.net> On Behalf Of Michael Ennen
>> >> Sent: Tuesday, June 29, 2021 11:20 PM
>> >> To: panama-dev at openjdk.java.net
>> >> Subject: Replicating __mm256_shuffle_epi8 Intrinsic
>> >>
>> >> I am trying to implement SHA-256 using the new Java Vector API.
>> >>
>> >> I have read the API docs, but for someone who knows very little SIMD,
>> >> bridging the large mental gap between SIMD instructions and the API has
>> >> been insurmountable.
>> >>
>> >> My question has been asked on Stack Overflow:
>> >>
>> >>
>> >> https://stackoverflow.com/questions/68135596/replicating-mm256-shuffle-epi8-intrinsic-with-java-vector-api-shuffle
>> >>
>> >> It is quite a simple question (to ask, anyway): how do I replicate the
>> >> _mm256_shuffle_epi8 intrinsic with the Java Vector API?
>> >>
>> >> Thanks very much.
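
As background for the question above, here is a scalar reference model of what _mm256_shuffle_epi8 computes. The instruction shuffles each 128-bit half independently, uses the low four bits of every control byte as an index within that half, and writes zero when the control byte's top bit is set. The class ShuffleEpi8Model and the helper toWholeVectorIndexes are hypothetical names for this sketch. The helper assumes a control with no top bits set and shows one way such a control could be turned into whole-vector source indexes for a 32-lane byte shuffle.

```java
// Scalar reference model; ShuffleEpi8Model and its method names are hypothetical.
final class ShuffleEpi8Model {
    // Models _mm256_shuffle_epi8 on 32 bytes: each 16-byte half is shuffled on its own.
    static byte[] shuffleEpi8(byte[] a, byte[] ctrl) {
        byte[] out = new byte[32];
        for (int i = 0; i < 32; i++) {
            int base = i & 16; // 0 for the low half, 16 for the high half
            if ((ctrl[i] & 0x80) != 0) {
                out[i] = 0; // top bit set: the result byte is zero
            } else {
                out[i] = a[base + (ctrl[i] & 0x0f)]; // low 4 bits index within the half
            }
        }
        return out;
    }

    // Converts an in-half control (assumed to have no top bits set) into source
    // indexes that address the whole 32-byte vector.
    static int[] toWholeVectorIndexes(byte[] ctrl) {
        int[] src = new int[32];
        for (int i = 0; i < 32; i++) {
            src[i] = (i & 16) + (ctrl[i] & 0x0f);
        }
        return src;
    }

    public static void main(String[] args) {
        byte[] a = new byte[32];
        byte[] ctrl = new byte[32];
        for (int i = 0; i < 32; i++) {
            a[i] = (byte) i;
            ctrl[i] = (byte) ((i & 12) + 3 - (i & 3)); // reverse bytes in each 32-bit word
        }
        System.out.println(java.util.Arrays.toString(shuffleEpi8(a, ctrl)));
        System.out.println(java.util.Arrays.toString(toWholeVectorIndexes(ctrl)));
    }
}
```

An index array that repeats 0..15 in its upper half, like the 32-entry arrays tried elsewhere in this thread, takes the upper 16 output bytes from the low half of the input. With test data that repeats every 8 bytes that difference does not show, but with real message data it probably would.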
>> >>
>> >> --
>> >> Michael Ennen
>> >>
--
Michael Ennen