RFR: 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets. [v3]
Jatin Bhateja
jbhateja at openjdk.org
Fri Aug 19 18:58:51 UTC 2022
On Wed, 17 Aug 2022 23:19:02 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:
>> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
>>
>> - 8290322: Review comments resolution.
>> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290322
>> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290322
>> - 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets.
>
> With the above suggestions, the rearrange_bytes() kernel would look something like the following:
> movl(rtmp, 16);
> evpbroadcastb(xtmp2, rtmp, vlen_enc);
>
> // Compute a mask over the shuffle vector by evaluating INDEX < 16 for each index,
> // broadcast the first 128 bit lane across the entire vector, shuffle the vector lanes
> // using the original shuffle indices, and move the shuffled bytes whose mask bit is
> // set to the destination vector.
> evpcmpb(ktmp2, k0, shuffle, xtmp2, Assembler::lt, true, vlen_enc);
> evshuffi64x2(xtmp3, k0, src, src, 0x0, false, vlen_enc);
> evpshufb(dst, ktmp2, xtmp3, shuffle, false, vlen_enc);
>
> // Perform above steps with lane comparison expression as INDEX >= 16 && INDEX < 32
> // and broadcasting second 128 bit lane.
> evpcmpb(ktmp1, k0, shuffle, xtmp2, Assembler::nlt, true, vlen_enc);
> vpsllq(xtmp5, xtmp2, 0x1, vlen_enc);
> evpcmpb(ktmp2, ktmp1, shuffle, xtmp5, Assembler::lt, true, vlen_enc);
> evshuffi64x2(xtmp3, k0, src, src, 0x55, false, vlen_enc);
> evpshufb(dst, ktmp2, xtmp3, shuffle, true, vlen_enc);
>
> // Perform above steps with lane comparison expression as INDEX >= 32 && INDEX < 48
> // and broadcasting third 128 bit lane.
> evpcmpb(ktmp1, k0, shuffle, xtmp5, Assembler::nlt, true, vlen_enc);
> vpaddb(xtmp5, xtmp2, xtmp5, vlen_enc);
> evpcmpb(ktmp2, ktmp1, shuffle, xtmp5, Assembler::lt, true, vlen_enc);
> evshuffi64x2(xtmp3, k0, src, src, 0xaa, false, vlen_enc);
> evpshufb(dst, ktmp2, xtmp3, shuffle, true, vlen_enc);
>
> // Perform above steps with lane comparison expression as INDEX >= 48 && INDEX < 64
> // and broadcasting fourth 128 bit lane.
> evpcmpb(ktmp1, k0, shuffle, xtmp5, Assembler::nlt, true, vlen_enc);
> vpsllq(xtmp5, xtmp2, 0x2, vlen_enc);
> evpcmpb(ktmp2, ktmp1, shuffle, xtmp5, Assembler::lt, true, vlen_enc);
> evshuffi64x2(xtmp3, k0, src, src, 0xff, false, vlen_enc);
> evpshufb(dst, ktmp2, xtmp3, shuffle, true, vlen_enc);
>
> The number of xtmp and ktmp registers could also be further reduced.
Hi @sviswa7, your comments have been addressed.
Hi @vnkozlov, could you kindly re-run this through your test framework?
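
For readers following the quoted kernel, here is a rough scalar sketch of what its four passes compute. It is not the HotSpot code: `rearrange_bytes_model` and `broadcast_lane_imm8` are hypothetical names introduced only for illustration, and the model assumes a 64-byte vector whose shuffle indices are all in the range [0, 63].

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec64 = std::array<std::uint8_t, 64>;

// Hypothetical helper: the evshuffi64x2 immediate that replicates one 128 bit
// lane into all four lanes repeats the 2-bit lane number four times,
// giving 0x00, 0x55, 0xaa and 0xff for lanes 0..3.
constexpr std::uint8_t broadcast_lane_imm8(int lane) {
    return static_cast<std::uint8_t>(lane * 0x55);
}

// Hypothetical scalar model of the quoted kernel (assumes indices in [0, 63]).
Vec64 rearrange_bytes_model(const Vec64& src, const Vec64& shuffle) {
    Vec64 dst{};  // the first pass uses zeroing-masking, so start from zero
    for (int lane = 0; lane < 4; ++lane) {
        const int lo = 16 * lane;  // lower bound: 0, 16, 32, 48
        const int hi = lo + 16;    // upper bound: 16, 32, 48, 64
        for (int i = 0; i < 64; ++i) {
            const int idx = shuffle[i];
            assert(idx < 64 && "model assumes in-range shuffle indices");
            // Models the evpcmpb pair: mask bit i is set when lo <= idx < hi.
            if (idx >= lo && idx < hi) {
                // Models the lane broadcast followed by evpshufb, which uses
                // only the low four bits of the index within a 128 bit lane.
                dst[i] = src[lo + (idx & 0x0F)];
            }
            // Bytes with a clear mask bit keep the value from earlier passes
            // (merge-masking in passes two to four).
        }
    }
    return dst;
}
```

Since every in-range index falls in exactly one of the four ranges, each destination byte is written exactly once, which is why the merge-masked passes can accumulate into the same destination register.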
-------------
PR: https://git.openjdk.org/jdk/pull/9498