RFR: 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets. [v3]

Jatin Bhateja jbhateja at openjdk.org
Fri Aug 19 18:58:51 UTC 2022


On Wed, 17 Aug 2022 23:19:02 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:

>> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
>> 
>>  - 8290322: Review comments resolution.
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290322
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290322
>>  - 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets.
>
> With the above suggestions, the rearrange_bytes() kernel would look something like this:
>   movl(rtmp, 16);
>   evpbroadcastb(xtmp2, rtmp, vlen_enc);
> 
>   // Compute a mask for the shuffle vector by comparing indices with the expression INDEX < 16.
>   // Broadcast the first 128-bit lane across the entire vector, shuffle the vector lanes using
>   // the original shuffle indices, and move the shuffled lanes corresponding to the true
>   // mask to the destination vector.
>   evpcmpb(ktmp2, k0, shuffle, xtmp2, Assembler::lt, true, vlen_enc);
>   evshuffi64x2(xtmp3, k0, src, src, 0x0, false, vlen_enc);
>   evpshufb(dst, ktmp2, xtmp3, shuffle, false, vlen_enc);
> 
>   // Perform above steps with lane comparison expression as INDEX >= 16 && INDEX < 32
>   // and broadcasting second 128 bit lane.
> 
>   evpcmpb(ktmp1, k0, shuffle,  xtmp2, Assembler::nlt, true, vlen_enc);
>   vpsllq(xtmp5, xtmp2, 0x1, vlen_enc);
>   evpcmpb(ktmp2, ktmp1, shuffle, xtmp5, Assembler::lt, true, vlen_enc);
>   evshuffi64x2(xtmp3, k0, src, src, 0x55, false, vlen_enc);
>   evpshufb(dst, ktmp2, xtmp3, shuffle, true, vlen_enc);
> 
>   // Perform above steps with lane comparison expression as INDEX >= 32 && INDEX < 48
>   // and broadcasting third 128 bit lane.
>   evpcmpb(ktmp1, k0, shuffle,  xtmp5, Assembler::nlt, true, vlen_enc);
>   vpaddb(xtmp5, xtmp2, xtmp5, vlen_enc);
>   evpcmpb(ktmp2, ktmp1, shuffle,  xtmp5, Assembler::lt, true, vlen_enc);
>   evshuffi64x2(xtmp3, k0, src, src, 0xaa, false, vlen_enc);
>   evpshufb(dst, ktmp2, xtmp3, shuffle, true, vlen_enc);
> 
>   // Perform above steps with lane comparison expression as INDEX >= 48 && INDEX < 64
>   // and broadcasting fourth 128 bit lane.
>   evpcmpb(ktmp1, k0, shuffle,  xtmp5, Assembler::nlt, true, vlen_enc);
>   vpsllq(xtmp5, xtmp2, 0x2, vlen_enc);
>   evpcmpb(ktmp2, ktmp1, shuffle,  xtmp5, Assembler::lt, true, vlen_enc);
>   evshuffi64x2(xtmp3, k0, src, src, 0xff, false, vlen_enc);
>   evpshufb(dst, ktmp2, xtmp3, shuffle, true, vlen_enc);
> 
> The number of xtmp and ktmp registers could also be further reduced.

Hi @sviswa7, your comments have been addressed.
Hi @vnkozlov, could you kindly re-run this through your test framework?
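
For readers following the quoted kernel, here is a rough scalar model of the four-step, lane-by-lane approach. It is only an illustrative sketch, not the HotSpot code: the names Vec64 and rearrange_bytes_model are hypothetical, and it assumes a 512-bit vector and shuffle indices already in the range [0, 64).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for a 512-bit byte vector.
using Vec64 = std::array<std::uint8_t, 64>;

// Scalar model of the quoted kernel (assumption: all indices are in [0, 64)).
Vec64 rearrange_bytes_model(const Vec64& src, const Vec64& shuffle) {
    Vec64 dst{};  // starts zeroed, which matches the zero-masking first step
    for (std::size_t lane = 0; lane < 4; ++lane) {
        // Models evshuffi64x2 with imm8 0x00 / 0x55 / 0xaa / 0xff:
        // replicate one 128-bit lane of src into all four lanes.
        Vec64 bcast{};
        for (std::size_t i = 0; i < 64; ++i) {
            bcast[i] = src[lane * 16 + i % 16];
        }
        for (std::size_t i = 0; i < 64; ++i) {
            const std::size_t idx = shuffle[i];
            // Models the evpcmpb pair: 16*lane <= INDEX < 16*(lane+1).
            const bool selected = idx >= lane * 16 && idx < (lane + 1) * 16;
            if (selected) {
                // Models the masked in-lane byte shuffle: only the low 4 bits
                // of the index select a byte within the 128-bit lane of i.
                const std::size_t lane_base = (i / 16) * 16;
                dst[i] = bcast[lane_base + (idx & 0x0F)];
            }
            // Unselected bytes keep their earlier value (merge-masking).
        }
    }
    return dst;
}
```

Under those assumptions the result should match the plain definition dst[i] = src[shuffle[i]], since each index falls into exactly one of the four 16-byte ranges.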

-------------

PR: https://git.openjdk.org/jdk/pull/9498

