RFR: 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets. [v3]

Wed Aug 17 23:21:15 UTC 2022

On Tue, 16 Aug 2022 15:17:57 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi All,
>> 
>> Currently, rearrange over a 512-bit byte vector is optimized for targets supporting the AVX512_VBMI feature; this patch generates an efficient JIT sequence to handle it for AVX512BW targets. The following performance results from the newly added benchmark show a significant speedup.
>> 
>> System:  Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (CascadeLake 28C 2S)
>> 
>> 
>> Baseline:
>> =========
>> Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
>> RearrangeBytesBenchmark.testRearrangeBytes16     512  thrpt    2  16350.330          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes32     512  thrpt    2  15991.346          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes64     512  thrpt    2     34.423          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes8      512  thrpt    2  10873.348          ops/ms
>> 
>> 
>> With-opt:
>> =========
>> Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
>> RearrangeBytesBenchmark.testRearrangeBytes16     512  thrpt    2  16062.624          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes32     512  thrpt    2  16028.494          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes64     512  thrpt    2   8741.901          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes8      512  thrpt    2  10983.226          ops/ms
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
> 
>  - 8290322: Review comments resolution.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290322
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290322
>  - 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets.

With the above suggestions, the rearrange_bytes() kernel would look something like the following:
  movl(rtmp, 16);
  evpbroadcastb(xtmp2, rtmp, vlen_enc);

  // Compute a mask for the shuffle vector by comparing its indices against INDEX < 16,
  // broadcast the first 128-bit lane across the entire vector, shuffle the vector lanes
  // using the original shuffle indices, and move the shuffled bytes whose mask bit is
  // set into the destination vector.
  evpcmpb(ktmp2, k0, shuffle, xtmp2, Assembler::lt, true, vlen_enc);
  evshuffi64x2(xtmp3, k0, src, src, 0x0, false, vlen_enc);
  evpshufb(dst, ktmp2, xtmp3, shuffle, false, vlen_enc);

  // Perform the above steps with the lane comparison expression INDEX >= 16 && INDEX < 32,
  // broadcasting the second 128-bit lane.
  evpcmpb(ktmp1, k0, shuffle, xtmp2, Assembler::nlt, true, vlen_enc);
  // Every byte of xtmp2 is 16, so a quadword shift by 1 gives 32 in every byte.
  vpsllq(xtmp5, xtmp2, 0x1, vlen_enc);
  evpcmpb(ktmp2, ktmp1, shuffle, xtmp5, Assembler::lt, true, vlen_enc);
  evshuffi64x2(xtmp3, k0, src, src, 0x55, false, vlen_enc);
  evpshufb(dst, ktmp2, xtmp3, shuffle, true, vlen_enc);

  // Perform the above steps with the lane comparison expression INDEX >= 32 && INDEX < 48,
  // broadcasting the third 128-bit lane.
  evpcmpb(ktmp1, k0, shuffle, xtmp5, Assembler::nlt, true, vlen_enc);
  // 16 + 32 = 48 in every byte.
  vpaddb(xtmp5, xtmp2, xtmp5, vlen_enc);
  evpcmpb(ktmp2, ktmp1, shuffle, xtmp5, Assembler::lt, true, vlen_enc);
  evshuffi64x2(xtmp3, k0, src, src, 0xaa, false, vlen_enc);
  evpshufb(dst, ktmp2, xtmp3, shuffle, true, vlen_enc);

  // Perform the above steps with the lane comparison expression INDEX >= 48 && INDEX < 64,
  // broadcasting the fourth 128-bit lane.
  evpcmpb(ktmp1, k0, shuffle, xtmp5, Assembler::nlt, true, vlen_enc);
  // Every byte of xtmp2 is 16, so a quadword shift by 2 gives 64 in every byte.
  vpsllq(xtmp5, xtmp2, 0x2, vlen_enc);
  evpcmpb(ktmp2, ktmp1, shuffle, xtmp5, Assembler::lt, true, vlen_enc);
  evshuffi64x2(xtmp3, k0, src, src, 0xff, false, vlen_enc);
  evpshufb(dst, ktmp2, xtmp3, shuffle, true, vlen_enc);

The number of xtmp and ktmp registers could also be further reduced.
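
To make the masking logic easier to check, here is a minimal scalar sketch in C++ of what the four masked passes above compute. It is only an illustration and not HotSpot code: Vec64, rearrange_bytes_model() and rearrange_bytes_reference() are hypothetical names made up for this example, and the sketch assumes every shuffle index is already in the range [0, 64).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar stand-in for one 512-bit byte vector (64 byte lanes).
using Vec64 = std::array<std::uint8_t, 64>;

// Models the four passes of the kernel. For each 128-bit lane k of src:
//   - the range test 16*k <= idx < 16*(k+1) plays the role of ktmp2,
//   - src[16*k + (idx & 0x0F)] plays the role of broadcasting lane k
//     (evshuffi64x2) followed by an in-lane byte shuffle (evpshufb),
//     which only looks at the low 4 bits of each index,
//   - the store happens only where the mask bit is set (merge-masking).
Vec64 rearrange_bytes_model(const Vec64& src, const Vec64& shuffle) {
  Vec64 dst{};  // the first pass is zero-masked, so start from all zeros
  for (int k = 0; k < 4; ++k) {
    for (int i = 0; i < 64; ++i) {
      const int idx = shuffle[i];
      assert(idx < 64 && "sketch assumes in-range shuffle indices");
      const bool mask = idx >= 16 * k && idx < 16 * (k + 1);
      if (mask) {
        dst[i] = src[16 * k + (idx & 0x0F)];
      }
    }
  }
  return dst;
}

// Plain gather, used as the reference result for in-range indices.
Vec64 rearrange_bytes_reference(const Vec64& src, const Vec64& shuffle) {
  Vec64 dst{};
  for (int i = 0; i < 64; ++i) {
    dst[i] = src[shuffle[i]];
  }
  return dst;
}
```

Since each in-range index falls into exactly one of the four ranges, every destination byte is written exactly once, so for such inputs the model and the plain gather should agree.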

-------------

PR: https://git.openjdk.org/jdk/pull/9498