RFR: 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets. [v3]

Wed Aug 17 23:02:15 UTC 2022

On Tue, 16 Aug 2022 15:17:57 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi All,
>> 
>> Currently, rearrange over a 512-bit byte vector is optimized for targets supporting the AVX512_VBMI feature. This patch generates an efficient JIT sequence to handle it on AVX512BW targets. The following performance results from a newly added benchmark show a
>> significant speedup.
>> 
>> System:  Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (CascadeLake 28C 2S)
>> 
>> 
>> Baseline:
>> =========
>> Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
>> RearrangeBytesBenchmark.testRearrangeBytes16     512  thrpt    2  16350.330          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes32     512  thrpt    2  15991.346          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes64     512  thrpt    2     34.423          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes8      512  thrpt    2  10873.348          ops/ms
>> 
>> 
>> With-opt:
>> =========
>> Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
>> RearrangeBytesBenchmark.testRearrangeBytes16     512  thrpt    2  16062.624          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes32     512  thrpt    2  16028.494          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes64     512  thrpt    2   8741.901          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes8      512  thrpt    2  10983.226          ops/ms
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
> 
>  - 8290322: Review comments resolution.
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290322
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290322
>  - 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets.

Very nice work. I have some point suggestions that improve the performance further by ~20%. Please take a look.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5703:

> 5701:   evpcmpb(ktmp2, k0, shuffle, xtmp2, Assembler::lt, true, vlen_enc);
> 5702:   vpermq(xtmp3, src, 0x44, vlen_enc);
> 5703:   vinserti64x4(xtmp3, xtmp3, xtmp3, 0x1);

The evshuffi64x2 instruction could be used here:
evshuffi64x2(xtmp3, k0, src, src, 0x0, false, vlen_enc);
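
To illustrate why one shuffle can stand in for the multi-instruction sequence, here is a minimal scalar sketch in C++. It is not HotSpot code: Zmm, vpermq, vextracti64x4_high, vinserti64x4, vshufi64x2, original_sequence and check_lane_broadcast are hypothetical helpers that model my reading of the instruction semantics on eight 64-bit elements. The same model also covers the 0x55, 0xaa and 0xff immediates suggested further down.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of a 512-bit register as eight 64-bit elements.
using Zmm = std::array<uint64_t, 8>;

// VPERMQ, immediate form: each 256-bit half permutes its own four qwords
// using the same 2-bit selectors.
Zmm vpermq(const Zmm& src, uint8_t imm) {
  Zmm dst{};
  for (int half = 0; half < 2; ++half)
    for (int i = 0; i < 4; ++i)
      dst[half * 4 + i] = src[half * 4 + ((imm >> (2 * i)) & 3)];
  return dst;
}

// VEXTRACTI64X4 with imm 1: the upper 256 bits move to the lower half and
// the upper half of the destination is zeroed.
Zmm vextracti64x4_high(const Zmm& src) {
  Zmm dst{};
  for (int i = 0; i < 4; ++i) dst[i] = src[4 + i];
  return dst;
}

// VINSERTI64X4: copy nds, then overwrite the 256-bit half chosen by imm
// with the low 256 bits of src.
Zmm vinserti64x4(const Zmm& nds, const Zmm& src, int imm) {
  Zmm dst = nds;
  for (int i = 0; i < 4; ++i) dst[(imm & 1) * 4 + i] = src[i];
  return dst;
}

// VSHUFI64X2: 128-bit lanes 0-1 are picked from src1 and lanes 2-3 from
// src2, each by its own 2-bit field of imm.
Zmm vshufi64x2(const Zmm& src1, const Zmm& src2, uint8_t imm) {
  Zmm dst{};
  for (int lane = 0; lane < 4; ++lane) {
    const Zmm& from = (lane < 2) ? src1 : src2;
    int sel = (imm >> (2 * lane)) & 3;
    dst[2 * lane] = from[2 * sel];
    dst[2 * lane + 1] = from[2 * sel + 1];
  }
  return dst;
}

// The sequences under review: broadcast 128-bit lane `lane` of src.
Zmm original_sequence(const Zmm& src, int lane) {
  Zmm t = vpermq(src, (lane & 1) ? 0xEE : 0x44);
  if (lane >= 2) t = vextracti64x4_high(t);
  return vinserti64x4(t, t, 1);
}

// One shuffle with imm = lane * 0x55 (0x00, 0x55, 0xaa, 0xff) should agree.
void check_lane_broadcast(const Zmm& src) {
  for (int lane = 0; lane < 4; ++lane) {
    uint8_t imm = static_cast<uint8_t>(lane * 0x55);
    assert(original_sequence(src, lane) == vshufi64x2(src, src, imm));
  }
}
```

Calling check_lane_broadcast on any input exercises all four lane broadcasts in this model.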

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5705:

> 5703:   vinserti64x4(xtmp3, xtmp3, xtmp3, 0x1);
> 5704:   vpshufb(xtmp3, xtmp3, shuffle, vlen_enc);
> 5705:   evmovdqub(dst, ktmp2, xtmp3, false, vlen_enc);

The AVX-512 version of vpshufb takes a K (mask) register and a merge flag as inputs.
The two instructions above can then be replaced by:
evpshufb(dst, ktmp2, xtmp3, shuffle, false, vlen_enc);
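
As a rough illustration of the masked form, the sketch below models the byte shuffle and the masked write on plain arrays. Bytes64, the function names and their argument order are hypothetical and only mirror the assembler calls quoted above; the semantics are my understanding of the instructions, not taken from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of a 512-bit register as 64 bytes.
using Bytes64 = std::array<uint8_t, 64>;

// Unmasked VPSHUFB: every 16-byte lane is shuffled on its own using the
// low 4 bits of each index; an index with bit 7 set produces zero.
Bytes64 vpshufb(const Bytes64& src, const Bytes64& idx) {
  Bytes64 out{};
  for (int i = 0; i < 64; ++i)
    out[i] = (idx[i] & 0x80) ? uint8_t{0} : src[(i / 16) * 16 + (idx[i] & 0x0F)];
  return out;
}

// Masked byte move: where the mask bit is clear, keep dst (merge == true)
// or write zero (merge == false).
Bytes64 evmovdqub(const Bytes64& dst, uint64_t k, const Bytes64& src, bool merge) {
  Bytes64 out{};
  for (int i = 0; i < 64; ++i)
    out[i] = ((k >> i) & 1) ? src[i] : (merge ? dst[i] : uint8_t{0});
  return out;
}

// Masked VPSHUFB: shuffle and masked write in a single step.
Bytes64 evpshufb(const Bytes64& dst, uint64_t k, const Bytes64& src,
                 const Bytes64& idx, bool merge) {
  Bytes64 out{};
  for (int i = 0; i < 64; ++i) {
    uint8_t shuffled =
        (idx[i] & 0x80) ? uint8_t{0} : src[(i / 16) * 16 + (idx[i] & 0x0F)];
    out[i] = ((k >> i) & 1) ? shuffled : (merge ? dst[i] : uint8_t{0});
  }
  return out;
}

// The two-instruction form and the fused form should agree for either
// setting of the merge flag.
void check_fused_shuffle(const Bytes64& dst, uint64_t k, const Bytes64& src,
                         const Bytes64& idx) {
  for (bool merge : {false, true})
    assert(evmovdqub(dst, k, vpshufb(src, idx), merge) ==
           evpshufb(dst, k, src, idx, merge));
}
```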

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5712:

> 5710:   vpsllq(xtmp5, xtmp2, 0x1, vlen_enc);
> 5711:   evpcmpb(ktmp2, k0, shuffle, xtmp5, Assembler::lt, true, vlen_enc);
> 5712:   kandql(ktmp2, ktmp1, ktmp2);

This can be replaced by:
evpcmpb(ktmp2, ktmp1, shuffle, xtmp5, Assembler::lt, true, vlen_enc);
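
The reasoning is that a compare issued under a write mask only sets result bits where the mask is set, so the separate AND becomes redundant. A minimal scalar sketch, assuming a signed less-than compare and using all ones to stand in for k0 (no masking); SignedBytes64, evpcmpb_lt, K0 and check_masked_compare are hypothetical names:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of a 512-bit register as 64 signed bytes.
using SignedBytes64 = std::array<int8_t, 64>;

// Stand-in for k0: every element participates.
constexpr uint64_t K0 = ~uint64_t{0};

// Signed byte compare (less-than) under a write mask: result bit i is set
// only if mask bit i is set and a[i] < b[i].
uint64_t evpcmpb_lt(uint64_t k_in, const SignedBytes64& a, const SignedBytes64& b) {
  uint64_t k_out = 0;
  for (int i = 0; i < 64; ++i)
    if (((k_in >> i) & 1) && a[i] < b[i]) k_out |= uint64_t{1} << i;
  return k_out;
}

// Unmasked compare followed by an AND should equal the masked compare.
void check_masked_compare(uint64_t k1, const SignedBytes64& a, const SignedBytes64& b) {
  uint64_t two_step = k1 & evpcmpb_lt(K0, a, b);
  assert(two_step == evpcmpb_lt(k1, a, b));
}
```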

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5714:

> 5712:   kandql(ktmp2, ktmp1, ktmp2);
> 5713:   vpermq(xtmp3, src,  0xEE, vlen_enc);
> 5714:   vinserti64x4(xtmp3, xtmp3, xtmp3, 0x1);

The evshuffi64x2 instruction could be used here:
evshuffi64x2(xtmp3, k0, src, src, 0x55, false, vlen_enc);

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5717:

> 5715:   vpshufb(xtmp3, xtmp3, shuffle, vlen_enc);
> 5716:   evmovdqub(xtmp4, ktmp2, xtmp3, false, vlen_enc);
> 5717:   vporq(dst, dst, xtmp4, vlen_enc);

This can be replaced by:
evpshufb(dst, ktmp2, xtmp3, shuffle, true, vlen_enc);
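
Replacing the zero-masked move plus OR with a merge-masked write relies on dst being zero at every position the mask selects. My reading of the quoted context is that this holds here, since the first write is zero-masked and the later masks pick disjoint index ranges, but that is an inference from the fragments above. A small scalar sketch of the equivalence and of its precondition (all names hypothetical):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of a 512-bit register as 64 bytes.
using Bytes64 = std::array<uint8_t, 64>;

// Zero-masked move of val, then OR into dst (the evmovdqub + vporq form).
Bytes64 mask_then_or(const Bytes64& dst, uint64_t k, const Bytes64& val) {
  Bytes64 out{};
  for (int i = 0; i < 64; ++i) {
    uint8_t masked = ((k >> i) & 1) ? val[i] : uint8_t{0};
    out[i] = static_cast<uint8_t>(dst[i] | masked);
  }
  return out;
}

// Merge-masked write: selected positions take val, the rest keep dst.
Bytes64 merge_masked(const Bytes64& dst, uint64_t k, const Bytes64& val) {
  Bytes64 out = dst;
  for (int i = 0; i < 64; ++i)
    if ((k >> i) & 1) out[i] = val[i];
  return out;
}

// Precondition for the rewrite: dst is zero wherever k is set.
bool dst_clear_under_mask(const Bytes64& dst, uint64_t k) {
  for (int i = 0; i < 64; ++i)
    if (((k >> i) & 1) && dst[i] != 0) return false;
  return true;
}

// When the precondition holds, both forms produce the same bytes.
void check_merge_equals_or(const Bytes64& dst, uint64_t k, const Bytes64& val) {
  assert(dst_clear_under_mask(dst, k));
  assert(mask_then_or(dst, k, val) == merge_masked(dst, k, val));
}
```

If dst held non-zero bytes under the mask, the OR form would combine old and new bits while the merge form would overwrite them, so the two would differ.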

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5724:

> 5722:   vpaddb(xtmp5, xtmp2, xtmp5, vlen_enc);
> 5723:   evpcmpb(ktmp2, k0, shuffle,  xtmp5, Assembler::lt, true, vlen_enc);
> 5724:   kandql(ktmp2, ktmp1 , ktmp2);

This can be replaced by:
evpcmpb(ktmp2, ktmp1, shuffle, xtmp5, Assembler::lt, true, vlen_enc);

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5727:

> 5725:   vpermq(xtmp3, src,  0x44, vlen_enc);
> 5726:   vextracti64x4_high(xtmp3, xtmp3);
> 5727:   vinserti64x4(xtmp3, xtmp3, xtmp3, 0x1);

The evshuffi64x2 instruction could be used here:
evshuffi64x2(xtmp3, k0, src, src, 0xaa, false, vlen_enc);

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5730:

> 5728:   vpshufb(xtmp3, xtmp3, shuffle, vlen_enc);
> 5729:   evmovdqub(xtmp4, ktmp2, xtmp3, false, vlen_enc);
> 5730:   vporq(dst, dst, xtmp4, vlen_enc);

This can be replaced by:
evpshufb(dst, ktmp2, xtmp3, shuffle, true, vlen_enc);

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5737:

> 5735:   vpsllq(xtmp5, xtmp2, 0x2, vlen_enc);
> 5736:   evpcmpb(ktmp2, k0, shuffle,  xtmp5, Assembler::lt, true, vlen_enc);
> 5737:   kandql(ktmp2, ktmp1 , ktmp2);

This can be replaced by:
evpcmpb(ktmp2, ktmp1, shuffle, xtmp5, Assembler::lt, true, vlen_enc);

You could also use ktmp1 as the destination of the above instruction and in the instruction that consumes the resulting mask, thereby removing the ktmp2 usage altogether.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5740:

> 5738:   vpermq(xtmp3, src,  0xEE, vlen_enc);
> 5739:   vextracti64x4_high(xtmp3, xtmp3);
> 5740:   vinserti64x4(xtmp3, xtmp3, xtmp3, 0x1);

The evshuffi64x2 instruction could be used here:
evshuffi64x2(xtmp3, k0, src, src, 0xff, false, vlen_enc);

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5743:

> 5741:   vpshufb(xtmp3, xtmp3, shuffle, vlen_enc);
> 5742:   evmovdqub(xtmp4, ktmp2, xtmp3, false, vlen_enc);
> 5743:   vporq(dst, dst, xtmp4, vlen_enc);

This can be replaced by:
evpshufb(dst, ktmp2, xtmp3, shuffle, true, vlen_enc);
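
Putting the suggestions together, the overall shape can be sketched in scalar C++ as four rounds of "broadcast one 128-bit lane, shuffle within lanes, write under a mask". This is a reconstruction from the fragments quoted above, not the patch itself; it assumes every shuffle index lies in [0, 64), and rearrange_reference and rearrange_by_lanes are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of a 512-bit register as 64 bytes.
using Bytes64 = std::array<uint8_t, 64>;

// Reference semantics: dst[i] = src[shuffle[i]] for indices in [0, 64).
Bytes64 rearrange_reference(const Bytes64& src, const Bytes64& shuffle) {
  Bytes64 dst{};
  for (int i = 0; i < 64; ++i) {
    assert(shuffle[i] < 64);  // assumed precondition of this sketch
    dst[i] = src[shuffle[i]];
  }
  return dst;
}

// Lane-by-lane form: one round per 128-bit source lane.
Bytes64 rearrange_by_lanes(const Bytes64& src, const Bytes64& shuffle) {
  Bytes64 dst{};  // starts out zero, like the zero-masked first write
  for (int lane = 0; lane < 4; ++lane) {
    for (int i = 0; i < 64; ++i) {
      // Mask bit: index i falls in [16 * lane, 16 * lane + 16).
      bool k = (shuffle[i] / 16) == lane;
      // With the source lane broadcast to all four lanes, the in-lane
      // byte shuffle only needs the low 4 bits of the index.
      uint8_t shuffled = src[lane * 16 + (shuffle[i] & 0x0F)];
      // Merge-masked write: unselected positions keep earlier results.
      if (k) dst[i] = shuffled;
    }
  }
  return dst;
}
```

Each index belongs to exactly one of the four ranges, so every destination byte is written in exactly one round, which is what lets the merge-masked form drop the temporary and the OR.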

-------------

PR: https://git.openjdk.org/jdk/pull/9498