RFR: 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets. [v2]

Thu Jul 28 11:11:51 UTC 2022

On Wed, 27 Jul 2022 16:13:57 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi All,
>> 
>> Currently, rearrange over 512-bit byte vectors is optimized only for targets supporting the AVX512_VBMI feature. This patch generates an efficient JIT sequence to handle it on AVX512BW targets. The following performance results from the newly added benchmark show a
>> significant speedup.
>> 
>> System:  Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (CascadeLake 28C 2S)
>> 
>> 
>> Baseline:
>> =========
>> Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
>> RearrangeBytesBenchmark.testRearrangeBytes16     512  thrpt    2  16350.330          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes32     512  thrpt    2  15991.346          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes64     512  thrpt    2     34.423          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes8      512  thrpt    2  10873.348          ops/ms
>> 
>> 
>> With-opt:
>> =========
>> Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
>> RearrangeBytesBenchmark.testRearrangeBytes16     512  thrpt    2  16062.624          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes32     512  thrpt    2  16028.494          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes64     512  thrpt    2   8741.901          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes8      512  thrpt    2  10983.226          ops/ms
>> 
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision:
> 
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290322
>  - 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets.

Otherwise looks good to me. Thanks.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5568:

> 5566: #endif
> 5567: 
> 5568: void C2_MacroAssembler::rearrange_bytes(XMMRegister dst, XMMRegister shuffle, XMMRegister src, XMMRegister xtmp1,

Can we use the same approach as the one used for 256-bit vectors? Something similar to:

    vpshufb(xtmp1, src, shuffle); // All elements are at the correct place modulo 16
    vpxor(dst, dst, dst);
    vpslld(xtmp2, shuffle, 3); // Push the bit signifying the parity of the 128-bit lane (index bit 4) to the sign bit
    vpcmpb(ktmp, xtmp2, dst, lt);
    vshufi32x4(xtmp2, xtmp1, xtmp1, 0b10110001); // Shuffle the 128-bit lanes to get 1 - 0 - 3 - 2
    vpblendmb(xtmp1, ktmp, xtmp1, xtmp2); // All elements are at the correct place modulo 32
    vpslld(xtmp2, shuffle, 2); // Push the bit signifying the parity of the 256-bit lane (index bit 5) to the sign bit
    vpcmpb(ktmp, xtmp2, dst, lt);
    vshufi32x4(xtmp2, xtmp1, xtmp1, 0b01001110); // Shuffle the 128-bit lanes to get 2 - 3 - 0 - 1
    vpblendmb(dst, ktmp, xtmp1, xtmp2); // All elements are at the correct place modulo 64
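
To make the building blocks in this discussion easier to check, here is a minimal scalar model in Java. It is only a sketch, not the sequence emitted by the patch or by the suggestion above. It assumes all shuffle indices are in the range 0-63, models the in-lane byte shuffle by its low-4-bit lookup only, and composes a full 64-byte rearrange by broadcasting each source lane, shuffling in-lane, and keeping the bytes whose index bits 4 and 5 name that lane. The class name and all method names are hypothetical.

```java
import java.util.Arrays;

public class RearrangeModel {
    static final int LANE_BYTES = 16;    // one 128-bit lane
    static final int VECTOR_BYTES = 64;  // one 512-bit vector

    // Reference semantics of a byte rearrange: dst[i] = src[shuffle[i]].
    static byte[] rearrangeReference(byte[] src, byte[] shuffle) {
        byte[] dst = new byte[VECTOR_BYTES];
        for (int i = 0; i < VECTOR_BYTES; i++) {
            dst[i] = src[shuffle[i] & 63];
        }
        return dst;
    }

    // Model of an in-lane byte shuffle: each output byte reads from its own
    // 128-bit lane, using only the low 4 bits of the index.
    // (Assumption: index bit 7 is never set, so no byte is zeroed.)
    static byte[] inLaneShuffle(byte[] src, byte[] shuffle) {
        byte[] dst = new byte[VECTOR_BYTES];
        for (int i = 0; i < VECTOR_BYTES; i++) {
            int laneBase = (i / LANE_BYTES) * LANE_BYTES;
            dst[i] = src[laneBase + (shuffle[i] & 15)];
        }
        return dst;
    }

    // Copy one 128-bit lane of src into all four lanes of the result.
    static byte[] broadcastLane(byte[] src, int lane) {
        byte[] dst = new byte[VECTOR_BYTES];
        for (int i = 0; i < VECTOR_BYTES; i++) {
            dst[i] = src[lane * LANE_BYTES + (i % LANE_BYTES)];
        }
        return dst;
    }

    // Shifting left by 3 moves index bit 4 into the byte's sign bit;
    // shifting left by 2 does the same for index bit 5.
    static boolean signBitAfterShift(byte index, int shift) {
        return (byte) (index << shift) < 0;
    }

    // Full rearrange built only from lane broadcast, in-lane shuffle and select.
    static byte[] rearrangeLaneWise(byte[] src, byte[] shuffle) {
        byte[] dst = new byte[VECTOR_BYTES];
        for (int lane = 0; lane < 4; lane++) {
            byte[] candidate = inLaneShuffle(broadcastLane(src, lane), shuffle);
            for (int i = 0; i < VECTOR_BYTES; i++) {
                int wanted = (signBitAfterShift(shuffle[i], 2) ? 2 : 0)
                           + (signBitAfterShift(shuffle[i], 3) ? 1 : 0);
                if (wanted == lane) {
                    dst[i] = candidate[i];
                }
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] src = new byte[VECTOR_BYTES];
        byte[] shuffle = new byte[VECTOR_BYTES];
        for (int i = 0; i < VECTOR_BYTES; i++) {
            src[i] = (byte) (3 * i + 1);
            shuffle[i] = (byte) ((37 * i + 11) % VECTOR_BYTES);
        }
        boolean same = Arrays.equals(rearrangeReference(src, shuffle),
                                     rearrangeLaneWise(src, shuffle));
        System.out.println("lane-wise model matches reference: " + same);
    }
}
```

In this model the in-lane shuffle always pairs a source lane with the shuffle indices at the same output position, which is why each source lane is broadcast before shuffling rather than moved after it.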

src/hotspot/cpu/x86/x86.ad line 1851:

> 1849:       } else if (size_in_bits == 256 && UseAVX < 2) {
> 1850:         return false; // Implementation limitation
> 1851:       } else if (is_subword_type(bt) && size_in_bits > 256 && !VM_Version::supports_avx512bw()) {

This check is not needed, as 512-bit subword-type vectors are only supported on AVX512BW targets anyway.

-------------

PR: https://git.openjdk.org/jdk/pull/9498