RFR: 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets. [v3]
Sandhya Viswanathan
sviswanathan at openjdk.org
Wed Aug 17 23:02:15 UTC 2022
On Tue, 16 Aug 2022 15:17:57 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Hi All,
>>
>> Currently, rearrange over 512-bit byte vectors is optimized only for targets supporting the AVX512_VBMI feature. This patch generates an efficient JIT sequence to handle it on AVX512BW targets. The following performance results from a newly added benchmark show a
>> significant speedup.
>>
>> System: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (CascadeLake 28C 2S)
>>
>>
>> Baseline:
>> =========
>> Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
>> RearrangeBytesBenchmark.testRearrangeBytes16     512  thrpt    2  16350.330          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes32     512  thrpt    2  15991.346          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes64     512  thrpt    2     34.423          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes8      512  thrpt    2  10873.348          ops/ms
>>
>>
>> With-opt:
>> =========
>> Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
>> RearrangeBytesBenchmark.testRearrangeBytes16     512  thrpt    2  16062.624          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes32     512  thrpt    2  16028.494          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes64     512  thrpt    2   8741.901          ops/ms
>> RearrangeBytesBenchmark.testRearrangeBytes8      512  thrpt    2  10983.226          ops/ms
>>
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
>
> - 8290322: Review comments resolution.
> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290322
> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290322
> - 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets.
Very nice work. I have some specific suggestions that improve the performance further by ~20%. Please take a look.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5703:
> 5701: evpcmpb(ktmp2, k0, shuffle, xtmp2, Assembler::lt, true, vlen_enc);
> 5702: vpermq(xtmp3, src, 0x44, vlen_enc);
> 5703: vinserti64x4(xtmp3, xtmp3, xtmp3, 0x1);
The evshuffi64x2 instruction could be used here:
evshuffi64x2(xtmp3, k0, src, src, 0x0, false, vlen_enc);
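For readers less familiar with the immediate encoding, here is a rough scalar sketch of the unmasked 512-bit lane selection performed by the VSHUFI64X2 instruction (written evshuffi64x2 above). The names Vec512 and shuf_i64x2 are made up for illustration and are not HotSpot APIs. Each pair of immediate bits picks one 128-bit lane: the two low result lanes come from the first source and the two high lanes from the second. With both sources equal, 0x00 therefore broadcasts lane 0 to all four lanes.

```cpp
#include <array>
#include <cstdint>

using Vec512 = std::array<std::uint8_t, 64>;  // hypothetical 512-bit byte vector

// Scalar model of the 512-bit VSHUFI64X2 lane selection (no masking).
// Result lanes 0 and 1 are chosen from a, lanes 2 and 3 from b; each
// 2-bit field of imm8 names the 128-bit source lane to copy.
Vec512 shuf_i64x2(const Vec512& a, const Vec512& b, unsigned imm8) {
    Vec512 dst{};
    for (int lane = 0; lane < 4; ++lane) {
        const Vec512& from = (lane < 2) ? a : b;
        unsigned sel = (imm8 >> (2 * lane)) & 0x3;
        for (int i = 0; i < 16; ++i) {
            dst[16 * lane + i] = from[16 * sel + i];
        }
    }
    return dst;
}
```

Under this model, the immediates 0x00, 0x55, 0xaa and 0xff with src passed as both sources broadcast lanes 0, 1, 2 and 3 respectively.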
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5705:
> 5703: vinserti64x4(xtmp3, xtmp3, xtmp3, 0x1);
> 5704: vpshufb(xtmp3, xtmp3, shuffle, vlen_enc);
> 5705: evmovdqub(dst, ktmp2, xtmp3, false, vlen_enc);
The AVX-512 version of vpshufb takes a mask (K) register and a merge flag as inputs.
The vpshufb and evmovdqub pair can therefore be replaced by:
evpshufb(dst, ktmp2, xtmp3, shuffle, false, vlen_enc);
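To illustrate why the fold is safe, here is a small scalar sketch of a mask-predicated in-lane byte shuffle next to the two-step form (unmasked shuffle followed by a zero-masking move). The helper names are invented for this example and only model the instruction semantics as I understand them; they are not HotSpot code.

```cpp
#include <array>
#include <cstdint>

using Vec512 = std::array<std::uint8_t, 64>;  // hypothetical 512-bit byte vector

// Scalar model of a mask-predicated, in-lane byte shuffle (AVX-512 VPSHUFB).
// Bit j of k enables byte j; disabled bytes keep dst_in[j] when merge is
// true and become zero otherwise.
Vec512 masked_pshufb(const Vec512& dst_in, std::uint64_t k,
                     const Vec512& src, const Vec512& idx, bool merge) {
    Vec512 dst{};
    for (int j = 0; j < 64; ++j) {
        if ((k >> j) & 1u) {
            int lane_base = j & ~15;  // shuffles never cross 128-bit lanes
            dst[j] = (idx[j] & 0x80) ? std::uint8_t{0}
                                     : src[lane_base + (idx[j] & 0x0F)];
        } else {
            dst[j] = merge ? dst_in[j] : std::uint8_t{0};
        }
    }
    return dst;
}

// The two-step form: unmasked shuffle, then a zero-masking byte move under k.
Vec512 shuffle_then_masked_move(std::uint64_t k, const Vec512& src,
                                const Vec512& idx) {
    Vec512 shuffled = masked_pshufb(Vec512{}, ~std::uint64_t{0}, src, idx, false);
    Vec512 dst{};
    for (int j = 0; j < 64; ++j) {
        dst[j] = ((k >> j) & 1u) ? shuffled[j] : std::uint8_t{0};
    }
    return dst;
}
```

In this model the single masked shuffle with merge set to false gives the same bytes as the two-step sequence for any mask.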
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5712:
> 5710: vpsllq(xtmp5, xtmp2, 0x1, vlen_enc);
> 5711: evpcmpb(ktmp2, k0, shuffle, xtmp5, Assembler::lt, true, vlen_enc);
> 5712: kandql(ktmp2, ktmp1, ktmp2);
This can be replaced by:
evpcmpb(ktmp2, ktmp1, shuffle, xtmp5, Assembler::lt, true, vlen_enc);
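The reasoning here is that a compare predicated on an input mask already ANDs that mask into its result, so the separate kandql is redundant. A minimal scalar sketch, assuming a signed less-than compare and using an invented helper name (masked_cmp_lt is not a HotSpot function):

```cpp
#include <array>
#include <cstdint>

using Vec512 = std::array<std::uint8_t, 64>;  // hypothetical 512-bit byte vector

// Scalar model of a mask-predicated signed byte compare (less-than).
// Result bit j is set only when bit j of kmask is set and a[j] < b[j].
std::uint64_t masked_cmp_lt(std::uint64_t kmask, const Vec512& a,
                            const Vec512& b) {
    std::uint64_t k = 0;
    for (int j = 0; j < 64; ++j) {
        bool lt = static_cast<std::int8_t>(a[j]) < static_cast<std::int8_t>(b[j]);
        if (lt && ((kmask >> j) & 1u)) {
            k |= std::uint64_t{1} << j;
        }
    }
    return k;
}
```

With an all-ones mask standing in for k0, masked_cmp_lt(all, a, b) & k1 equals masked_cmp_lt(k1, a, b) for any k1, which is the identity the suggestion relies on.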
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5714:
> 5712: kandql(ktmp2, ktmp1, ktmp2);
> 5713: vpermq(xtmp3, src, 0xEE, vlen_enc);
> 5714: vinserti64x4(xtmp3, xtmp3, xtmp3, 0x1);
The evshuffi64x2 instruction could be used here:
evshuffi64x2(xtmp3, k0, src, src, 0x55, false, vlen_enc);
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5717:
> 5715: vpshufb(xtmp3, xtmp3, shuffle, vlen_enc);
> 5716: evmovdqub(xtmp4, ktmp2, xtmp3, false, vlen_enc);
> 5717: vporq(dst, dst, xtmp4, vlen_enc);
This can be replaced by:
evpshufb(dst, ktmp2, xtmp3, shuffle, true, vlen_enc);
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5724:
> 5722: vpaddb(xtmp5, xtmp2, xtmp5, vlen_enc);
> 5723: evpcmpb(ktmp2, k0, shuffle, xtmp5, Assembler::lt, true, vlen_enc);
> 5724: kandql(ktmp2, ktmp1 , ktmp2);
This can be replaced by:
evpcmpb(ktmp2, ktmp1, shuffle, xtmp5, Assembler::lt, true, vlen_enc);
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5727:
> 5725: vpermq(xtmp3, src, 0x44, vlen_enc);
> 5726: vextracti64x4_high(xtmp3, xtmp3);
> 5727: vinserti64x4(xtmp3, xtmp3, xtmp3, 0x1);
The evshuffi64x2 instruction could be used here:
evshuffi64x2(xtmp3, k0, src, src, 0xaa, false, vlen_enc);
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5730:
> 5728: vpshufb(xtmp3, xtmp3, shuffle, vlen_enc);
> 5729: evmovdqub(xtmp4, ktmp2, xtmp3, false, vlen_enc);
> 5730: vporq(dst, dst, xtmp4, vlen_enc);
This can be replaced by:
evpshufb(dst, ktmp2, xtmp3, shuffle, true, vlen_enc);
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5737:
> 5735: vpsllq(xtmp5, xtmp2, 0x2, vlen_enc);
> 5736: evpcmpb(ktmp2, k0, shuffle, xtmp5, Assembler::lt, true, vlen_enc);
> 5737: kandql(ktmp2, ktmp1 , ktmp2);
This can be replaced by:
evpcmpb(ktmp2, ktmp1, shuffle, xtmp5, Assembler::lt, true, vlen_enc);
You could also use ktmp1 as the destination in the above instruction and in its subsequent use, thereby removing the ktmp2 usage altogether.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5740:
> 5738: vpermq(xtmp3, src, 0xEE, vlen_enc);
> 5739: vextracti64x4_high(xtmp3, xtmp3);
> 5740: vinserti64x4(xtmp3, xtmp3, xtmp3, 0x1);
The evshuffi64x2 instruction could be used here:
evshuffi64x2(xtmp3, k0, src, src, 0xff, false, vlen_enc);
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5743:
> 5741: vpshufb(xtmp3, xtmp3, shuffle, vlen_enc);
> 5742: evmovdqub(xtmp4, ktmp2, xtmp3, false, vlen_enc);
> 5743: vporq(dst, dst, xtmp4, vlen_enc);
This can be replaced by:
evpshufb(dst, ktmp2, xtmp3, shuffle, true, vlen_enc);
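Putting the suggestions together, the overall shape of the sequence can be checked with a scalar model. The sketch below is only my reading of the approach (one pass per 128-bit lane: build a mask of indices that fall in that lane, broadcast the lane, then do a masked in-lane shuffle merged into the destination). It assumes every shuffle index is in 0..63; rearrange_bytes_model is a made-up name, not the function in c2_MacroAssembler_x86.cpp.

```cpp
#include <array>
#include <cstdint>

using Vec512 = std::array<std::uint8_t, 64>;  // hypothetical 512-bit byte vector

// Scalar model, not HotSpot code: rearranges 64 bytes using only lane
// broadcasts and masked in-lane shuffles. Assumes idx[j] is in 0..63.
Vec512 rearrange_bytes_model(const Vec512& src, const Vec512& idx) {
    Vec512 dst{};  // starts zeroed, so merging in pass 0 matches zero-masking
    for (int lane = 0; lane < 4; ++lane) {
        // Step 1: mask of positions whose index points into this lane.
        std::uint64_t k = 0;
        for (int j = 0; j < 64; ++j) {
            if (idx[j] / 16 == lane) k |= std::uint64_t{1} << j;
        }
        // Step 2: broadcast this 128-bit lane of src to all four lanes.
        Vec512 bcast{};
        for (int j = 0; j < 64; ++j) bcast[j] = src[16 * lane + (j % 16)];
        // Step 3: masked in-lane shuffle, merging into dst. The low four
        // index bits pick the byte inside the (now identical) lanes.
        for (int j = 0; j < 64; ++j) {
            if ((k >> j) & 1u) dst[j] = bcast[(j & ~15) + (idx[j] & 0x0F)];
        }
    }
    return dst;
}
```

Because each index lands in exactly one of the four lane ranges, every destination byte is written exactly once, and the result matches the direct definition dst[j] = src[idx[j]].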
-------------
PR: https://git.openjdk.org/jdk/pull/9498