[vectorIntrinsics] RFR: 8284459: Add x86 back-end implementation for LEADING and TRAILING ZEROS COUNT operations [v3]
Sandhya Viswanathan
sviswanathan at openjdk.java.net
Tue Apr 19 03:07:44 UTC 2022
On Fri, 15 Apr 2022 21:44:53 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Summary of changes:
>> - The patch extends the auto-vectorizer to vectorize the following Java SE APIs:
>> 1) Integer.numberOfLeadingZeros()
>> 2) Long.numberOfLeadingZeros()
>> 3) Integer.numberOfTrailingZeros()
>> 4) Long.numberOfTrailingZeros()
>>
>> - Adds an optimized x86 back-end implementation for VectorOperators.LEADING_ZEROS_COUNT and VectorOperators.TRAILING_ZEROS_COUNT for AVX512 and legacy targets.
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> 8284459: Adding auto-vectorizer and x86 backend support for TRAILING_ZERO_COUNT, also some code re-organization.
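For context, a minimal sketch of the loop shape the summary above refers to. The class and method names are hypothetical and do not come from the patch; whether C2 actually vectorizes a given loop depends on the target CPU and JVM flags.

```java
// Hypothetical illustration: simple counted loops over arrays that call
// the four Java SE methods listed in the PR summary.
final class ZeroCountLoops {
    // int lanes: Integer.numberOfLeadingZeros applied element-wise
    static void leadingZeros(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Integer.numberOfLeadingZeros(src[i]);
        }
    }

    // int lanes: Integer.numberOfTrailingZeros applied element-wise
    static void trailingZeros(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Integer.numberOfTrailingZeros(src[i]);
        }
    }

    // long lanes: the int result is widened to long on the store
    static void leadingZeros(long[] src, long[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Long.numberOfLeadingZeros(src[i]);
        }
    }

    static void trailingZeros(long[] src, long[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Long.numberOfTrailingZeros(src[i]);
        }
    }

    public static void main(String[] args) {
        int[] in = {0, 1, -1, 0x00010000};
        int[] out = new int[in.length];
        leadingZeros(in, out);
        System.out.println(java.util.Arrays.toString(out));
    }
}
```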
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4750:
> 4748: break;
> 4749: case T_INT:
> 4750: evplzcntd(dst, ktmp, src, merge, vec_enc);
The ktmp here should be k0. Either add an assert here or pass k0 explicitly.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4752:
> 4750: evplzcntd(dst, ktmp, src, merge, vec_enc);
> 4751: break;
> 4752: case T_SHORT:
Need an assert to verify that xtmp2 is not xnoreg here.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4757:
> 4755: evplzcntd(xtmp2, k0, xtmp2, merge, vec_enc);
> 4756: vpunpckhwd(dst, xtmp1, src, vec_enc);
> 4757: evplzcntd(dst, k0, dst, merge, vec_enc);
ktmp and k0 usage is mixed in this function. It is possible to simplify and always use k0 (meaning no mask) in vector_count_leading_zeros_evex.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4769:
> 4767: evmovdquq(xtmp1, ExternalAddress(StubRoutines::x86::vector_count_leading_zeros_lut()), vec_enc, rtmp);
> 4768: movl(rtmp, 0x0F0F0F0F);
> 4769: evpbroadcastd(dst, rtmp, vec_enc);
Use the new vpbroadcast() function here.
Also add an assert to verify that rtmp is not noreg and that xtmp2 and xtmp3 are not xnoreg.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4777:
> 4775: vpxor(xtmp1, xtmp1, xtmp1, vec_enc);
> 4776: evpcmpeqb(ktmp, xtmp1, xtmp3, vec_enc);
> 4777: evpaddb(dst, ktmp, dst, xtmp2, true, vec_enc);
It is possible to do this without needing xtmp3:
// Nibble clz table in xtmp1
evmovdquq(xtmp1, ExternalAddress(StubRoutines::x86::vector_count_leading_zeros_lut()), vec_enc, rtmp);
// Nibble mask in xtmp2
movl(rtmp, 0x0F0F0F0F);
evpbroadcastd(xtmp2, rtmp, vec_enc);
// Get upper nibble in low 4 bits of dst
vpsrlw(dst, src, 4, vec_enc);
vpand(dst, dst, xtmp2, vec_enc);
// Get clz of upper nibble into dst using table in xtmp1
vpshufb(dst, xtmp1, dst, vec_enc);
// Get lower nibble in low 4 bits of xtmp2 overwriting the nibble mask
vpand(xtmp2, xtmp2, src, vec_enc);
// Get clz of lower nibble in xtmp2 using the table in xtmp1
vpshufb(xtmp2, xtmp1, xtmp2, vec_enc);
// Broadcast clz(0) into all lanes of xtmp1; note the lowest byte of the table in xtmp1 holds clz(0)
evpbroadcastb(xtmp1, xtmp1, xtmp1, vec_enc);
// Check if the clz of upper nibble in dst indicates that the upper nibble was all zero
evpcmpeqb(ktmp, xtmp1, dst, vec_enc);
// if upper nibble was zero add the clz of lower nibble to dst
evpaddb(dst, ktmp, dst, xtmp2, true, vec_enc);
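The per-byte logic of the sequence above can be modeled in scalar Java as a sanity check. This is only a sketch: the class name is hypothetical, and the table contents are an assumption (the clz of each 4-bit value, with entry 0 holding 4), not a copy of vector_count_leading_zeros_lut.

```java
// Hypothetical scalar model of the nibble-table clz for one byte lane.
final class NibbleClzModel {
    // Assumed table: leading-zero count of each 4-bit value; index 0 holds clz(0) = 4.
    private static final int[] NIBBLE_CLZ =
        {4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0};

    static int clzByte(byte b) {
        int v = b & 0xFF;
        // Upper nibble: shift right by 4, then mask (models vpsrlw + vpand;
        // the mask also drops bits shifted in from the neighbouring byte of the word).
        int hi = (v >>> 4) & 0x0F;
        // Lower nibble: mask only (models the second vpand).
        int lo = v & 0x0F;
        int dst = NIBBLE_CLZ[hi];       // models the first vpshufb
        int loClz = NIBBLE_CLZ[lo];     // models the second vpshufb
        int clzOfZero = NIBBLE_CLZ[0];  // models broadcasting the lowest table byte
        // Only a zero upper nibble yields clz(0), so the compare doubles as
        // an "upper nibble is zero" test (models evpcmpeqb + masked evpaddb).
        if (dst == clzOfZero) {
            dst += loClz;
        }
        return dst;
    }

    public static void main(String[] args) {
        for (int v = 0; v < 256; v++) {
            int expected = Integer.numberOfLeadingZeros(v) - 24;
            if (clzByte((byte) v) != expected) {
                throw new AssertionError("mismatch at " + v);
            }
        }
        System.out.println("all 256 byte values match");
    }
}
```

The compare against clz(0) works because, under the assumed table, index 0 is the only entry that holds 4, which is what lets the suggestion drop xtmp3.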
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4964:
> 4962: vpternlogd(xtmp4, 0x40, xtmp4, src, vec_enc);
> 4963: vector_count_leading_zeros_evex(bt, dst, xtmp4, xtmp1, xtmp2, xtmp3, ktmp, rtmp, true, vec_enc);
> 4964: vbroadcast(bt, xtmp4, bcast_value[type2aelembytes(bt) - 1], rtmp, vec_enc);
No need for bcast_value. It is simply 8 * type2aelembytes(bt), i.e. the element width in bits.
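A scalar sketch of the identity this code path appears to rely on: trailing zeros equal the element width in bits minus the leading zeros of (x - 1) & ~x. The class and method names are hypothetical, and it is an assumption that the instructions before line 4962 compute x - 1 and that the vpternlogd then forms (x - 1) & ~x.

```java
// Hypothetical scalar model: tzcnt(x) = bits - clz((x - 1) & ~x).
final class TrailingZerosModel {
    // Element width in bits from element size in bytes: 8, 16, 32 or 64.
    static int elementBits(int elemBytes) {
        return 8 * elemBytes;
    }

    static int tzcntInt(int x) {
        // Sets exactly the bits below the lowest set bit of x (all bits if x == 0).
        int below = (x - 1) & ~x;
        return elementBits(Integer.BYTES) - Integer.numberOfLeadingZeros(below);
    }

    static int tzcntLong(long x) {
        long below = (x - 1) & ~x;
        return elementBits(Long.BYTES) - Long.numberOfLeadingZeros(below);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 8, -1, Integer.MIN_VALUE, 0x00010000};
        for (int x : samples) {
            if (tzcntInt(x) != Integer.numberOfTrailingZeros(x)) {
                throw new AssertionError("mismatch at " + x);
            }
        }
        System.out.println("int samples match Integer.numberOfTrailingZeros");
    }
}
```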
-------------
PR: https://git.openjdk.java.net/panama-vector/pull/189