[vectorIntrinsics] RFR: 8284459: Add x86 back-end implementation for LEADING and TRAILING ZEROS COUNT operations [v3]
Jatin Bhateja
jbhateja at openjdk.java.net
Wed Apr 20 20:56:06 UTC 2022
On Tue, 19 Apr 2022 00:08:13 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>>
>> 8284459: Adding auto-vectorizer and x86 backend support for TRAILING_ZERO_COUNT, also some code re-organization.
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4441:
>
>> 4439: if ((is_LP64 || lane_size < 8) &&
>> 4440: ((is_non_subword_integral_type(bt) && VM_Version::supports_avx512vl()) ||
>> 4441: (is_subword_type(bt) && VM_Version::supports_avx512bw()))) {
>
> The vl check is needed for all lane sizes when vector width < 64 bytes. The check doesn't seem to capture that.
DONE
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4750:
>
>> 4748: break;
>> 4749: case T_INT:
>> 4750: evplzcntd(dst, ktmp, src, merge, vec_enc);
>
> The ktmp here should be k0. An assert here or use explicit k0.
ktmp will be a non-k0 opmask register only for predicated leading zero count over INT/LONG types; match_rule_supported_vector_masked carries the necessary conditions.
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4752:
>
>> 4750: evplzcntd(dst, ktmp, src, merge, vec_enc);
>> 4751: break;
>> 4752: case T_SHORT:
>
> Need an assert to verify that xtmp2 is not xnoreg here.
xnoreg represents a -1 value, which results in an assertion failure due to an illegal register encoding.
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4757:
>
>> 4755: evplzcntd(xtmp2, k0, xtmp2, merge, vec_enc);
>> 4756: vpunpckhwd(dst, xtmp1, src, vec_enc);
>> 4757: evplzcntd(dst, k0, dst, merge, vec_enc);
>
> ktmp and k0 usage is mixed here in this function. It is possible to simplify and use always k0 in vector_count_leading_zeros_evex (meaning no mask).
This is a common routine which gets used for both predicated (in case of INT/LONG) and non-predicated zero counts. Replacing k0 with ktmp and passing k0 from the top-level patterns will add the needed consistency.
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4769:
>
>> 4767: evmovdquq(xtmp1, ExternalAddress(StubRoutines::x86::vector_count_leading_zeros_lut()), vec_enc, rtmp);
>> 4768: movl(rtmp, 0x0F0F0F0F);
>> 4769: evpbroadcastd(dst, rtmp, vec_enc);
>
> Use the new vpbroadcast() function here.
> Also an assert to verify that rtmp is not noreg, xtmp2, xtmp3 is not noreg.
In case of noreg, the assembler will fail with an assertion failure.
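For illustration, a minimal model of why the existing encoding check already catches a noreg argument (hypothetical names and values, not HotSpot's actual Register implementation): noreg/xnoreg are represented by -1, so any attempt to encode them fails the range assert without an extra check at every caller.

```cpp
#include <cassert>

// Hypothetical sketch: xnoreg is represented by -1, so emitting it trips
// the encoding range check inside the assembler.
constexpr int xnoreg_enc = -1;

int encode(int reg_enc) {
    // Mirrors the kind of range assert a real encoder performs.
    assert(reg_enc >= 0 && reg_enc < 32 && "illegal register encoding");
    return reg_enc & 0x1F;
}
```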
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4777:
>
>> 4775: vpxor(xtmp1, xtmp1, xtmp1, vec_enc);
>> 4776: evpcmpeqb(ktmp, xtmp1, xtmp3, vec_enc);
>> 4777: evpaddb(dst, ktmp, dst, xtmp2, true, vec_enc);
>
> It is possible to do this without needing xtmp3:
> // Nibble clz table in xtmp1
> evmovdquq(xtmp1, ExternalAddress(StubRoutines::x86::vector_count_leading_zeros_lut()), vec_enc, rtmp);
> // Nibble mask in xtmp2
> movl(rtmp, 0x0F0F0F0F);
> evpbroadcastd(xtmp2, rtmp, vec_enc);
> // Get upper nibble in low 4 bits of dst
> vpsrlw(dst, src, 4, vec_enc);
> vpand(dst, dst, xtmp2, vec_enc);
> // Get clz of upper nibble into dst using table in xtmp1
> vpshufb(dst, xtmp1, dst, vec_enc);
> // Get lower nibble in low 4 bits of xtmp2 overwriting the nibble mask
> vpand(xtmp2, xtmp2, src, vec_enc);
> // Get clz of lower nibble in xtmp2 using the table in xtmp1
> vpshufb(xtmp2, xtmp1, xtmp2, vec_enc);
> // Broadcast the clz of 0 into all lanes of xtmp1, note the lowest byte had clz of zero in the xtmp1 table
> evpbroadcastb(xtmp1, xtmp1, xtmp1, vec_enc);
> // Check if the clz of upper nibble in dst indicates that the upper nibble was all zero
> evpcmpeqb(ktmp, xtmp1, dst, vec_enc);
> // if upper nibble was zero add the clz of lower nibble to dst
> evpaddb(dst, ktmp, dst, xtmp2, true, vec_enc);
Thanks, Sandhya; you have introduced an additional broadcast instruction to save a temporary. A spill to memory may be costly too, but adding an explicit instruction in anticipation of a spill may hurt performance. Also, we will not be able to further merge this instruction pattern even after removing the additional temporary.
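As a cross-check of the nibble-LUT algorithm both sequences implement, a scalar sketch (names are hypothetical, not the HotSpot code): the table entry for nibble value 0 is 4, so the merge step adds the lower nibble's count exactly when the upper nibble's count is 4, i.e. the upper nibble is all zeros.

```cpp
#include <cstdint>

// lut[n] = number of leading zeros in the 4-bit value n; lut[0] == 4,
// matching the vector_count_leading_zeros_lut byte layout 04 03 02 02 01 ... 00.
static const uint8_t clz_lut[16] = {4, 3, 2, 2, 1, 1, 1, 1,
                                    0, 0, 0, 0, 0, 0, 0, 0};

// Scalar model of the per-byte clz computed by the vector sequence.
uint8_t clz8_model(uint8_t b) {
    uint8_t hi = clz_lut[b >> 4];    // clz of upper nibble (vpsrlw + vpand + vpshufb)
    uint8_t lo = clz_lut[b & 0x0F];  // clz of lower nibble (vpand + vpshufb)
    // Merge: add the lower-nibble count only when the upper nibble was all
    // zero, i.e. its count is 4 (the evpcmpeqb/evpaddb merge step).
    return (hi == 4) ? (uint8_t)(hi + lo) : hi;
}
```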
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4803:
>
>> 4801: vpcmpeqb(xtmp3, xtmp1, xtmp3, vec_enc);
>> 4802: vpaddb(dst, dst, xtmp2, vec_enc);
>> 4803: vpblendvb(dst, xtmp2, dst, xtmp3, vec_enc);
>
> vector_count_leading_zeros_byte_avx can be implemented without xmpt3 as follows:
> // Nibble clz table in xtmp1
> vmovdqu(xtmp1, ExternalAddress(StubRoutines::x86::vector_count_leading_zeros_lut()), rtmp);
>
> // Nibble mask in xtmp2
> vbroadcast(T_INT, xtmp2, 0x0F0F0F0F, rtmp, vec_enc);
>
> // dst = Compute leading zero counts of 4 MSB bits of each byte by
> // accessing the lookup table
> vpsrlw(dst, src, 4, vec_enc);
> vpand(dst, dst, xtmp2, vec_enc);
> vpshufb(dst, xtmp1, dst, vec_enc);
>
> // xtmp2 = Compute leading zero counts of 4 LSB bits of each byte by
> // accessing the lookup table.
> vpand(xtmp2, xtmp2, src, vec_enc);
> vpshufb(xtmp2, xtmp1, xtmp2, vec_enc);
>
> // Add xtmp2 to dst if 4 MSB bits of byte are all zeros i.e. if the dst had clz of 0
> vpbroadcastb(xtmp1, xtmp1, vec_enc);
> vpcmpeqb(xtmp1, xtmp1, dst, vec_enc);
> vpaddb(xtmp2, xtmp2, dst, vec_enc);
> vpblendvb(dst, dst, xtmp2, xtmp1, vec_enc);
Thanks, Sandhya; you have introduced an additional broadcast instruction to save a temporary. A spill to memory may be costly too, but adding an explicit instruction in anticipation of a spill may hurt performance. Also, we will not be able to further merge this instruction pattern even after removing the additional temporary.
// Add xtmp2 to dst if 4 MSB bits of byte are all zeros i.e. if the dst had clz of 0
We add the LZ count of the lower nibble when the 4 MSB bits are all zeros, i.e. when their clz is 4.
> src/hotspot/cpu/x86/stubGenerator_x86_32.cpp line 610:
>
>> 608: __ emit_data(0x01010101, relocInfo::none, 0);
>> 609: __ emit_data(0x00000000, relocInfo::none, 0);
>> 610: __ emit_data(0x00000000, relocInfo::none, 0);
>
> could be done with a 4 iteration for loop over the following:
> __ emit_data(0x02020304, relocInfo::none, 0);
> __ emit_data(0x01010101, relocInfo::none, 0);
> __ emit_data(0x00000000, relocInfo::none, 0);
> __ emit_data(0x00000000, relocInfo::none, 0);
PSHUFB operates at a lane level (16 bytes); thus, in order to perform the lookup for the upper lanes, we either load 32/64 bytes from memory or replicate the lower lane across the vector. I chose the first approach to be consistent with other LUT-based implementations.
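A minimal scalar model of PSHUFB's lane-local behavior (function name is hypothetical): each destination byte can only select from the 16-byte table lane it lives in, which is why the 16-byte LUT must appear in every lane, either by emitting a full 32/64-byte table or by replicating the lower lane.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical scalar model of (v)pshufb: each destination byte selects a
// byte from the table, but only from within its own 16-byte lane.
void pshufb_model(uint8_t* dst, const uint8_t* table,
                  const uint8_t* idx, size_t vlen) {
    for (size_t i = 0; i < vlen; i++) {
        size_t lane_base = (i / 16) * 16;  // start of this byte's 16-byte lane
        if (idx[i] & 0x80) {
            dst[i] = 0;                    // MSB of index set: zero the byte
        } else {
            dst[i] = table[lane_base + (idx[i] & 0x0F)];
        }
    }
}
```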
> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 809:
>
>> 807: __ emit_data64(0x0000000000000000, relocInfo::none);
>> 808: __ emit_data64(0x0101010102020304, relocInfo::none);
>> 809: __ emit_data64(0x0000000000000000, relocInfo::none);
>
> could be done with a 4 iteration for loop over the following:
> __ emit_data64(0x0101010102020304, relocInfo::none);
> __ emit_data64(0x0000000000000000, relocInfo::none);
Same as above.
> src/hotspot/cpu/x86/x86.ad line 8687:
>
>> 8685: %}
>> 8686:
>> 8687: instruct vpopcount_evx_reg(vec dst, vec src, vec xtmp1, vec xtmp2, rRegP rtmp) %{
>
> Typo, did you mean vpopcount_avx_reg here?
Corrected.
> src/hotspot/cpu/x86/x86.ad line 8696:
>
>> 8694: int opcode = this->ideal_Opcode();
>> 8695: int vlen_enc = vector_length_encoding(this, $src);
>> 8696: BasicType bt = Matcher::vector_element_basic_type(this);
>
> We should be checking the vector_element_basic_type of src instead of this. e.g. for Long vectors producing Int results, we need to pass bt as T_LONG to vector_popcount_integral.
Corrected; in the other patterns I am extracting the basic type from src.
-------------
PR: https://git.openjdk.java.net/panama-vector/pull/189
More information about the panama-dev mailing list