[vectorIntrinsics] RFR: 8284459: Add x86 back-end implementation for LEADING and TRAILING ZEROS COUNT operations [v3]
Jatin Bhateja
jbhateja at openjdk.java.net
Wed Apr 20 20:56:06 UTC 2022
On Tue, 19 Apr 2022 00:08:13 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:
>> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>>
>> 8284459: Adding auto-vectorizer and x86 backend support for TRAILING_ZERO_COUNT, also some code re-organization.
>
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4441:
>
>> 4439: if ((is_LP64 || lane_size < 8) &&
>> 4440: ((is_non_subword_integral_type(bt) && VM_Version::supports_avx512vl()) ||
>> 4441: (is_subword_type(bt) && VM_Version::supports_avx512bw()))) {
>
> The vl check is needed for all lane sizes when vector width < 64 bytes. The check doesn't seem to capture that.
DONE
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4750:
>
>> 4748: break;
>> 4749: case T_INT:
>> 4750: evplzcntd(dst, ktmp, src, merge, vec_enc);
>
> The ktmp here should be k0. An assert here or use explicit k0.
ktmp will be a non-k0 opmask register only for predicated leading zero count over INT/LONG types; match_rule_supported_vector_masked carries the necessary conditions.
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4752:
>
>> 4750: evplzcntd(dst, ktmp, src, merge, vec_enc);
>> 4751: break;
>> 4752: case T_SHORT:
>
> Need an assert to verify that xtmp2 is not xnoreg here.
xnoreg represents a -1 value, which results in an assertion failure due to an illegal register encoding.
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4757:
>
>> 4755: evplzcntd(xtmp2, k0, xtmp2, merge, vec_enc);
>> 4756: vpunpckhwd(dst, xtmp1, src, vec_enc);
>> 4757: evplzcntd(dst, k0, dst, merge, vec_enc);
>
> ktmp and k0 usage is mixed here in this function. It is possible to simplify and use always k0 in vector_count_leading_zeros_evex (meaning no mask).
This is a common routine which gets used for both predicated (in case of INT/LONG) and non-predicated zero counts. Replacing k0 with ktmp and passing k0 from the top-level patterns will add the needed consistency.
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4769:
>
>> 4767: evmovdquq(xtmp1, ExternalAddress(StubRoutines::x86::vector_count_leading_zeros_lut()), vec_enc, rtmp);
>> 4768: movl(rtmp, 0x0F0F0F0F);
>> 4769: evpbroadcastd(dst, rtmp, vec_enc);
>
> Use the new vpbroadcast() function here.
> Also an assert to verify that rtmp is not noreg, xtmp2, xtmp3 is not noreg.
In case of noreg, the assembler will fail with an assertion failure.
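For illustration, a minimal model of why the existing encoding check already catches a noreg argument (hypothetical names and values, not HotSpot's actual Register implementation): noreg/xnoreg are represented by -1, so any attempt to encode them fails the range assert without an extra check at every caller.

```cpp
#include <cassert>

// Hypothetical sketch: xnoreg is represented by -1, so emitting it trips
// the encoding range check inside the assembler.
constexpr int xnoreg_enc = -1;

int encode(int reg_enc) {
    // Mirrors the kind of range assert a real encoder performs.
    assert(reg_enc >= 0 && reg_enc < 32 && "illegal register encoding");
    return reg_enc & 0x1F;
}
```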
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4777:
>
>> 4775: vpxor(xtmp1, xtmp1, xtmp1, vec_enc);
>> 4776: evpcmpeqb(ktmp, xtmp1, xtmp3, vec_enc);
>> 4777: evpaddb(dst, ktmp, dst, xtmp2, true, vec_enc);
>
> It is possible to do this without needing xtmp3:
> // Nibble clz table in xtmp1
> evmovdquq(xtmp1, ExternalAddress(StubRoutines::x86::vector_count_leading_zeros_lut()), vec_enc, rtmp);
> // Nibble mask in xtmp2
> movl(rtmp, 0x0F0F0F0F);
> evpbroadcastd(xtmp2, rtmp, vec_enc);
> // Get upper nibble in low 4 bits of dst
> vpsrlw(dst, src, 4, vec_enc);
> vpand(dst, dst, xtmp2, vec_enc);
> // Get clz of upper nibble into dst using table in xtmp1
> vpshufb(dst, xtmp1, dst, vec_enc);
> // Get lower nibble in low 4 bits of xtmp2 overwriting the nibble mask
> vpand(xtmp2, xtmp2, src, vec_enc);
> // Get clz of lower nibble in xtmp2 using the table in xtmp1
> vpshufb(xtmp2, xtmp1, xtmp2, vec_enc);
> // Broadcast the clz of 0 into all lanes of xtmp1, note the lowest byte had clz of zero in the xtmp1 table
> evpbroadcastb(xtmp1, xtmp1, xtmp1, vec_enc);
> // Check if the clz of upper nibble in dst indicates that the upper nibble was all zero
> evpcmpeqb(ktmp, xtmp1, dst, vec_enc);
> // if upper nibble was zero add the clz of lower nibble to dst
> evpaddb(dst, ktmp, dst, xtmp2, true, vec_enc);
Thanks, Sandhya; you have introduced an additional broadcast instruction to save a temporary. A spill to memory may be costly too, but adding an explicit instruction in anticipation of a spill may hurt performance. Also, we will not be able to further merge this instruction pattern even after removing the additional temporary.
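As a cross-check of the nibble-LUT algorithm both sequences implement, a scalar sketch (names are hypothetical, not the HotSpot code): the table entry for nibble value 0 is 4, so the merge step adds the lower nibble's count exactly when the upper nibble's count is 4, i.e. the upper nibble is all zeros.

```cpp
#include <cstdint>

// lut[n] = number of leading zeros in the 4-bit value n; lut[0] == 4,
// matching the vector_count_leading_zeros_lut byte layout 04 03 02 02 01 ... 00.
static const uint8_t clz_lut[16] = {4, 3, 2, 2, 1, 1, 1, 1,
                                    0, 0, 0, 0, 0, 0, 0, 0};

// Scalar model of the per-byte clz computed by the vector sequence.
uint8_t clz8_model(uint8_t b) {
    uint8_t hi = clz_lut[b >> 4];    // clz of upper nibble (vpsrlw + vpand + vpshufb)
    uint8_t lo = clz_lut[b & 0x0F];  // clz of lower nibble (vpand + vpshufb)
    // Merge: add the lower-nibble count only when the upper nibble was all
    // zero, i.e. its count is 4 (the evpcmpeqb/evpaddb merge step).
    return (hi == 4) ? (uint8_t)(hi + lo) : hi;
}
```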
> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4803:
>
>> 4801: vpcmpeqb(xtmp3, xtmp1, xtmp3, vec_enc);
>> 4802: vpaddb(dst, dst, xtmp2, vec_enc);
>> 4803: vpblendvb(dst, xtmp2, dst, xtmp3, vec_enc);
>
> vector_count_leading_zeros_byte_avx can be implemented without xmpt3 as follows:
> // Nibble clz table in xtmp1
> vmovdqu(xtmp1, ExternalAddress(StubRoutines::x86::vector_count_leading_zeros_lut()), rtmp);
>
> // Nibble mask in xtmp2
> vbroadcast(T_INT, xtmp2, 0x0F0F0F0F, rtmp, vec_enc);
>
> // dst = Compute leading zero counts of 4 MSB bits of each byte by
> // accessing the lookup table
> vpsrlw(dst, src, 4, vec_enc);
> vpand(dst, dst, xtmp2, vec_enc);
> vpshufb(dst, xtmp1, dst, vec_enc);
>
> // xtmp2 = Compute leading zero counts of 4 LSB bits of each byte by
> // accessing the lookup table.
> vpand(xtmp2, xtmp2, src, vec_enc);
> vpshufb(xtmp2, xtmp1, xtmp2, vec_enc);
>
> // Add xtmp2 to dst if 4 MSB bits of byte are all zeros i.e. if the dst had clz of 0
> vpbroadcastb(xtmp1, xtmp1, vec_enc);
> vpcmpeqb(xtmp1, xtmp1, dst, vec_enc);
> vpaddb(xtmp2, xtmp2, dst, vec_enc);
> vpblendvb(dst, dst, xtmp2, xtmp1, vec_enc);
Thanks, Sandhya; you have introduced an additional broadcast instruction to save a temporary. A spill to memory may be costly too, but adding an explicit instruction in anticipation of a spill may hurt performance. Also, we will not be able to further merge this instruction pattern even after removing the additional temporary.
// Add xtmp2 to dst if 4 MSB bits of byte are all zeros i.e. if the dst had clz of 0
We add the LZ count of the lower nibble when the 4 MSB bits are all zeros, i.e. when their clz is 4.
> src/hotspot/cpu/x86/stubGenerator_x86_32.cpp line 610:
>
>> 608: __ emit_data(0x01010101, relocInfo::none, 0);
>> 609: __ emit_data(0x00000000, relocInfo::none, 0);
>> 610: __ emit_data(0x00000000, relocInfo::none, 0);
>
> could be done with a 4 iteration for loop over the following:
> __ emit_data(0x02020304, relocInfo::none, 0);
> __ emit_data(0x01010101, relocInfo::none, 0);
> __ emit_data(0x00000000, relocInfo::none, 0);
> __ emit_data(0x00000000, relocInfo::none, 0);
PSHUFB operates at a lane level (16 bytes); thus, in order to perform the lookup for the upper lanes, we either load 32/64 bytes from memory or replicate the lower lane across the vector. I chose the first approach to be consistent with other LUT-based implementations.
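A minimal scalar model of PSHUFB's lane-local behavior (function name is hypothetical): each destination byte can only select from the 16-byte table lane it lives in, which is why the 16-byte LUT must appear in every lane, either by emitting a full 32/64-byte table or by replicating the lower lane.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical scalar model of (v)pshufb: each destination byte selects a
// byte from the table, but only from within its own 16-byte lane.
void pshufb_model(uint8_t* dst, const uint8_t* table,
                  const uint8_t* idx, size_t vlen) {
    for (size_t i = 0; i < vlen; i++) {
        size_t lane_base = (i / 16) * 16;  // start of this byte's 16-byte lane
        if (idx[i] & 0x80) {
            dst[i] = 0;                    // MSB of index set: zero the byte
        } else {
            dst[i] = table[lane_base + (idx[i] & 0x0F)];
        }
    }
}
```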
> src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 809:
>
>> 807: __ emit_data64(0x0000000000000000, relocInfo::none);
>> 808: __ emit_data64(0x0101010102020304, relocInfo::none);
>> 809: __ emit_data64(0x0000000000000000, relocInfo::none);
>
> could be done with a 4 iteration for loop over the following:
> __ emit_data64(0x0101010102020304, relocInfo::none);
> __ emit_data64(0x0000000000000000, relocInfo::none);
Same as above.
> src/hotspot/cpu/x86/x86.ad line 8687:
>
>> 8685: %}
>> 8686:
>> 8687: instruct vpopcount_evx_reg(vec dst, vec src, vec xtmp1, vec xtmp2, rRegP rtmp) %{
>
> Typo, did you mean vpopcount_avx_reg here?
Corrected.
> src/hotspot/cpu/x86/x86.ad line 8696:
>
>> 8694: int opcode = this->ideal_Opcode();
>> 8695: int vlen_enc = vector_length_encoding(this, $src);
>> 8696: BasicType bt = Matcher::vector_element_basic_type(this);
>
> We should be checking the vector_element_basic_type of src instead of this. e.g. for Long vectors producing Int results, we need to pass bt as T_LONG to vector_popcount_integral.
Corrected; in the other patterns I am extracting the basic type from src.
-------------
PR: https://git.openjdk.java.net/panama-vector/pull/189
More information about the panama-dev mailing list