RFR: 8281375: Accelerate bitCount operation for AVX2 and AVX512 target. [v6]

Tue Mar 1 01:57:09 UTC 2022

On Fri, 25 Feb 2022 01:11:41 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Summary of changes:
>> 
>> - The patch extends the existing vectorized bitCount optimization added with [JDK-8278868](https://bugs.openjdk.java.net/browse/JDK-8278868) and emits an optimized JIT sequence for AVX2 targets and for AVX512 targets that do not support the avx512_vpopcntdq feature.
>> - Since the PopCountVI/PopCountVL nodes emit different instruction sequences depending on the target features, a rudimentary cost model has been added; it influences the SLP unrolling factor to prevent generating bloated main loops.
>> - The following are the performance results of an existing [JMH micro](https://github.com/jatin-bhateja/jdk/blob/master/test/micro/org/openjdk/bench/vm/compiler/VectorBitCount.java) on various x86 targets.
>> 
>> 
>> Benchmark | SIZE | Baseline AVX2 (ns/op) | Withopt AVX2 (ns/op) | Gain % | Baseline AVX3 (ns/op) | Withopt AVX3 (ns/op) | Gain % | Baseline AVX3 with VPOPCNTDQ (ns/op) | Withopt AVX3 with VPOPCNTDQ (ns/op) | Gain %
>> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
>> VectorBitCount.WithSuperword.intBitCount | 1024 | 1089.799 | 420.156 | 159.3796114 | 1083.92 | 203.958 | 431.442748 | 88.958 | 60.096 | 48.02649095
>> VectorBitCount.WithSuperword.longBitCount | 1024 | 417.458 | 413.859 | 0.869619846 | 417.203 | 214.949 | 94.09394787 | 105.954 | 117.019 | -9.455729411
>> 
>> Please review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8281375: Fix a typo.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4363:

> 4361:     movl(rtmp, 0x0F0F0F0F);
> 4362:     evpbroadcastd(xtmp1, rtmp, vec_enc);
> 4363:     evmovdqul(xtmp2, k0, ExternalAddress(StubRoutines::x86::vector_popcount_lut()), true, vec_enc, rtmp);

In general, merge can be set to false for all the instructions in this algorithm, since the mask is always all-true.
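
To illustrate why the merge flag is immaterial here, the following is a minimal scalar sketch of a masked vector move. The class and method names are hypothetical and this is not HotSpot code; it only models the general merge-versus-zeroing semantics, assuming one int per vector element.

```java
import java.util.Arrays;

// Hypothetical scalar model of a masked vector move (not HotSpot code).
final class MaskedMoveModel {
    // For each element: a set mask bit takes the value from src; a clear mask bit
    // either keeps the old dst value (merge) or writes zero (no merge).
    static int[] maskedMove(int[] dst, int[] src, boolean[] mask, boolean merge) {
        int[] out = new int[dst.length];
        for (int i = 0; i < dst.length; i++) {
            if (mask[i]) {
                out[i] = src[i];
            } else {
                out[i] = merge ? dst[i] : 0;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] dst = {9, 9, 9, 9};
        int[] src = {1, 2, 3, 4};
        boolean[] allTrue = {true, true, true, true};
        // With an all-true mask, both settings of merge give the same result.
        System.out.println(Arrays.toString(maskedMove(dst, src, allTrue, true)));
        System.out.println(Arrays.toString(maskedMove(dst, src, allTrue, false)));
    }
}
```

With an all-true mask no element ever takes the "mask bit clear" branch, so the value of merge cannot affect the result.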

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4375:

> 4373:     evpunpckldq(xtmp2, k0, xtmp3, xtmp1, true, vec_enc);
> 4374:     evpsadbw(xtmp2, k0, xtmp2, xtmp1, true, vec_enc);
> 4375:     vpackuswb(dst, xtmp2, dst, vec_enc);

This doesn't look correct for, say, a 512-bit vector length. At this point, xtmp2 holds the 64-bit popcount results for the lower 8 integers and dst holds the 64-bit popcount results for the upper 8 integers. vpackuswb interleaves xtmp2 and dst within each 128-bit lane, so the result is not in the correct order.
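
The concern can be sketched with a small scalar model of a 512-bit vpackuswb. The model below is hypothetical (names and layout are illustrative, not HotSpot code). It assumes the input layout described above: the first source stands in for xtmp2 with values 0..7, the second for dst with values 8..15, one value per 64-bit element.

```java
import java.util.Arrays;

// Hypothetical scalar model of 512-bit vpackuswb (not HotSpot code).
final class PackLaneModel {
    // Saturate a signed 16-bit word to an unsigned byte.
    static int saturate(short w) {
        return w < 0 ? 0 : (w > 255 ? 255 : w);
    }

    // a and b each hold 32 words (512 bits). Within each 128-bit lane, the low
    // 8 result bytes come from a's 8 words and the high 8 bytes from b's 8 words.
    static int[] packuswb512(short[] a, short[] b) {
        int[] out = new int[64];
        for (int lane = 0; lane < 4; lane++) {
            for (int i = 0; i < 8; i++) {
                out[lane * 16 + i] = saturate(a[lane * 8 + i]);
                out[lane * 16 + 8 + i] = saturate(b[lane * 8 + i]);
            }
        }
        return out;
    }

    // Place eight small values in the low word of consecutive 64-bit elements.
    static short[] qwords(int first) {
        short[] v = new short[32];
        for (int q = 0; q < 8; q++) {
            v[q * 4] = (short) (first + q);
        }
        return v;
    }

    // Read the packed bytes back as sixteen 32-bit elements (low byte only is non-zero here).
    static int[] asDwords(int[] bytes) {
        int[] d = new int[16];
        for (int j = 0; j < 16; j++) {
            d[j] = bytes[j * 4];
        }
        return d;
    }

    public static void main(String[] args) {
        int[] order = asDwords(packuswb512(qwords(0), qwords(8)));
        System.out.println(Arrays.toString(order));
    }
}
```

Under that assumed input layout, the model yields the element order 0, 1, 8, 9, 2, 3, 10, 11, 4, 5, 12, 13, 6, 7, 14, 15 rather than 0 through 15, because the pack operates independently on each 128-bit lane.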

src/hotspot/cpu/x86/x86.ad line 1871:

> 1869:       if ((vlen == 16) && !VM_Version::supports_avx512vlbw()) {
> 1870:         return false;
> 1871:       }

This restriction was not there before this patch. The extra check should apply only when supports_avx512_vpopcntdq() is false.

src/hotspot/cpu/x86/x86.ad line 1876:

> 1874:       if ((vlen <= 4) || ((vlen == 8) && !VM_Version::supports_avx512vlbw())) {
> 1875:         return false;
> 1876:       }

This restriction was not there before this patch. The extra check should apply only when supports_avx512_vpopcntdq() is false.
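
One way to read the two comments on x86.ad is sketched below as a standalone model. The class, the method names, and the boolean parameters standing in for the VM_Version queries are all hypothetical, and any other conditions in the real predicate are deliberately left out.

```java
// Hypothetical model of the suggested guard structure (not the actual x86.ad code).
final class PopCountSupportModel {
    // Int case: the avx512vlbw requirement on 16-element vectors applies
    // only when the vpopcntdq feature is unavailable.
    static boolean intCaseAllowed(int vlen, boolean hasVpopcntdq, boolean hasAvx512vlbw) {
        if (!hasVpopcntdq && (vlen == 16) && !hasAvx512vlbw) {
            return false;
        }
        return true; // other checks of the real predicate are omitted here
    }

    // Long case: the vlen and avx512vlbw restrictions likewise apply
    // only when the vpopcntdq feature is unavailable.
    static boolean longCaseAllowed(int vlen, boolean hasVpopcntdq, boolean hasAvx512vlbw) {
        if (!hasVpopcntdq && ((vlen <= 4) || ((vlen == 8) && !hasAvx512vlbw))) {
            return false;
        }
        return true; // other checks of the real predicate are omitted here
    }

    public static void main(String[] args) {
        System.out.println(intCaseAllowed(16, true, false));
        System.out.println(intCaseAllowed(16, false, false));
        System.out.println(longCaseAllowed(4, true, false));
        System.out.println(longCaseAllowed(4, false, true));
    }
}
```

In this reading, targets with vpopcntdq are not affected by the new restrictions, and targets without it get the checks introduced by the patch.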

-------------

PR: https://git.openjdk.java.net/jdk/pull/7373