RFR: 8281375: Accelerate bitCount operation for AVX2 and AVX512 target. [v7]

Sandhya Viswanathan sviswanathan at openjdk.java.net
Fri Mar 4 00:17:09 UTC 2022


On Tue, 1 Mar 2022 17:08:51 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Summary of changes:
>> 
>> - Patch extends the existing vectorized bitCount optimization added with [JDK-8278868](https://bugs.openjdk.java.net/browse/JDK-8278868) and emits an optimized JIT sequence for AVX2 and for AVX512 targets that do not support the avx512_vpopcntdq feature.
>> - Since the PopCountVI/PopCountVL nodes emit different instruction sequences depending on the target features, a rudimentary cost model has been added; it influences the SLP unrolling factor to prevent generating bloated main loops.
>> - The following are the performance results of an existing [JMH micro](https://github.com/jatin-bhateja/jdk/blob/master/test/micro/org/openjdk/bench/vm/compiler/VectorBitCount.java) on various x86 targets.
>> 
>> 
>> Benchmark | SIZE | Baseline AVX2 (ns/op) | Withopt AVX2 (ns/op) | Gain % | Baseline AVX3 (ns/op) | Withopt AVX3 (ns/op) | Gain % | Baseline AVX3 (VPOPCNTDQ) (ns/op) | Withopt AVX3 (VPOPCNTDQ) (ns/op) | Gain %
>> -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
>> VectorBitCount.WithSuperword.intBitCount | 1024 | 1089.799 | 420.156 | 159.3796114 | 1083.92 | 203.958 | 431.442748 | 88.958 | 60.096 | 48.02649095
>> VectorBitCount.WithSuperword.longBitCount | 1024 | 417.458 | 413.859 | 0.869619846 | 417.203 | 214.949 | 94.09394787 | 105.954 | 117.019 | -9.455729411
>> 
>> Please review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   8281375: Review comments resolved.

src/hotspot/cpu/x86/assembler_x86.cpp line 8329:

> 8327: void Assembler::vpunpckhdq(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) {
> 8328:   assert(UseAVX > 0, "requires some form of AVX");
> 8329:   InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ true, /* uses_vl */ true);

legacy_mode should be false here.

src/hotspot/cpu/x86/assembler_x86.cpp line 8336:

> 8334: void Assembler::vpunpckldq(XMMRegister dst, XMMRegister nds, XMMRegister src, int vector_len) {
> 8335:   assert(UseAVX > 0, "requires some form of AVX");
> 8336:   InstructionAttr attributes(vector_len, /* vex_w */ false, /* legacy_mode */ _legacy_mode_bw, /* no_mask_reg */ true, /* uses_vl */ true);

legacy_mode should be false here.

src/hotspot/cpu/x86/assembler_x86.cpp line 8341:

> 8339: }
> 8340: 
> 8341: // xmm/mem sourced byte/word/dword/qword replicate

The comment seems to be misplaced.

src/hotspot/cpu/x86/assembler_x86.cpp line 8342:

> 8340: 
> 8341: // xmm/mem sourced byte/word/dword/qword replicate
> 8342: void Assembler::evpsadbw(XMMRegister dst, KRegister mask, XMMRegister nds, XMMRegister src, bool merge, int vector_len) {

The manual does not document a masked version of evpsadbw that takes a k register, so can we use the vpsadbw() defined earlier and remove evpsadbw()?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4360:

> 4358:     vpopcntd(dst, src, vec_enc);
> 4359:   } else if (vec_enc == Assembler::AVX_512bit) {
> 4360:     assert(VM_Version::supports_avx512vlbw(), "");

I think a check for supports_avx512bw() is enough here. Do we need vl?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4374:

> 4372:     evpsadbw(dst, k0, dst, xtmp1, true, vec_enc);
> 4373:     evpunpckldq(xtmp2, k0, xtmp3, xtmp1, true, vec_enc);
> 4374:     evpsadbw(xtmp2, k0, xtmp2, xtmp1, true, vec_enc);

Merge masking can be set to false for the entire algorithm: the mask is always all set, so no merging is needed.
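
To illustrate why the merge flag makes no difference with an all-set mask, here is a minimal scalar sketch of write masking. `Vec` and `masked_add` are hypothetical names made up for this example, not assembler or HotSpot APIs, and the sketch models only the lane-selection rule, not any particular instruction.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a 512-bit vector holding 16 dword elements.
using Vec = std::array<uint32_t, 16>;

// Models a write-masked element-wise add (illustrative only).
// Lanes whose mask bit is set receive a[i] + b[i]. Lanes whose mask bit is
// clear keep the old destination value when merge is true, and are zeroed
// when merge is false.
Vec masked_add(const Vec& old_dst, const Vec& a, const Vec& b,
               uint16_t k, bool merge) {
  Vec r{};
  for (std::size_t i = 0; i < r.size(); i++) {
    if ((k >> i) & 1u) {
      r[i] = a[i] + b[i];
    } else {
      r[i] = merge ? old_dst[i] : 0u;
    }
  }
  return r;
}
```

With `k == 0xFFFF` the else branch is never taken, so `merge == true` and `merge == false` produce identical results; the two only diverge once some mask bit is clear.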

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4393:

> 4391:     vpunpckldq(xtmp2, xtmp3, xtmp1, vec_enc);
> 4392:     vpsadbw(xtmp2, xtmp2, xtmp1, vec_enc);
> 4393:     vpackuswb(dst, xtmp2, dst, vec_enc);

This code is common with the 512-bit version. Since vec_enc selects the proper encoding, the two paths could be merged. That would also remove the need to add the evpunpck* instructions; the k0 mask is implicit.
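
As a rough illustration of why one code path can serve both widths, the scalar sketch below emulates the unpack / psadbw / packuswb tail of the sequence on 128-bit lanes and simply loops over however many lanes the vector has (two for 256-bit, four for 512-bit). All function names are hypothetical, and the per-byte count step (a nibble table lookup) is an assumption about what precedes the quoted lines; this is not the patch's code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One 128-bit lane viewed as 16 little-endian bytes.
using Lane = std::array<uint8_t, 16>;

// Per-byte popcount from two 4-bit table lookups (assumed preceding step).
static uint8_t byte_popcount(uint8_t b) {
  static const uint8_t lut[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
  return static_cast<uint8_t>(lut[b & 0x0F] + lut[b >> 4]);
}

// Models punpckldq (high == false) / punpckhdq (high == true) on one lane:
// interleaves two dwords of a with the matching two dwords of b.
static Lane unpack_dq(const Lane& a, const Lane& b, bool high) {
  Lane r{};
  const std::size_t base = high ? 8 : 0;
  for (std::size_t i = 0; i < 2; i++) {
    for (std::size_t k = 0; k < 4; k++) {
      r[8 * i + k] = a[base + 4 * i + k];
      r[8 * i + 4 + k] = b[base + 4 * i + k];
    }
  }
  return r;
}

// Models psadbw on one lane: per 64-bit half, the sum of absolute byte
// differences, stored as a 16-bit value zero-extended to 64 bits.
static Lane sad_bw(const Lane& a, const Lane& b) {
  Lane r{};
  for (std::size_t q = 0; q < 2; q++) {
    unsigned sum = 0;
    for (std::size_t k = 0; k < 8; k++) {
      int d = int(a[8 * q + k]) - int(b[8 * q + k]);
      sum += static_cast<unsigned>(d < 0 ? -d : d);
    }
    r[8 * q] = static_cast<uint8_t>(sum & 0xFF);
    r[8 * q + 1] = static_cast<uint8_t>(sum >> 8);
  }
  return r;
}

// Models packuswb on one lane: signed words of a, then of b, are narrowed
// to bytes with unsigned saturation.
static Lane packus_wb(const Lane& a, const Lane& b) {
  auto sat = [](const Lane& v, std::size_t w) {
    int16_t x = static_cast<int16_t>(v[2 * w] | (v[2 * w + 1] << 8));
    return static_cast<uint8_t>(x < 0 ? 0 : (x > 255 ? 255 : x));
  };
  Lane r{};
  for (std::size_t w = 0; w < 8; w++) {
    r[w] = sat(a, w);
    r[8 + w] = sat(b, w);
  }
  return r;
}

// Popcount of each dword; src.size() / 4 plays the role of the vector width.
std::vector<uint32_t> popcount_dwords(const std::vector<uint32_t>& src) {
  assert(src.size() % 4 == 0);
  std::vector<uint32_t> out(src.size());
  const Lane zero{};
  for (std::size_t l = 0; l < src.size() / 4; l++) {
    Lane cnt{};
    for (std::size_t i = 0; i < 16; i++) {
      uint8_t byte = static_cast<uint8_t>(src[4 * l + i / 4] >> (8 * (i % 4)));
      cnt[i] = byte_popcount(byte);
    }
    // Spread dwords into qwords, then sum the bytes of each qword against zero.
    Lane hi = sad_bw(unpack_dq(cnt, zero, true), zero);
    Lane lo = sad_bw(unpack_dq(cnt, zero, false), zero);
    Lane packed = packus_wb(lo, hi);  // back to four dword-sized counts
    for (std::size_t i = 0; i < 4; i++) {
      out[4 * l + i] = uint32_t(packed[4 * i]) | (uint32_t(packed[4 * i + 1]) << 8) |
                       (uint32_t(packed[4 * i + 2]) << 16) | (uint32_t(packed[4 * i + 3]) << 24);
    }
  }
  return out;
}
```

Nothing inside the loop body depends on the number of lanes, which mirrors the point that only the encoding passed through vec_enc needs to differ between the 256-bit and 512-bit cases.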

src/hotspot/cpu/x86/x86.ad line 1870:

> 1868:     case Op_PopCountVI:
> 1869:       if (!VM_Version::supports_avx512_vpopcntdq() &&
> 1870:           (vlen == 16) && !VM_Version::supports_avx512vlbw()) {

For vlen == 16, only the supports_avx512bw() check is needed.

src/hotspot/cpu/x86/x86.ad line 1876:

> 1874:     case Op_PopCountVL:
> 1875:       if (!VM_Version::supports_avx512_vpopcntdq() &&
> 1876:           ((vlen <= 4) || ((vlen == 8) && !VM_Version::supports_avx512vlbw()))) {

For long with vlen == 8, only the supports_avx512bw() check is needed.
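
The reasoning behind these checks can be sketched as a small predicate: 16 ints or 8 longs fill a 512-bit vector, and AVX512VL only matters for the 128-bit and 256-bit EVEX forms. `CpuFeatures` and `evex_byte_word_ops_usable` are hypothetical stand-ins for illustration and do not reflect the actual VM_Version interface.

```cpp
#include <cassert>

// Hypothetical feature flags standing in for the VM_Version queries.
struct CpuFeatures {
  bool avx512bw;
  bool avx512vl;
};

// Returns true if EVEX-encoded byte/word instructions are usable at the
// given vector width in bits (128, 256, or 512).
bool evex_byte_word_ops_usable(const CpuFeatures& f, int vector_bits) {
  assert(vector_bits == 128 || vector_bits == 256 || vector_bits == 512);
  if (vector_bits == 512) {
    return f.avx512bw;  // full-width vectors do not depend on VL
  }
  return f.avx512bw && f.avx512vl;  // narrower EVEX forms also need VL
}
```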

-------------

PR: https://git.openjdk.java.net/jdk/pull/7373


More information about the hotspot-compiler-dev mailing list