RFR: 8283232: x86: Improve vector broadcast operations [v12]

Jatin Bhateja jbhateja at openjdk.org
Fri Jul 29 08:27:47 UTC 2022


On Wed, 27 Jul 2022 09:40:45 GMT, Quan Anh Mai <duke at openjdk.org> wrote:

>> Hi,
>> 
>> This patch improves the generation of broadcasting a scalar in several ways:
>> 
>> - As has been pointed out, dumping the whole vector into the constant table is costly in terms of code size; this patch minimises that overhead for vector replication of constants. Options are also available for generating constants with greater alignment so that vector loads can be made efficient without crossing cache lines.
>> - Vector broadcasting should prefer rematerialisation over spilling when register pressure is high.
>> - Vectors are loaded using the same kind (integral vs. floating-point) of instructions as their results, to avoid potential data-bypass delay.
>> 
>> With this patch, the result of the added benchmark, which performs some operations under very high register pressure, on my machine with an Intel i7-7700HQ (AVX2) is as follows:
>> 
>>                                               Before          After
>>     Benchmark                  Mode  Cnt   Score   Error   Score   Error  Units     Gain
>>     SpiltReplicate.testDouble  avgt    5  42.621 ± 0.598  38.771 ± 0.797  ns/op   +9.03%
>>     SpiltReplicate.testFloat   avgt    5  42.245 ± 1.464  38.603 ± 0.367  ns/op   +8.62%
>>     SpiltReplicate.testInt     avgt    5  20.581 ± 5.791  13.755 ± 0.375  ns/op  +33.17%
>>     SpiltReplicate.testLong    avgt    5  17.794 ± 4.781  13.663 ± 0.387  ns/op  +23.22%
>> 
>> As expected, the constant table sizes shrink significantly, from 1024 bytes to 256 bytes for the `long`/`double` cases and to 128 bytes for the `int`/`float` cases.
>> 
>> This patch also removes some redundant code paths and renames some incorrectly named instructions.
>> 
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   unnecessary TEMP dst

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1651:

> 1649:     case 32: vmovdqu(dst, src); break;
> 1650:     case 64: evmovdqul(dst, src, Assembler::AVX_512bit); break;
> 1651:     default: ShouldNotReachHere();

There is no change in this file; maybe you can remove it from the change set.

src/hotspot/cpu/x86/x86.ad line 4141:

> 4139: instruct ReplB_mem(vec dst, memory mem) %{
> 4140:   predicate(VM_Version::supports_avx2());
> 4141:   match(Set dst (ReplicateB (LoadB mem)));

Merge these rules and move the encoding-block logic into a macro-assembler routine.

src/hotspot/cpu/x86/x86.ad line 4159:

> 4157: 
> 4158: instruct vReplS_reg(vec dst, rRegI src) %{
> 4159:   predicate(UseAVX >= 2);

This can be folded with the pattern below by pushing the predicate into the encoding block.

src/hotspot/cpu/x86/x86.ad line 4188:

> 4186:       assert(vlen == 8, "");
> 4187:       __ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister);
> 4188:     }

Please move this into a macro-assembler routine; it will look cleaner that way after merging with the rule above.
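As a sketch of what the merged helper's logic could look like (the function name and the string-based modelling are hypothetical, mirroring the quoted encoding block rather than real MacroAssembler code):

```cpp
#include <string>
#include <vector>

// Hypothetical model of a merged "replicate short" macro-assembler helper:
// returns the mnemonic sequence that would be emitted for a given vector
// length (in elements), with and without AVX2, following the quoted block.
std::vector<std::string> repl_short_sequence(int vlen, bool has_avx2) {
  if (has_avx2) {
    return {"vpbroadcastw"};               // one instruction covers all sizes
  }
  std::vector<std::string> seq = {"pshuflw"}; // broadcast into the low 64 bits
  if (vlen == 8) {                         // 128-bit vector: duplicate low qword
    seq.push_back("punpcklqdq");
  }
  return seq;
}
```

Folding the predicate into such a routine would let the `vReplS_reg` and `ReplS_reg` patterns share one rule, as suggested above.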

src/hotspot/cpu/x86/x86.ad line 4253:

> 4251:     int vlen_enc = vector_length_encoding(this);
> 4252:     if (VM_Version::supports_avx()) {
> 4253:       __ vbroadcastss($dst$$XMMRegister, addr, vlen_enc);

Emitting vbroadcastss for all vector sizes of Replicate[B/S/I] may incur a domain-crossover penalty. It can be limited to replications of <= 16 bytes; above that we can emit VPBROADCASTD.
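The suggested selection amounts to staying in the floating-point domain only for small replications and switching to the integer-domain broadcast for wider vectors. A minimal model of that rule (the function name is hypothetical):

```cpp
#include <string>

// Hypothetical model of the suggested instruction selection: use the
// FP-domain vbroadcastss only for <= 16-byte replications; for wider
// vectors emit the integer-domain vpbroadcastd, so integer consumers of
// the result avoid the FP<->int bypass (domain-crossover) penalty.
std::string pick_broadcast(int vlen_in_bytes) {
  return vlen_in_bytes <= 16 ? "vbroadcastss" : "vpbroadcastd";
}
```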

src/hotspot/cpu/x86/x86.ad line 4261:

> 4259:         __ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister);
> 4260:       }
> 4261:     }

Please move this into a new macro-assembler routine.

src/hotspot/cpu/x86/x86.ad line 4407:

> 4405:         __ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister);
> 4406:       }
> 4407:     }

Please move this into a new macro-assembler routine.

src/hotspot/cpu/x86/x86.ad line 4497:

> 4495:         __ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister);
> 4496:       }
> 4497:     }

Same as above.

src/hotspot/cpu/x86/x86.ad line 4541:

> 4539: instruct ReplD_reg(vec dst, vlRegD src) %{
> 4540:   predicate(UseSSE < 3);
> 4541:   match(Set dst (ReplicateD src));

Pushing the predicates into the encoding block can fold these patterns.

src/hotspot/cpu/x86/x86.ad line 4579:

> 4577:       if (Matcher::vector_length_in_bytes(this) >= 16) {
> 4578:         __ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister);
> 4579:       }

Macro-assembler routine.

src/hotspot/share/opto/machnode.cpp line 478:

> 476:   // Stretching lots of inputs - don't do it.
> 477:   // A MachContant has the last input being the constant base
> 478:   if (req() > (is_MachConstant() ? 3U : 2U)) {

Earlier, some nodes like add/sub/mul/divF_imm that carry three inputs were not getting cloned; with this change we may see them rematerialized before their uses, which may increase code size but will of course reduce interference. With the earlier cap of 2, only the Replicate nodes were passing this check.
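The behavioural difference can be seen by modelling the quoted check directly (a standalone sketch; the function name is hypothetical):

```cpp
// Model of the quoted input cap: a node with more inputs than the cap is
// considered too expensive to rematerialize. A MachConstant node carries
// the constant table base as an extra input, so its cap is one higher
// (3 instead of 2) under the new check.
bool too_many_inputs(unsigned req, bool is_mach_constant) {
  return req > (is_mach_constant ? 3U : 2U);
}
```

Under the old uniform cap of 2, a 3-input MachConstant node (e.g. a Replicate of a constant, or addF_imm) always failed; now it passes, while ordinary 3-input nodes are still rejected.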

-------------

PR: https://git.openjdk.org/jdk/pull/7832
