RFR: 8283232: x86: Improve vector broadcast operations [v8]

Tue Jul 26 05:57:39 UTC 2022

On Sat, 23 Jul 2022 13:18:05 GMT, Quan Anh Mai <duke at openjdk.org> wrote:

>> Hi,
>> 
>> This patch improves the code generated for broadcasting a scalar in several ways:
>> 
>> - As has been pointed out, dumping the whole vector into the constant table is costly in terms of code size. This patch minimises that overhead for vector replicates of constants. Options are also available to generate constants with larger alignment, so that vector loads can be performed efficiently without crossing cache lines.
>> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high.
>> - Load vectors using the same kind of instructions (integral vs floating point) as the results, to avoid a potential data bypass delay.
>> 
>> With this patch, the results of the added benchmark, which performs some operations under very high register pressure, on my machine with an Intel i7-7700HQ (AVX2) are as follows:
>> 
>>                                               Before          After
>>     Benchmark                  Mode  Cnt   Score   Error   Score   Error  Units     Gain
>>     SpiltReplicate.testDouble  avgt    5  42.621 ± 0.598  38.771 ± 0.797  ns/op   +9.03%
>>     SpiltReplicate.testFloat   avgt    5  42.245 ± 1.464  38.603 ± 0.367  ns/op   +8.62%
>>     SpiltReplicate.testInt     avgt    5  20.581 ± 5.791  13.755 ± 0.375  ns/op  +33.17%
>>     SpiltReplicate.testLong    avgt    5  17.794 ± 4.781  13.663 ± 0.387  ns/op  +23.22%
>> 
>> As expected, the constant table sizes shrink significantly, from 1024 bytes to 256 bytes for the `long`/`double` cases and to 128 bytes for the `int`/`float` cases.
>> 
>> This patch also removes some redundant code paths and renames some incorrectly named instructions.
>> 
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 18 commits:
> 
>  - rename
>  - consolidate sse checks
>  - benchmark
>  - fix
>  - Merge branch 'master' into improveReplicate
>  - remove duplicate
>  - unsignness
>  - rematerializing input count
>  - fix comparison
>  - fix rematerialize, constant deduplication
>  - ... and 8 more: https://git.openjdk.org/jdk/compare/0599a05f...6c10f9ad
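
For reference, the constant table sizes quoted in the description are consistent with storing one scalar element per constant instead of a fully replicated vector. The sketch below only reproduces that arithmetic. The count of 32 distinct constants is an assumption (it is not stated in the PR), as is the 32-byte vector width, which is simply what a 256-bit AVX2 vector occupies. The class and method names are made up for illustration.

```java
// Sketch only: the constant count is an assumption chosen to reproduce the
// sizes quoted in the PR description; it is not taken from the patch.
final class ConstantTableSketch {
    // Bytes used when every constant is stored as a fully replicated vector.
    static int fullVectorBytes(int constants, int vectorBytes) {
        return constants * vectorBytes;
    }

    // Bytes used when only one element per constant is stored and the
    // replication is left to a broadcasting load.
    static int scalarBytes(int constants, int elementBytes) {
        return constants * elementBytes;
    }

    public static void main(String[] args) {
        int constants = 32;    // assumed, not stated in the PR
        int vectorBytes = 32;  // one 256-bit (AVX2) vector
        System.out.println(fullVectorBytes(constants, vectorBytes)); // 1024
        System.out.println(scalarBytes(constants, Long.BYTES));      // 256
        System.out.println(scalarBytes(constants, Integer.BYTES));   // 128
    }
}
```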

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1662:

> 1660:       case 64: vmovups(dst, src, Assembler::AVX_512bit); break;
> 1661:       default: ShouldNotReachHere();
> 1662:     }

Vector loads/stores from memory are issued on dedicated ports. Can you elaborate on why this change is beneficial?
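
To make the question concrete, my reading of the idea in the description ("same kind of instructions as that of the results") is roughly the selection below. This is a hypothetical Java sketch, not code from the patch: `LoadKindSketch`, `ElementKind` and `unalignedLoad` are invented names, and only the mnemonics are real x86 instructions. Whether matching the nominal domain matters for a load is exactly what I am asking about.

```java
// Hypothetical sketch: picks an unaligned vector load whose nominal domain
// (integer vs floating point) matches the element kind of the result.
final class LoadKindSketch {
    enum ElementKind { INTEGRAL, FLOAT, DOUBLE }

    static String unalignedLoad(ElementKind kind) {
        switch (kind) {
            case INTEGRAL: return "vmovdqu"; // integer-domain move
            case FLOAT:    return "vmovups"; // packed single-precision move
            case DOUBLE:   return "vmovupd"; // packed double-precision move
            default: throw new IllegalArgumentException("unknown kind: " + kind);
        }
    }

    public static void main(String[] args) {
        for (ElementKind kind : ElementKind.values()) {
            System.out.println(kind + " -> " + unalignedLoad(kind));
        }
    }
}
```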

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 4388:

> 4386: 
> 4387: void MacroAssembler::vallones(XMMRegister dst, int vector_len) {
> 4388:   // vpcmpeqd has special dependency treatment so it should be preferred to vpternlogd

The comment is not clear; adding a relevant reference would add more value.
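
As far as the result goes, both instructions can produce the same all-ones value, so the comment needs to justify the preference on performance grounds. A scalar Java model of the two, written only to illustrate the equivalence, is below. `AllOnesSketch` and its methods are illustrative names; the model assumes `vpcmpeqd` is issued with identical operands and `vpternlogd` with an immediate of 0xFF, and it says nothing about dependency handling in hardware.

```java
// Scalar model of two ways to produce an all-ones vector. Illustrative only.
final class AllOnesSketch {
    // Lane-wise model of vpcmpeqd: equal lanes become all ones, others zero.
    static int[] cmpEq(int[] a, int[] b) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = (a[i] == b[i]) ? -1 : 0;
        }
        return r;
    }

    // Bitwise model of vpternlogd on one lane: imm8 is an 8-entry truth
    // table indexed by the corresponding bits of the three operands.
    static int ternLog(int a, int b, int c, int imm8) {
        int r = 0;
        for (int bit = 0; bit < 32; bit++) {
            int idx = (((a >>> bit) & 1) << 2)
                    | (((b >>> bit) & 1) << 1)
                    | ((c >>> bit) & 1);
            r |= ((imm8 >>> idx) & 1) << bit;
        }
        return r;
    }

    public static void main(String[] args) {
        int[] x = {1, -7, 0, 42};
        // Comparing a register with itself sets every lane to all ones.
        System.out.println(java.util.Arrays.toString(cmpEq(x, x)));
        // A truth table of 0xFF sets every bit regardless of the inputs.
        System.out.println(ternLog(x[0], x[1], x[3], 0xFF));
    }
}
```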

-------------

PR: https://git.openjdk.org/jdk/pull/7832