RFR: 8283232: x86: Improve vector broadcast operations [v3]
Quan Anh Mai
duke at openjdk.java.net
Thu Mar 17 12:51:32 UTC 2022
On Thu, 17 Mar 2022 12:05:18 GMT, Quan Anh Mai <duke at openjdk.java.net> wrote:
>> Hi,
>>
>> This patch improves the generation of broadcasting a scalar in several ways:
>>
>> - Avoid potential data bypass delay which can be observed on some platforms by using the correct type of instruction if it does not require extra instructions.
>> - As has been pointed out, dumping the whole vector into the constant table is costly in terms of code size; this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector loads can be made efficiently without crossing cache lines.
>> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high.
>>
>> This patch also removes some redundant code paths and renames some incorrectly named instructions.
>>
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
> fix rematerialize, constant deduplication
Running a simple benchmark that has a lot of register pressure:

    @Benchmark
    public long broadcastCon() {
        var species = IntVector.SPECIES_PREFERRED;
        var sum = IntVector.zero(species);
        return sum.add(1).add(2).add(3).add(4).add(5).add(6).add(7).add(8)
                .add(9).add(10).add(11).add(12).add(13).add(14).add(15).add(16)
                .add(17).add(18).add(19).add(20).add(21).add(22).add(23).add(24)
                .add(25).add(26).add(27).add(28).add(29).add(30).add(31).add(32)
                .add(1).add(2).add(3).add(4).add(5).add(6).add(7).add(8)
                .add(9).add(10).add(11).add(12).add(13).add(14).add(15).add(16)
                .add(17).add(18).add(19).add(20).add(21).add(22).add(23).add(24)
                .add(25).add(26).add(27).add(28).add(29).add(30).add(31).add(32)
                .reinterpretAsLongs()
                .lane(0);
    }

gives the following results:
Before:

    Benchmark                     Mode  Cnt   Score   Error  Units
    VectorReplicate.broadcastCon  avgt    5  16.417 ± 0.515  ns/op

After:

    Benchmark                     Mode  Cnt   Score   Error  Units
    VectorReplicate.broadcastCon  avgt    5  13.851 ± 0.154  ns/op

The constant table size decreases from 1024 bytes to 128 bytes, which is much more manageable. The throughput improvement mostly comes from the vector being rematerialized instead of being spilt on the stack.
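
As a rough sanity check on those sizes, the sketch below reproduces the arithmetic under assumptions that are not stated in this message: a 256-bit preferred species (32 bytes per vector), 4-byte int lanes, and one table entry per distinct constant. The benchmark uses 32 distinct constants (1 to 32, each twice). The class and method names are hypothetical and exist only for illustration; this is not code from the patch.

```java
// Back-of-the-envelope estimate of constant table size for replicated constants.
// Assumptions (not taken from the patch): 256-bit vectors, 4-byte int lanes,
// and exactly one constant table entry per distinct constant.
public final class ConstantTableEstimate {

    // Size when every distinct constant is stored as a fully replicated vector.
    public static int fullVectorBytes(int distinctConstants, int vectorBytes) {
        return distinctConstants * vectorBytes;
    }

    // Size when every distinct constant is stored once as a scalar element.
    public static int scalarBytes(int distinctConstants, int elementBytes) {
        return distinctConstants * elementBytes;
    }

    public static void main(String[] args) {
        int distinct = 32;                // constants 1..32, duplicates counted once
        int vectorBytes = 32;             // assumed 256-bit vector
        int elementBytes = Integer.BYTES; // 4 bytes per int lane

        System.out.println("whole vectors: " + fullVectorBytes(distinct, vectorBytes) + " bytes");
        System.out.println("scalars only:  " + scalarBytes(distinct, elementBytes) + " bytes");
    }
}
```

Under these assumptions the two figures come out to 1024 and 128 bytes, which is consistent with the sizes reported above, although the actual table layout chosen by the compiler may differ.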
I have not been able to observe any performance gain related to bypass delay. This is expected: according to Agner Fog's optimisation manual on the microarchitecture of Intel, AMD and VIA CPUs, Intel CPUs since Skylake seem to have only a few such delays.
Thank you very much.
-------------
PR: https://git.openjdk.java.net/jdk/pull/7832