RFR: 8320347: Emulate vblendvp[sd] on ECore
Sandhya Viswanathan
sviswanathan at openjdk.org
Mon Nov 20 22:09:07 UTC 2023
On Fri, 17 Nov 2023 19:58:13 GMT, Volodymyr Paprotski <duke at openjdk.org> wrote:
> Splitting vblendvp[sd] into boolean operations is a bit faster on ECore; we see up to a 30% gain.
>
>
> =============== BEFORE ===============
> Benchmark                  (SIZE)  Mode  Cnt     Score   Error  Units
> VectorSignum.floatSignum      256  avgt    3    77.766 ± 0.049  ns/op
> VectorSignum.floatSignum      512  avgt    3   154.889 ± 0.242  ns/op
> VectorSignum.floatSignum     1024  avgt    3   306.130 ± 0.605  ns/op
> VectorSignum.floatSignum     2048  avgt    3   609.965 ± 0.927  ns/op
> VectorSignum.doubleSignum     256  avgt    3   151.874 ± 1.748  ns/op
> VectorSignum.doubleSignum     512  avgt    3   303.080 ± 0.310  ns/op
> VectorSignum.doubleSignum    1024  avgt    3   607.517 ± 0.597  ns/op
> VectorSignum.doubleSignum    2048  avgt    3  1214.282 ± 1.834  ns/op
>
> Benchmark                Mode  Cnt    Score   Error  Units
> MaxMinOptimizeTest.dAdd  avgt    3   77.240 ± 0.029  us/op
> MaxMinOptimizeTest.dMax  avgt    3  137.334 ± 0.128  us/op
> MaxMinOptimizeTest.dMin  avgt    3  137.160 ± 0.465  us/op
> MaxMinOptimizeTest.dMul  avgt    3   77.231 ± 0.051  us/op
> MaxMinOptimizeTest.fAdd  avgt    3   77.165 ± 0.003  us/op
> MaxMinOptimizeTest.fMax  avgt    3  107.428 ± 1.501  us/op
> MaxMinOptimizeTest.fMin  avgt    3  107.186 ± 0.022  us/op
> MaxMinOptimizeTest.fMul  avgt    3   77.164 ± 0.012  us/op
>
> =============== AFTER ===============
> Benchmark                  (SIZE)  Mode  Cnt     Score   Error  Units
> VectorSignum.floatSignum      256  avgt    3    61.816 ± 1.980  ns/op
> VectorSignum.floatSignum      512  avgt    3   117.251 ± 0.052  ns/op
> VectorSignum.floatSignum     1024  avgt    3   231.356 ± 0.397  ns/op
> VectorSignum.floatSignum     2048  avgt    3   458.904 ± 0.774  ns/op
> VectorSignum.doubleSignum     256  avgt    3   121.449 ± 0.184  ns/op
> VectorSignum.doubleSignum     512  avgt    3   241.662 ± 0.189  ns/op
> VectorSignum.doubleSignum    1024  avgt    3   482.365 ± 0.165  ns/op
> VectorSignum.doubleSignum    2048  avgt    3   962.412 ± 1.401  ns/op
>
> Benchmark                Mode  Cnt    Score   Error  Units
> MaxMinOptimizeTest.dAdd  avgt    3   77.240 ± 0.029  us/op
> MaxMinOptimizeTest.dMax  avgt    3  125.701 ± 0.082  us/op
> MaxMinOptimizeTest.dMin  avgt    3  124.704 ± 0.119  us/op
> MaxMinOptimizeTest.dMul  avgt    3   77.232 ± 0.028  us/op
> MaxMinOptimizeTest.fAdd  avgt    3   77.169 ± 0.103  us/op
> MaxMinOptimizeTest.fMax  avgt    3   97.939 ± 0.477  us/op
> MaxMinOptimizeTest.fMin  avgt    3   98.012 ± 0.154  us/op
> MaxMinOptimizeTest.fMul  avgt    3   77.174 ± 0.012  us/op
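
How the quoted "up to 30%" figure relates to the scores depends on whether it is read as a reduction in time per operation or as a throughput gain. A minimal Java sketch of both calculations for a single row; the class and method names are illustrative only and not part of the patch or the benchmarks:

```java
final class SignumSpeedup {
    // Fractional reduction in time per operation: (before - after) / before.
    static double timeReduction(double beforeNs, double afterNs) {
        return (beforeNs - afterNs) / beforeNs;
    }

    // Throughput gain: how much more work fits into the same amount of time.
    static double throughputGain(double beforeNs, double afterNs) {
        return beforeNs / afterNs - 1.0;
    }

    public static void main(String[] args) {
        // VectorSignum.floatSignum, SIZE = 2048, scores copied from the quoted tables.
        double before = 609.965;
        double after = 458.904;
        System.out.printf("time reduction:  %.1f%%%n", 100 * timeReduction(before, after));
        System.out.printf("throughput gain: %.1f%%%n", 100 * throughputGain(before, after));
    }
}
```
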
src/hotspot/cpu/x86/x86.ad line 7844:
> 7842: int vlen_enc = vector_length_encoding(this);
> 7843: __ vpandn($vtmp$$XMMRegister, $mask$$XMMRegister, $src1$$XMMRegister, vlen_enc);
> 7844: __ vpand($dst$$XMMRegister, $src2$$XMMRegister, $mask$$XMMRegister, vlen_enc);
Maybe we could code it as below, to be consistent with other places:
`__ vpand($dst$$XMMRegister, $mask$$XMMRegister, $src2$$XMMRegister, vlen_enc);`
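
For context, the lines under review are part of the and/andn/or sequence that stands in for the blend. A minimal scalar sketch of that identity in Java, treating one 32-bit lane as an `int` and assuming the mask lane is either all ones or all zeros (as a vector compare would produce). The class and method names are illustrative and not part of the patch. Since bitwise AND is commutative, swapping the `vpand` operands changes nothing functionally; the suggestion is about consistency only.

```java
final class BlendEmulation {
    // Reference semantics of a variable blend on one 32-bit lane:
    // pick src2 when the mask's sign bit is set, otherwise src1.
    static int blendBySignBit(int src1, int src2, int mask) {
        return (mask < 0) ? src2 : src1;
    }

    // Emulation with boolean operations, mirroring vpandn / vpand / vpor.
    // Assumption: mask is 0 or -1 (all bits equal), e.g. a compare result.
    static int blendByBooleanOps(int src1, int src2, int mask) {
        int vtmp = ~mask & src1;  // vpandn: keep src1 where the mask is clear
        int dst = src2 & mask;    // vpand: keep src2 where the mask is set
        return dst | vtmp;        // vpor: merge the two halves
    }

    public static void main(String[] args) {
        int a = Float.floatToRawIntBits(1.5f);
        int b = Float.floatToRawIntBits(-2.0f);
        for (int mask : new int[] {0, -1}) {
            boolean same = blendBySignBit(a, b, mask) == blendByBooleanOps(a, b, mask);
            System.out.println("mask=" + Integer.toHexString(mask) + " equivalent=" + same);
        }
    }
}
```

With a mask that has only the sign bit set, the two functions generally disagree, which is why a sequence like this needs a full-lane mask rather than an arbitrary value.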
src/hotspot/cpu/x86/x86_64.ad line 4554:
> 4552: __ vmaxsd($tmp$$XMMRegister, $atmp$$XMMRegister, $btmp$$XMMRegister);
> 4553: __ vcmppd($btmp$$XMMRegister, $atmp$$XMMRegister, $atmp$$XMMRegister, Assembler::_false, vector_len);
> 4554: __ vblendvpd($dst$$XMMRegister, $tmp$$XMMRegister, $atmp$$XMMRegister, $btmp$$XMMRegister, vector_len, true, $btmp$$XMMRegister);
Because dst and mask (btmp in this case) need to be independent for EcoreOpt, the vblendvpd destination here should be tmp or atmp, followed by a move into dst.
Alternatively, add TEMP dst to the effect for the ECore case.
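
The independence requirement can be illustrated with a small register-file model in Java. This is a sketch under an assumed instruction order (vpand into dst, then vpandn into a temporary, then vpor); the order actually emitted by the patch may differ, and the class, method, and register-index names are hypothetical. When dst and mask name the same register, the first write clobbers the mask before it is read again.

```java
final class BlendAliasing {
    // Runs an assumed three-instruction blend emulation on a tiny register file.
    // Every argument after regs is a register index; aliasing two operands
    // means passing the same index for both.
    static int emulatedBlend(int[] regs, int dst, int src1, int src2, int mask, int tmp) {
        regs[dst] = regs[mask] & regs[src2];   // vpand  dst, mask, src2
        regs[tmp] = ~regs[mask] & regs[src1];  // vpandn tmp, mask, src1 (reads mask again)
        regs[dst] = regs[dst] | regs[tmp];     // vpor   dst, dst, tmp
        return regs[dst];
    }

    public static void main(String[] args) {
        final int a = 0x11111111;
        final int b = 0x22222222;
        // Register indices: 0 = src1, 1 = src2, 2 = mask, 3 = tmp, 4 = separate dst.
        int[] independent = {a, b, -1, 0, 0};
        int[] aliased = {a, b, -1, 0, 0};

        // The mask is all ones, so a correct blend selects src2 (b).
        int good = emulatedBlend(independent, 4, 0, 1, 2, 3);
        int bad = emulatedBlend(aliased, 2, 0, 1, 2, 3);  // dst aliases mask

        System.out.println("independent dst selects src2: " + (good == b));
        System.out.println("aliased dst selects src2:     " + (bad == b));
    }
}
```

Blending into tmp or atmp and then moving the result into dst, or declaring dst as TEMP, are the two ways suggested above to rule out this kind of overlap.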
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1399788631
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1399808720