RFR: 8320347: Emulate vblendvp[sd] on ECore

Sandhya Viswanathan sviswanathan at openjdk.org
Mon Nov 20 22:09:07 UTC 2023


On Fri, 17 Nov 2023 19:58:13 GMT, Volodymyr Paprotski <duke at openjdk.org> wrote:

> Splitting vblendvp[sd] into boolean operations is a bit faster on ECore, giving up to a 30% gain.
> 
> 
> =============== BEFORE ===============
> Benchmark                  (SIZE)  Mode  Cnt     Score   Error  Units
> VectorSignum.floatSignum      256  avgt    3    77.766 ± 0.049  ns/op
> VectorSignum.floatSignum      512  avgt    3   154.889 ± 0.242  ns/op
> VectorSignum.floatSignum     1024  avgt    3   306.130 ± 0.605  ns/op
> VectorSignum.floatSignum     2048  avgt    3   609.965 ± 0.927  ns/op
> VectorSignum.doubleSignum     256  avgt    3   151.874 ± 1.748  ns/op
> VectorSignum.doubleSignum     512  avgt    3   303.080 ± 0.310  ns/op
> VectorSignum.doubleSignum    1024  avgt    3   607.517 ± 0.597  ns/op
> VectorSignum.doubleSignum    2048  avgt    3  1214.282 ± 1.834  ns/op
> Benchmark                Mode  Cnt    Score   Error  Units
> MaxMinOptimizeTest.dAdd  avgt    3   77.240 ± 0.029  us/op
> MaxMinOptimizeTest.dMax  avgt    3  137.334 ± 0.128  us/op
> MaxMinOptimizeTest.dMin  avgt    3  137.160 ± 0.465  us/op
> MaxMinOptimizeTest.dMul  avgt    3   77.231 ± 0.051  us/op
> MaxMinOptimizeTest.fAdd  avgt    3   77.165 ± 0.003  us/op
> MaxMinOptimizeTest.fMax  avgt    3  107.428 ± 1.501  us/op
> MaxMinOptimizeTest.fMin  avgt    3  107.186 ± 0.022  us/op
> MaxMinOptimizeTest.fMul  avgt    3   77.164 ± 0.012  us/op
> 
> =============== AFTER ===============
> Benchmark                  (SIZE)  Mode  Cnt    Score   Error  Units
> VectorSignum.floatSignum      256  avgt    3   61.816 ± 1.980  ns/op
> VectorSignum.floatSignum      512  avgt    3  117.251 ± 0.052  ns/op
> VectorSignum.floatSignum     1024  avgt    3  231.356 ± 0.397  ns/op
> VectorSignum.floatSignum     2048  avgt    3  458.904 ± 0.774  ns/op
> VectorSignum.doubleSignum     256  avgt    3  121.449 ± 0.184  ns/op
> VectorSignum.doubleSignum     512  avgt    3  241.662 ± 0.189  ns/op
> VectorSignum.doubleSignum    1024  avgt    3  482.365 ± 0.165  ns/op
> VectorSignum.doubleSignum    2048  avgt    3  962.412 ± 1.401  ns/op
> Benchmark                Mode  Cnt    Score   Error  Units
> MaxMinOptimizeTest.dAdd  avgt    3   77.240 ± 0.029  us/op
> MaxMinOptimizeTest.dMax  avgt    3  125.701 ± 0.082  us/op
> MaxMinOptimizeTest.dMin  avgt    3  124.704 ± 0.119  us/op
> MaxMinOptimizeTest.dMul  avgt    3   77.232 ± 0.028  us/op
> MaxMinOptimizeTest.fAdd  avgt    3   77.169 ± 0.103  us/op
> MaxMinOptimizeTest.fMax  avgt    3   97.939 ± 0.477  us/op
> MaxMinOptimizeTest.fMin  avgt    3   98.012 ± 0.154  us/op
> MaxMinOptimizeTest.fMul  avgt    3   77.174 ± 0.012  us/op

src/hotspot/cpu/x86/x86.ad line 7844:

> 7842:     int vlen_enc = vector_length_encoding(this);
> 7843:     __ vpandn($vtmp$$XMMRegister, $mask$$XMMRegister, $src1$$XMMRegister, vlen_enc);
> 7844:     __ vpand($dst$$XMMRegister, $src2$$XMMRegister, $mask$$XMMRegister, vlen_enc);

Maybe we could code it as below, to be consistent with other places:
`__ vpand($dst$$XMMRegister, $mask$$XMMRegister, $src2$$XMMRegister, vlen_enc);`
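For reference, the emulation can be modeled on a single 32-bit lane. This is only a rough sketch, not HotSpot code: the function names are hypothetical, the final OR step is assumed from the usual bitwise-select identity (it is not in the quoted lines), and each mask lane is assumed to be all ones or all zeros. Since AND is commutative, swapping the two source operands of vpand affects readability only, not the result.

```cpp
#include <cstdint>

// Scalar model of one 32-bit lane; all names here are illustrative only.

// Models vpandn(dst, mask, src): dst = ~mask & src (the first source is inverted).
inline uint32_t lane_andn(uint32_t mask, uint32_t src) { return ~mask & src; }

// Models vpand(dst, a, b): dst = a & b.
inline uint32_t lane_and(uint32_t a, uint32_t b) { return a & b; }

// Blend emulation: picks src2 where the mask lane is all ones, src1 where it is all zeros.
// Assumes the mask lane is 0x00000000 or 0xFFFFFFFF.
inline uint32_t blend_emulated(uint32_t src1, uint32_t src2, uint32_t mask) {
    uint32_t vtmp = lane_andn(mask, src1);  // bits kept from src1
    uint32_t dst  = lane_and(mask, src2);   // bits kept from src2 (suggested operand order)
    return dst | vtmp;                      // assumed final vpor step
}

// Reference: a variable blend selects src2 when the mask sign bit is set.
inline uint32_t blend_reference(uint32_t src1, uint32_t src2, uint32_t mask) {
    return (mask >> 31) ? src2 : src1;
}
```

With a canonical mask, `blend_emulated` and `blend_reference` agree, and `lane_and(mask, src2)` equals `lane_and(src2, mask)`.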

src/hotspot/cpu/x86/x86_64.ad line 4554:

> 4552:     __ vmaxsd($tmp$$XMMRegister, $atmp$$XMMRegister, $btmp$$XMMRegister);
> 4553:     __ vcmppd($btmp$$XMMRegister, $atmp$$XMMRegister, $atmp$$XMMRegister, Assembler::_false, vector_len);
> 4554:     __ vblendvpd($dst$$XMMRegister, $tmp$$XMMRegister, $atmp$$XMMRegister, $btmp$$XMMRegister, vector_len, true, $btmp$$XMMRegister);

As dst and mask (in this case btmp) need to be independent for EcoreOpt, the vblend dst here should be tmp or atmp, followed by a move into dst.
Either that, or have TEMP dst in the effect for the ECore case.
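To illustrate the aliasing concern, here is a small register-file model. It is a sketch under stated assumptions, not the patch itself: the function names and register indices are hypothetical, and it assumes an emulation order in which dst is written before the mask is read for the last time. Under that assumption, dst == mask corrupts the result, while blending into a register that is independent of the mask and then moving into dst does not.

```cpp
#include <array>
#include <cstdint>

// Tiny register-file model; indices and names are illustrative only.
using Regs = std::array<uint64_t, 4>;

// Hypothetical emulation order: dst is written first, mask is read again afterwards.
// Correct only when dst and mask are different registers.
inline void blend_emulated_regs(Regs& r, int dst, int src1, int src2, int mask, int scratch) {
    if (dst == src1) {
        r[dst]     = ~r[mask] & r[src1];  // vpandn: consume src1 before overwriting it
        r[scratch] =  r[mask] & r[src2];  // vpand
    } else {
        r[dst]     =  r[mask] & r[src2];  // vpand: clobbers the mask if dst == mask
        r[scratch] = ~r[mask] & r[src1];  // vpandn: needs the original mask
    }
    r[dst] |= r[scratch];                 // vpor
}

// Workaround in the spirit of the review: blend into src1's register
// (independent of mask), then move the result into dst.
inline void blend_then_move(Regs& r, int dst, int src1, int src2, int mask, int scratch) {
    blend_emulated_regs(r, src1, src1, src2, mask, scratch);
    r[dst] = r[src1];                     // move into dst
}
```

The other option from the comment, declaring dst as TEMP, would instead keep the register allocator from assigning dst the same register as an input in the first place.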

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1399788631
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1399808720


More information about the hotspot-compiler-dev mailing list