RFR: 8320347: Emulate vblendvp[sd] on ECore
Jatin Bhateja
jbhateja at openjdk.org
Tue Nov 21 19:07:06 UTC 2023
On Mon, 20 Nov 2023 21:32:54 GMT, Volodymyr Paprotski <duke at openjdk.org> wrote:
> > Hi @vpaprotsk, please add checks to skip the special emulation for 128-bit vectors where applicable. As per section "4.1.8.4 256-bit Variable Blend Instructions" of the x86 optimization manual, variable blends are micro-coded only for 256-bit vectors.
>
> I went and remeasured performance of 128-bit vectors with `-XX:MaxVectorSize=16`...
>
> ```
> =============== BEFORE ===============
> Benchmark                Mode  Cnt    Score   Error  Units
> MaxMinOptimizeTest.dAdd  avgt    3   77.232 ± 0.034  us/op
> MaxMinOptimizeTest.dMax  avgt    3  149.242 ± 2.373  us/op
> MaxMinOptimizeTest.dMin  avgt    3  150.000 ± 1.763  us/op
> MaxMinOptimizeTest.dMul  avgt    3   77.237 ± 0.020  us/op
> MaxMinOptimizeTest.fAdd  avgt    3   77.156 ± 0.012  us/op
> MaxMinOptimizeTest.fMax  avgt    3  110.729 ± 0.743  us/op
> MaxMinOptimizeTest.fMin  avgt    3  110.716 ± 0.157  us/op
> MaxMinOptimizeTest.fMul  avgt    3   77.157 ± 0.017  us/op
> Benchmark                  (SIZE)  Mode  Cnt     Score    Error  Units
> VectorSignum.floatSignum      256  avgt    3   134.137 ±  4.586  ns/op
> VectorSignum.floatSignum      512  avgt    3   258.117 ±  0.518  ns/op
> VectorSignum.floatSignum     1024  avgt    3   512.706 ±  5.924  ns/op
> VectorSignum.floatSignum     2048  avgt    3   979.276 ± 46.734  ns/op
> VectorSignum.doubleSignum     256  avgt    3   233.108 ±  5.314  ns/op
> VectorSignum.doubleSignum     512  avgt    3   457.757 ±  3.537  ns/op
> VectorSignum.doubleSignum    1024  avgt    3   907.037 ±  2.768  ns/op
> VectorSignum.doubleSignum    2048  avgt    3  1816.200 ± 15.869  ns/op
>
> =============== AFTER ===============
> Benchmark                Mode  Cnt    Score   Error  Units
> MaxMinOptimizeTest.dAdd  avgt    3   77.238 ± 0.092  us/op
> MaxMinOptimizeTest.dMax  avgt    3  106.636 ± 0.072  us/op
> MaxMinOptimizeTest.dMin  avgt    3  103.060 ± 0.129  us/op
> MaxMinOptimizeTest.dMul  avgt    3   77.233 ± 0.044  us/op
> MaxMinOptimizeTest.fAdd  avgt    3   77.158 ± 0.021  us/op
> MaxMinOptimizeTest.fMax  avgt    3  105.256 ± 1.682  us/op
> MaxMinOptimizeTest.fMin  avgt    3  103.126 ± 0.049  us/op
> MaxMinOptimizeTest.fMul  avgt    3   77.155 ± 0.019  us/op
> Benchmark                  (SIZE)  Mode  Cnt     Score    Error  Units
> VectorSignum.floatSignum      256  avgt    3    60.523 ±  0.026  ns/op
> VectorSignum.floatSignum      512  avgt    3   118.415 ±  0.076  ns/op
> VectorSignum.floatSignum     1024  avgt    3   235.203 ±  0.323  ns/op
> VectorSignum.floatSignum     2048  avgt    3   467.230 ±  0.144  ns/op
> VectorSignum.doubleSignum     256  avgt    3   120.955 ±  0.217  ns/op
> VectorSignum.doubleSignum     512  avgt    3   241.753 ±  0.371  ns/op
> VectorSignum.doubleSignum    1024  avgt    3   498.055 ±  0.410  ns/op
> VectorSignum.doubleSignum    2048  avgt    3   974.891 ±  1.472  ns/op
> ```
>
> For Max/Min, keeping this patch gets us up to a 40% improvement, and for `VectorSignum.*Signum` the improvement is actually >2x.
I see the following results on Cascade Lake:
-XX:+UnlockDiagnosticVMOptions -XX:-EnableX86ECoreOpts -XX:MaxVectorSize=16
Benchmark                    Mode  Cnt    Score  Error  Units
MaxMinOptimizeTest.dMax      avgt    2  119.131         us/op
MaxMinOptimizeTest.dMax:asm  avgt           NaN           ---
MaxMinOptimizeTest.dMin      avgt    2  117.812         us/op
MaxMinOptimizeTest.dMin:asm  avgt           NaN           ---

-XX:+UnlockDiagnosticVMOptions -XX:+EnableX86ECoreOpts -XX:MaxVectorSize=16
Benchmark                    Mode  Cnt    Score  Error  Units
MaxMinOptimizeTest.dMax      avgt    2  128.076         us/op
MaxMinOptimizeTest.dMax:asm  avgt           NaN           ---
MaxMinOptimizeTest.dMin      avgt    2  126.978         us/op
MaxMinOptimizeTest.dMin:asm  avgt           NaN           ---
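
As a rough cross-check of the figures in this thread, the average-time scores can be turned into ratios. The sketch below is illustrative only and not part of the patch: the class and method names (`BlendTimings`, `ratio`) are made up for this example, and the inputs are copied from the tables above.

```java
// Illustrative helper, not part of the patch: compares two JMH average-time scores.
final class BlendTimings {
    // Time ratio of a baseline run to a candidate run.
    // A value above 1 means the candidate run took less time per op.
    static double ratio(double baseline, double candidate) {
        return baseline / candidate;
    }

    public static void main(String[] args) {
        // Quoted run with -XX:MaxVectorSize=16 (BEFORE vs AFTER)
        System.out.printf("dMax             %.2fx%n", ratio(149.242, 106.636));
        System.out.printf("floatSignum 256  %.2fx%n", ratio(134.137, 60.523));
        // Cascade Lake run: -EnableX86ECoreOpts vs +EnableX86ECoreOpts
        System.out.printf("dMax (CLX)       %.2fx%n", ratio(119.131, 128.076));
        System.out.printf("dMin (CLX)       %.2fx%n", ratio(117.812, 126.978));
    }
}
```

For the Cascade Lake numbers the ratio comes out below 1, i.e. the run with `-XX:+EnableX86ECoreOpts` is roughly 7 to 8% slower for 128-bit vectors. Those runs used only 2 iterations and report no error, so the difference should be read as indicative.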
-------------
PR Comment: https://git.openjdk.org/jdk/pull/16716#issuecomment-1821505204