RFR: 8320347: Emulate vblendvp[sd] on ECore
Jatin Bhateja
jbhateja at openjdk.org
Tue Nov 21 19:27:10 UTC 2023
On Mon, 20 Nov 2023 21:32:54 GMT, Volodymyr Paprotski <duke at openjdk.org> wrote:
> > Hi @vpaprotsk, please add checks to skip the special emulation for 128-bit vectors at applicable places. As per section "4.1.8.4 256-bit Variable Blend Instructions" of the x86 optimization manual, variable blends are micro-coded only for 256-bit vectors.
>
> I went and remeasured performance of 128-bit vectors with `-XX:MaxVectorSize=16`...
>
> ```
> =============== BEFORE ===============
> Benchmark                Mode  Cnt    Score   Error  Units
> MaxMinOptimizeTest.dAdd  avgt    3   77.232 ± 0.034  us/op
> MaxMinOptimizeTest.dMax  avgt    3  149.242 ± 2.373  us/op
> MaxMinOptimizeTest.dMin  avgt    3  150.000 ± 1.763  us/op
> MaxMinOptimizeTest.dMul  avgt    3   77.237 ± 0.020  us/op
> MaxMinOptimizeTest.fAdd  avgt    3   77.156 ± 0.012  us/op
> MaxMinOptimizeTest.fMax  avgt    3  110.729 ± 0.743  us/op
> MaxMinOptimizeTest.fMin  avgt    3  110.716 ± 0.157  us/op
> MaxMinOptimizeTest.fMul  avgt    3   77.157 ± 0.017  us/op
>
> Benchmark                  (SIZE)  Mode  Cnt     Score    Error  Units
> VectorSignum.floatSignum      256  avgt    3   134.137 ±  4.586  ns/op
> VectorSignum.floatSignum      512  avgt    3   258.117 ±  0.518  ns/op
> VectorSignum.floatSignum     1024  avgt    3   512.706 ±  5.924  ns/op
> VectorSignum.floatSignum     2048  avgt    3   979.276 ± 46.734  ns/op
> VectorSignum.doubleSignum     256  avgt    3   233.108 ±  5.314  ns/op
> VectorSignum.doubleSignum     512  avgt    3   457.757 ±  3.537  ns/op
> VectorSignum.doubleSignum    1024  avgt    3   907.037 ±  2.768  ns/op
> VectorSignum.doubleSignum    2048  avgt    3  1816.200 ± 15.869  ns/op
>
> =============== AFTER ===============
> Benchmark                Mode  Cnt    Score   Error  Units
> MaxMinOptimizeTest.dAdd  avgt    3   77.238 ± 0.092  us/op
> MaxMinOptimizeTest.dMax  avgt    3  106.636 ± 0.072  us/op
> MaxMinOptimizeTest.dMin  avgt    3  103.060 ± 0.129  us/op
> MaxMinOptimizeTest.dMul  avgt    3   77.233 ± 0.044  us/op
> MaxMinOptimizeTest.fAdd  avgt    3   77.158 ± 0.021  us/op
> MaxMinOptimizeTest.fMax  avgt    3  105.256 ± 1.682  us/op
> MaxMinOptimizeTest.fMin  avgt    3  103.126 ± 0.049  us/op
> MaxMinOptimizeTest.fMul  avgt    3   77.155 ± 0.019  us/op
>
> Benchmark                  (SIZE)  Mode  Cnt     Score    Error  Units
> VectorSignum.floatSignum      256  avgt    3    60.523 ±  0.026  ns/op
> VectorSignum.floatSignum      512  avgt    3   118.415 ±  0.076  ns/op
> VectorSignum.floatSignum     1024  avgt    3   235.203 ±  0.323  ns/op
> VectorSignum.floatSignum     2048  avgt    3   467.230 ±  0.144  ns/op
> VectorSignum.doubleSignum     256  avgt    3   120.955 ±  0.217  ns/op
> VectorSignum.doubleSignum     512  avgt    3   241.753 ±  0.371  ns/op
> VectorSignum.doubleSignum    1024  avgt    3   498.055 ±  0.410  ns/op
> VectorSignum.doubleSignum    2048  avgt    3   974.891 ±  1.472  ns/op
> ```
>
> For Max/Min, keeping this patch gets us up to 40%, and for `VectorSignum.*Signum` the fix is actually >2x.
Thanks for the clarification. I checked, and the latency of the variable blend is 5 cycles on E-cores, which explains the perf improvements.
https://uops.info/html-lat/ADL-E/VBLENDVPS_XMM_XMM_XMM_XMM-Measurements.html
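
For readers following along, here is a minimal scalar sketch of the general idea behind replacing a variable blend with cheaper bitwise operations. It is not the code from the PR: the class and method names are hypothetical, and it models a single 32-bit lane in plain Java rather than vector registers in the HotSpot assembler. It assumes the usual `vblendvps` semantics, where only the sign bit of each mask lane selects between the two sources.

```java
public final class BlendEmulationSketch {
    // Reference semantics for one 32-bit lane of a variable blend:
    // the sign bit of the mask picks src2, otherwise src1 is kept.
    public static int blendv(int src1, int src2, int mask) {
        return (mask < 0) ? src2 : src1;
    }

    // Emulation with simple ops (hypothetical sketch, not the PR's code):
    // an arithmetic shift broadcasts the sign bit to the whole lane,
    // then and / and-not / or combine the two sources.
    public static int blendvEmulated(int src1, int src2, int mask) {
        int full = mask >> 31;                  // 0 or -1 (all ones)
        return (src2 & full) | (src1 & ~full);
    }

    public static void main(String[] args) {
        int a = Float.floatToRawIntBits(1.5f);
        int b = Float.floatToRawIntBits(-2.5f);
        int[] masks = {0, 1, -1, Integer.MIN_VALUE, Integer.MAX_VALUE, 0x80000001};
        boolean allMatch = true;
        for (int m : masks) {
            allMatch &= blendv(a, b, m) == blendvEmulated(a, b, m);
        }
        System.out.println(allMatch ? "all lanes match" : "mismatch");
    }
}
```

When the mask lanes are already all-ones or all-zeros (for example, the result of a vector compare), the shift step would be unnecessary in a sketch like this, leaving only the three bitwise operations.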
-------------
PR Comment: https://git.openjdk.org/jdk/pull/16716#issuecomment-1821547115