RFR: 8320347: Emulate vblendvp[sd] on ECore [v2]

Tue Nov 21 18:36:18 UTC 2023

On Tue, 21 Nov 2023 00:37:23 GMT, Volodymyr Paprotski <duke at openjdk.org> wrote:

>> Splitting vblendvp[sd] into boolean operations is a bit faster on ECore, with gains of up to 30%.
>> 
>> 
>> =============== BEFORE ===============
>> Benchmark                  (SIZE)  Mode  Cnt     Score   Error  Units
>> VectorSignum.floatSignum      256  avgt    3    77.766 ± 0.049  ns/op
>> VectorSignum.floatSignum      512  avgt    3   154.889 ± 0.242  ns/op
>> VectorSignum.floatSignum     1024  avgt    3   306.130 ± 0.605  ns/op
>> VectorSignum.floatSignum     2048  avgt    3   609.965 ± 0.927  ns/op
>> VectorSignum.doubleSignum     256  avgt    3   151.874 ± 1.748  ns/op
>> VectorSignum.doubleSignum     512  avgt    3   303.080 ± 0.310  ns/op
>> VectorSignum.doubleSignum    1024  avgt    3   607.517 ± 0.597  ns/op
>> VectorSignum.doubleSignum    2048  avgt    3  1214.282 ± 1.834  ns/op
>> Benchmark                Mode  Cnt    Score   Error  Units
>> MaxMinOptimizeTest.dAdd  avgt    3   77.240 ± 0.029  us/op
>> MaxMinOptimizeTest.dMax  avgt    3  137.334 ± 0.128  us/op
>> MaxMinOptimizeTest.dMin  avgt    3  137.160 ± 0.465  us/op
>> MaxMinOptimizeTest.dMul  avgt    3   77.231 ± 0.051  us/op
>> MaxMinOptimizeTest.fAdd  avgt    3   77.165 ± 0.003  us/op
>> MaxMinOptimizeTest.fMax  avgt    3  107.428 ± 1.501  us/op
>> MaxMinOptimizeTest.fMin  avgt    3  107.186 ± 0.022  us/op
>> MaxMinOptimizeTest.fMul  avgt    3   77.164 ± 0.012  us/op
>> 
>> =============== AFTER ===============
>> Benchmark                  (SIZE)  Mode  Cnt    Score   Error  Units
>> VectorSignum.floatSignum      256  avgt    3   61.816 ± 1.980  ns/op
>> VectorSignum.floatSignum      512  avgt    3  117.251 ± 0.052  ns/op
>> VectorSignum.floatSignum     1024  avgt    3  231.356 ± 0.397  ns/op
>> VectorSignum.floatSignum     2048  avgt    3  458.904 ± 0.774  ns/op
>> VectorSignum.doubleSignum     256  avgt    3  121.449 ± 0.184  ns/op
>> VectorSignum.doubleSignum     512  avgt    3  241.662 ± 0.189  ns/op
>> VectorSignum.doubleSignum    1024  avgt    3  482.365 ± 0.165  ns/op
>> VectorSignum.doubleSignum    2048  avgt    3  962.412 ± 1.401  ns/op
>> Benchmark                Mode  Cnt    Score   Error  Units
>> MaxMinOptimizeTest.dAdd  avgt    3   77.240 ± 0.029  us/op
>> MaxMinOptimizeTest.dMax  avgt    3  125.701 ± 0.082  us/op
>> MaxMinOptimizeTest.dMin  avgt    3  124.704 ± 0.119  us/op
>> MaxMinOptimizeTest.dMul  avgt    3   77.232 ± 0.028  us/op
>> MaxMinOptimizeTest.fAdd  avgt    3   77.169 ± 0.103  us/op
>> MaxMinOptimizeTest.fMax  avgt    3   97.939 ± 0.477  us/op
>> MaxMinOptimizeTest.fMin  avgt    3   98.012 ± 0.154  us/op
>> MaxMinO...
>
> Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
> 
>  - Merge remote-tracking branch 'jdk/master' into vp-ecore2
>  - review comments
>  - emulate vblend on ecores
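
For readers following the thread, the idea in the quoted summary (replacing vblendvp[sd] with boolean operations) can be sketched per 32-bit lane as below. This is a hypothetical scalar model, not the HotSpot code: the function names and the mask_is_full flag are made up for illustration. It assumes the usual blendv rule that a lane takes the second source when the sign bit of the mask lane is set.

```cpp
#include <cstdint>

// Hypothetical scalar model of one 32-bit lane; not HotSpot code.
// Reference blendv behaviour: take src2 when the mask's sign bit is set, else src1.
static uint32_t blend_lane_reference(uint32_t src1, uint32_t src2, uint32_t mask) {
  return (mask & 0x80000000u) ? src2 : src1;
}

// Emulation using only AND, AND-NOT and OR.
// When mask_is_full is false, the sign bit is first broadcast across the lane
// (the job an arithmetic right shift does on a vector register).
static uint32_t blend_lane_emulated(uint32_t src1, uint32_t src2,
                                    uint32_t mask, bool mask_is_full) {
  uint32_t full = mask_is_full ? mask
                               : ((mask & 0x80000000u) ? 0xFFFFFFFFu : 0u);
  return (full & src2) | (~full & src1);
}
```

When the mask lane is already all-ones or all-zeros (for example, the result of a vector compare), the broadcast step can be skipped, which is what the boolean flag in the sketch models.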

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1112:

> 1110:   void (MacroAssembler::*vblend)(XMMRegister, XMMRegister, XMMRegister, XMMRegister, int, bool, XMMRegister);
> 1111:   void (MacroAssembler::*vmaxmin)(XMMRegister, XMMRegister, XMMRegister, int);
> 1112:   void (MacroAssembler::*vcmp)(XMMRegister, XMMRegister, XMMRegister, int, int);

We support the C++11 dialect, so you can use an alias declaration for these pointer-to-member types instead, e.g.:
using vblend = void (MacroAssembler::*)(XMMRegister, XMMRegister, XMMRegister, XMMRegister, int, bool, XMMRegister);
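
A minimal, self-contained sketch of that pattern follows. MockAssembler, VMaxMinFn and emit_minmax are hypothetical stand-ins chosen so the example compiles on its own; only the alias syntax and the `(obj.*ptr)(...)` call form are the point.

```cpp
// Hypothetical stand-in for MacroAssembler; only the member shapes matter here.
struct MockAssembler {
  int emitted = 0;
  // Parameters are unnamed because this mock only records which routine ran.
  void vminps(int, int, int, int) { emitted += 1; }
  void vmaxps(int, int, int, int) { emitted += 2; }
};

// C++11 alias for a pointer to a member function with this signature.
using VMaxMinFn = void (MockAssembler::*)(int, int, int, int);

// Selects a member through the alias and invokes it on the given object.
static int emit_minmax(MockAssembler& masm, bool is_min) {
  VMaxMinFn op = is_min ? &MockAssembler::vminps : &MockAssembler::vmaxps;
  (masm.*op)(0, 1, 2, 0);  // dst, src1, src2, vector_len (dummy values)
  return masm.emitted;
}
```

The alias names the type once; local variables of that type then replace the long in-place declarations.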

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 3577:

> 3575:   if (EnableX86ECoreOpts && scratch_available && dst_available) {
> 3576:     XMMRegister full_mask = mask;
> 3577:     if (!fully_masked) {

Naming suggestion for readability: fully_masked -> compute_mask.

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 3601:

> 3599:   if (EnableX86ECoreOpts && scratch_available && dst_available) {
> 3600:     XMMRegister full_mask = mask;
> 3601:     if (!fully_masked) {

Same as above: fully_masked -> compute_mask; also remove full_mask.

src/hotspot/cpu/x86/x86.ad line 7840:

> 7838:   match(Set dst (VectorBlend (Binary src1 src2) mask));
> 7839:   format %{ "vector_blend  $dst,$src1,$src2,$mask\t! using $vtmp as TEMP" %}
> 7840:   effect(TEMP vtmp, TEMP dst);

TEMP dst can be removed.

src/hotspot/cpu/x86/x86_64.ad line 4519:

> 4517:       __ vcmpps($btmp$$XMMRegister, $atmp$$XMMRegister, $atmp$$XMMRegister, Assembler::_false, vector_len);
> 4518:       __ vblendvps($dst$$XMMRegister, $tmp$$XMMRegister, $atmp$$XMMRegister, $btmp$$XMMRegister, vector_len, true, $btmp$$XMMRegister);
> 4519:     }

Please move into a new macro assembly routine.

src/hotspot/cpu/x86/x86_64.ad line 4568:

> 4566:       __ vcmppd($btmp$$XMMRegister, $atmp$$XMMRegister, $atmp$$XMMRegister, Assembler::_false, vector_len);
> 4567:       __ vblendvpd($dst$$XMMRegister, $tmp$$XMMRegister, $atmp$$XMMRegister, $btmp$$XMMRegister, vector_len, true, $btmp$$XMMRegister);
> 4568:     }

Please move to a new macro assembly routine.

src/hotspot/cpu/x86/x86_64.ad line 4616:

> 4614:       __ vcmpps($btmp$$XMMRegister, $atmp$$XMMRegister, $atmp$$XMMRegister, Assembler::_false, vector_len);
> 4615:       __ vblendvps($dst$$XMMRegister, $tmp$$XMMRegister, $atmp$$XMMRegister, $btmp$$XMMRegister, vector_len, true, $btmp$$XMMRegister);
> 4616:     }

Please move to a new macro assembly routine.

src/hotspot/cpu/x86/x86_64.ad line 4645:

> 4643:      "vcmppd.unordered $btmp,$atmp,$atmp        \n\t"
> 4644:      "vblendvpd        $dst,$tmp,$atmp,$btmp    \n\t"
> 4645:   %}

The format block may not be valid for e-cores; you can replace it with the following so that it is consistent on both core types.
         ` minD $dst, $a, $b \t! using $tmp, $atmp and $btmp as TEMP `
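
For context, the two instructions named in the quoted format block (an unordered compare of $atmp with itself, followed by a blend) amount to a NaN fix-up on the min/max result. The sketch below models that for a single double lane. It is an illustration only: the helper names are hypothetical, and it assumes the compare yields an all-ones lane exactly when $atmp is NaN.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Bit-level views of a double, via memcpy to avoid aliasing issues.
static uint64_t bits_of(double d) { uint64_t u; std::memcpy(&u, &d, sizeof u); return u; }
static double double_of(uint64_t u) { double d; std::memcpy(&d, &u, sizeof d); return d; }

// Models an unordered compare of x with itself: all-ones when x is NaN, else zero.
static uint64_t unordered_with_self(double x) {
  return std::isnan(x) ? ~uint64_t{0} : uint64_t{0};
}

// Models the final blend: keep the min/max result unless atmp is NaN,
// in which case the NaN is passed through.
static double nan_fixup(double minmax_result, double atmp) {
  uint64_t mask = unordered_with_self(atmp);
  return double_of((mask & bits_of(atmp)) | (~mask & bits_of(minmax_result)));
}
```

Because the compare already produces a full per-lane mask, the blend here needs no extra step to derive the mask from a sign bit.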

src/hotspot/cpu/x86/x86_64.ad line 4665:

> 4663:       __ vcmppd($btmp$$XMMRegister, $atmp$$XMMRegister, $atmp$$XMMRegister, Assembler::_false, vector_len);
> 4664:       __ vblendvpd($dst$$XMMRegister, $tmp$$XMMRegister, $atmp$$XMMRegister, $btmp$$XMMRegister, vector_len, true, $btmp$$XMMRegister);
> 4665:     }

Please move this logic into a new macro assembly routine.

test/hotspot/jtreg/compiler/vectorization/TestSignumVector.java line 112:

> 110:             if (fout[i] != 1.0)   throw new RuntimeException("Expected positive numbers in second half of array: " + java.util.Arrays.toString(fout));
> 111:         }
> 112:     }

It's OK to add a correctness check here, but this test is only intended to perform IR validation; there are detailed functional tests in the following files:
test/hotspot/jtreg/compiler/intrinsics/math/TestSignumIntrinsic.java
test/hotspot/jtreg/compiler/c2/cr6340864/TestFloatVect.java
test/hotspot/jtreg/compiler/c2/cr6340864/TestDoubleVect.java

test/hotspot/jtreg/compiler/vectorization/runner/BasicFloatOpTest.java line 119:

> 117:             }
> 118:         }
> 119: 

This test performs IR validation; you can also update the existing functional test with more test values:
test/hotspot/jtreg/compiler/intrinsics/math/TestFpMinMaxIntrinsics.java

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400923128
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400953279
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400987897
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400983194
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400985305
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400985798
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400986113
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400969890
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400976470
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1401000788
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1401006644