RFR: 8320347: Emulate vblendvp[sd] on ECore [v2]
Jatin Bhateja
jbhateja at openjdk.org
Tue Nov 21 18:36:18 UTC 2023
On Tue, 21 Nov 2023 00:37:23 GMT, Volodymyr Paprotski <duke at openjdk.org> wrote:
>> Splitting vblendvp[sd] into boolean operations is a bit faster on ECore, with gains of up to 30%.
>>
>>
>> =============== BEFORE ===============
>> Benchmark                  (SIZE)  Mode  Cnt     Score   Error  Units
>> VectorSignum.floatSignum      256  avgt    3    77.766 ± 0.049  ns/op
>> VectorSignum.floatSignum      512  avgt    3   154.889 ± 0.242  ns/op
>> VectorSignum.floatSignum     1024  avgt    3   306.130 ± 0.605  ns/op
>> VectorSignum.floatSignum     2048  avgt    3   609.965 ± 0.927  ns/op
>> VectorSignum.doubleSignum     256  avgt    3   151.874 ± 1.748  ns/op
>> VectorSignum.doubleSignum     512  avgt    3   303.080 ± 0.310  ns/op
>> VectorSignum.doubleSignum    1024  avgt    3   607.517 ± 0.597  ns/op
>> VectorSignum.doubleSignum    2048  avgt    3  1214.282 ± 1.834  ns/op
>> Benchmark                Mode  Cnt    Score   Error  Units
>> MaxMinOptimizeTest.dAdd  avgt    3   77.240 ± 0.029  us/op
>> MaxMinOptimizeTest.dMax  avgt    3  137.334 ± 0.128  us/op
>> MaxMinOptimizeTest.dMin  avgt    3  137.160 ± 0.465  us/op
>> MaxMinOptimizeTest.dMul  avgt    3   77.231 ± 0.051  us/op
>> MaxMinOptimizeTest.fAdd  avgt    3   77.165 ± 0.003  us/op
>> MaxMinOptimizeTest.fMax  avgt    3  107.428 ± 1.501  us/op
>> MaxMinOptimizeTest.fMin  avgt    3  107.186 ± 0.022  us/op
>> MaxMinOptimizeTest.fMul  avgt    3   77.164 ± 0.012  us/op
>>
>> =============== AFTER ===============
>> Benchmark                  (SIZE)  Mode  Cnt     Score   Error  Units
>> VectorSignum.floatSignum      256  avgt    3    61.816 ± 1.980  ns/op
>> VectorSignum.floatSignum      512  avgt    3   117.251 ± 0.052  ns/op
>> VectorSignum.floatSignum     1024  avgt    3   231.356 ± 0.397  ns/op
>> VectorSignum.floatSignum     2048  avgt    3   458.904 ± 0.774  ns/op
>> VectorSignum.doubleSignum     256  avgt    3   121.449 ± 0.184  ns/op
>> VectorSignum.doubleSignum     512  avgt    3   241.662 ± 0.189  ns/op
>> VectorSignum.doubleSignum    1024  avgt    3   482.365 ± 0.165  ns/op
>> VectorSignum.doubleSignum    2048  avgt    3   962.412 ± 1.401  ns/op
>> Benchmark                Mode  Cnt    Score   Error  Units
>> MaxMinOptimizeTest.dAdd  avgt    3   77.240 ± 0.029  us/op
>> MaxMinOptimizeTest.dMax  avgt    3  125.701 ± 0.082  us/op
>> MaxMinOptimizeTest.dMin  avgt    3  124.704 ± 0.119  us/op
>> MaxMinOptimizeTest.dMul  avgt    3   77.232 ± 0.028  us/op
>> MaxMinOptimizeTest.fAdd  avgt    3   77.169 ± 0.103  us/op
>> MaxMinOptimizeTest.fMax  avgt    3   97.939 ± 0.477  us/op
>> MaxMinOptimizeTest.fMin  avgt    3   98.012 ± 0.154  us/op
>> MaxMinO...
>
> Volodymyr Paprotski has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>
> - Merge remote-tracking branch 'jdk/master' into vp-ecore2
> - review comments
> - emulate vblend on ecores
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1112:
> 1110: void (MacroAssembler::*vblend)(XMMRegister, XMMRegister, XMMRegister, XMMRegister, int, bool, XMMRegister);
> 1111: void (MacroAssembler::*vmaxmin)(XMMRegister, XMMRegister, XMMRegister, int);
> 1112: void (MacroAssembler::*vcmp)(XMMRegister, XMMRegister, XMMRegister, int, int);
We do support the C++11 dialect, so you can use declarations such as the following:
using vblend = void (*) (XMMRegister, XMMRegister, XMMRegister, XMMRegister, int, bool, XMMRegister);
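Note that the three declarations quoted above are pointers to member functions, so an alias that matches them would need the `Class::*` form rather than a plain function pointer. A minimal sketch of that form follows; `Asm`, `add3`, `mul3`, `ternary_op` and `apply` are hypothetical stand-ins and not HotSpot code.

```cpp
// Hypothetical stand-in for an assembler class; not HotSpot code.
struct Asm {
  int last = 0;
  void add3(int a, int b, int c) { last = a + b + c; }
  void mul3(int a, int b, int c) { last = a * b * c; }
};

// C++11 alias for a pointer to a member function of Asm.
using ternary_op = void (Asm::*)(int, int, int);

// Pick one of two member functions at run time and call it through the alias.
int apply(Asm& masm, bool use_add, int a, int b, int c) {
  ternary_op op = use_add ? &Asm::add3 : &Asm::mul3;
  (masm.*op)(a, b, c);  // call through the member pointer
  return masm.last;
}
```

With the alias in place, a local such as `ternary_op op;` replaces the longer `void (Asm::*op)(int, int, int);` spelling.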
src/hotspot/cpu/x86/macroAssembler_x86.cpp line 3577:
> 3575: if (EnableX86ECoreOpts && scratch_available && dst_available) {
> 3576: XMMRegister full_mask = mask;
> 3577: if (!fully_masked) {
Name change suggestion for better understanding: fully_masked -> compute_mask.
src/hotspot/cpu/x86/macroAssembler_x86.cpp line 3601:
> 3599: if (EnableX86ECoreOpts && scratch_available && dst_available) {
> 3600: XMMRegister full_mask = mask;
> 3601: if (!fully_masked) {
Same as above: fully_masked -> compute_mask, and remove full_mask.
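To make the role of the flag concrete, here is a rough scalar model of one 32-bit lane. It assumes the flag is renamed to compute_mask and tells the routine whether the sign bit still has to be widened into a full lane mask. All function names are hypothetical, and the actual patch may widen the mask with different instructions.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a float's bits as a 32-bit integer and back.
static uint32_t bits_of(float f) { uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }
static float float_of(uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }

// One lane of vblendvps: pick src2 when the mask's sign bit is set, else src1.
float blend_lane_ref(float src1, float src2, float mask) {
  return (bits_of(mask) >> 31) ? src2 : src1;
}

// The same lane built from boolean operations only.
// compute_mask == true:  only the sign bit of `mask` is meaningful, so widen it first.
// compute_mask == false: the caller guarantees the lane is all ones or all zeros.
float blend_lane_emulated(float src1, float src2, float mask, bool compute_mask) {
  uint32_t m = bits_of(mask);
  if (compute_mask) {
    m = (m >> 31) ? 0xFFFFFFFFu : 0u;  // broadcast the sign bit across the lane
  }
  return float_of((bits_of(src2) & m) | (bits_of(src1) & ~m));
}
```

With this reading, a caller that already holds a compare result (all ones or all zeros per lane) passes compute_mask = false and skips the widening step.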
src/hotspot/cpu/x86/x86.ad line 7840:
> 7838: match(Set dst (VectorBlend (Binary src1 src2) mask));
> 7839: format %{ "vector_blend $dst,$src1,$src2,$mask\t! using $vtmp as TEMP" %}
> 7840: effect(TEMP vtmp, TEMP dst);
TEMP dst can be removed.
src/hotspot/cpu/x86/x86_64.ad line 4519:
> 4517: __ vcmpps($btmp$$XMMRegister, $atmp$$XMMRegister, $atmp$$XMMRegister, Assembler::_false, vector_len);
> 4518: __ vblendvps($dst$$XMMRegister, $tmp$$XMMRegister, $atmp$$XMMRegister, $btmp$$XMMRegister, vector_len, true, $btmp$$XMMRegister);
> 4519: }
Please move this into a new macro assembly routine.
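As a rough scalar model of what such a routine would wrap, the sketch below covers one lane of the two quoted instructions. It reads the compare predicate as the unordered comparison named in the format string quoted further down. The helper name nan_select_lane is hypothetical.

```cpp
#include <cmath>

// Hypothetical helper; models one lane of the quoted compare-and-blend tail.
// An unordered compare of atmp with itself is true exactly when atmp is NaN,
// and the blend then picks atmp for that lane and tmp otherwise.
float nan_select_lane(float tmp, float atmp) {
  bool unordered = std::isnan(atmp);  // vcmpps btmp, atmp, atmp (unordered)
  return unordered ? atmp : tmp;      // vblendvps dst, tmp, atmp, btmp
}
```

Factoring the pair into one routine would also give the ECore emulation a single place to live, instead of four copies in the .ad file.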
src/hotspot/cpu/x86/x86_64.ad line 4568:
> 4566: __ vcmppd($btmp$$XMMRegister, $atmp$$XMMRegister, $atmp$$XMMRegister, Assembler::_false, vector_len);
> 4567: __ vblendvpd($dst$$XMMRegister, $tmp$$XMMRegister, $atmp$$XMMRegister, $btmp$$XMMRegister, vector_len, true, $btmp$$XMMRegister);
> 4568: }
Please move to a new macro assembly routine.
src/hotspot/cpu/x86/x86_64.ad line 4616:
> 4614: __ vcmpps($btmp$$XMMRegister, $atmp$$XMMRegister, $atmp$$XMMRegister, Assembler::_false, vector_len);
> 4615: __ vblendvps($dst$$XMMRegister, $tmp$$XMMRegister, $atmp$$XMMRegister, $btmp$$XMMRegister, vector_len, true, $btmp$$XMMRegister);
> 4616: }
Please move to a new macro assembly routine.
src/hotspot/cpu/x86/x86_64.ad line 4645:
> 4643: "vcmppd.unordered $btmp,$atmp,$atmp \n\t"
> 4644: "vblendvpd $dst,$tmp,$atmp,$btmp \n\t"
> 4645: %}
The format block may not be valid for E-cores; you can replace it with the following so that it is consistent on both core types:
` minD $dst, $a, $b \t! using $tmp, $atmp and $btmp as TEMP `
src/hotspot/cpu/x86/x86_64.ad line 4665:
> 4663: __ vcmppd($btmp$$XMMRegister, $atmp$$XMMRegister, $atmp$$XMMRegister, Assembler::_false, vector_len);
> 4664: __ vblendvpd($dst$$XMMRegister, $tmp$$XMMRegister, $atmp$$XMMRegister, $btmp$$XMMRegister, vector_len, true, $btmp$$XMMRegister);
> 4665: }
Please move this logic into a new macro assembly routine.
test/hotspot/jtreg/compiler/vectorization/TestSignumVector.java line 112:
> 110: if (fout[i] != 1.0) throw new RuntimeException("Expected positive numbers in second half of array: " + java.util.Arrays.toString(fout));
> 111: }
> 112: }
It's OK to add a correctness check here, but this test is only intended to perform IR validation; there are detailed functional tests in the following files:
test/hotspot/jtreg/compiler/intrinsics/math/TestSignumIntrinsic.java
test/hotspot/jtreg/compiler/c2/cr6340864/TestFloatVect.java
test/hotspot/jtreg/compiler/c2/cr6340864/TestDoubleVect.java
test/hotspot/jtreg/compiler/vectorization/runner/BasicFloatOpTest.java line 119:
> 117: }
> 118: }
> 119:
This test performs IR validation; you can also update the existing functional test with more test values:
test/hotspot/jtreg/compiler/intrinsics/math/TestFpMinMaxIntrinsics.java
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400923128
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400953279
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400987897
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400983194
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400985305
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400985798
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400986113
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400969890
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1400976470
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1401000788
PR Review Comment: https://git.openjdk.org/jdk/pull/16716#discussion_r1401006644