RFR: 8307609: RISC-V: Added support for Extract, Compress, Expand and other nodes for Vector API
Dingli Zhang
dzhang at openjdk.org
Fri May 12 11:17:47 UTC 2023
Hi all,
We have added support for Extract, Compress, Expand and other nodes for the
Vector API. The implementation follows RVV v1.0 [1]. Please take a look and
review. Thanks a lot.
This PR adds support for the following new nodes:
CompressM/CompressV/ExpandV
LoadVectorGather/StoreVectorScatter/LoadVectorGatherMasked/StoreVectorScatterMasked
Extract
VectorLongToMask/VectorMaskToLong
PopulateIndex
VectorMaskTrueCount/VectorMaskFirstTrue
VectorInsert
At the same time, we refactored methods such as
`match_rule_supported_vector_masked`. All implemented vector nodes now support
mask operations by default, so we also added masked variants for all
implemented nodes.
The VectorTest node will be implemented in a follow-up PR.
The tests under `test/jdk/jdk/incubator/vector` can be used to print the
compilation log for most of the new nodes. The following command prints the
compilation log of a jtreg test case:
$ jtreg \
-v:default \
-concurrency:16 -timeout:50 \
-javaoption:-XX:+UnlockExperimentalVMOptions \
-javaoption:-XX:+UseRVV \
-javaoption:-XX:+PrintOptoAssembly \
-javaoption:-XX:LogFile=log_name.log \
-jdk:build/linux-riscv64-server-fastdebug/jdk \
-compilejdk:build/linux-x86_64-server-release/images/jdk \
<test-case>
### CompressM/CompressV/ExpandV
RVV provides no inverse of vcompress (there is no vdecompress), because that
operation can be readily synthesized from iota and a masked vrgather. `ExpandV`
is implemented this way.
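To make the iota-plus-masked-gather idea concrete, here is a minimal scalar sketch in Java. It is only a model of the lane semantics, not the code in this PR: the class and method names are hypothetical, and zeroing the inactive lanes is an assumption taken from the Vector API definition of compress and expand rather than from the RVV tail policy.

```java
// Scalar model of compress/expand lane semantics; illustrative only.
final class ExpandSketch {
    // iota-style prefix count: out[i] = number of set mask lanes below lane i.
    static int[] iota(boolean[] mask) {
        int[] out = new int[mask.length];
        int count = 0;
        for (int i = 0; i < mask.length; i++) {
            out[i] = count;
            if (mask[i]) {
                count++;
            }
        }
        return out;
    }

    // Compress: pack the active lanes of src to the front, leave the rest zero.
    static int[] compress(int[] src, boolean[] mask) {
        int[] dst = new int[src.length];
        int next = 0;
        for (int i = 0; i < src.length; i++) {
            if (mask[i]) {
                dst[next++] = src[i];
            }
        }
        return dst;
    }

    // Expand: a masked gather whose index vector is iota(mask).
    // Inactive lanes are never written and stay zero.
    static int[] expand(int[] src, boolean[] mask) {
        int[] idx = iota(mask);
        int[] dst = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            if (mask[i]) {
                dst[i] = src[idx[i]];
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] src = {10, 20, 30, 40};
        boolean[] mask = {false, true, false, true};
        System.out.println(java.util.Arrays.toString(compress(src, mask)));
        System.out.println(java.util.Arrays.toString(expand(src, mask)));
    }
}
```

In this model, expand places the first packed elements of the source back into the active lanes, so for a fixed mask it undoes compress on those lanes.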
We can use `test/jdk/jdk/incubator/vector/Float256VectorTests.java` to emit
these nodes and the compilation log is as follows:
## CompressM
2aa addi R29, R10, #16 # ptr, #@addP_reg_imm
2ae mcompress V0, V30 # KILL R30
2c2 vstoremask V2, V0
2ce storeV [R7], V2 # vector (rvv)
2d6 bgeu R29, R28, B47 #@cmpP_branch P=0.000100 C=-1.000000
## CompressV
0ee addi R29, R10, #16 # ptr, #@addP_reg_imm
0f2 vcompress V1, V2, V0
0fe storeV [R7], V1 # vector (rvv)
106 bgeu R29, R28, B10 #@cmpP_branch P=0.000100 C=-1.000000
## ExpandV
0ee addi R29, R10, #16 # ptr, #@addP_reg_imm
0f2 vexpand V3, V2, V0
102 storeV [R7], V3 # vector (rvv)
10a bgeu R29, R28, B10 #@cmpP_branch P=0.000100 C=-1.000000
### LoadVectorGather/StoreVectorScatter/LoadVectorGatherMasked/StoreVectorScatterMasked
We use the vsoxei32_v instruction regardless of the SEW setting, because the
indexMap in fromArray is an int array and each index is therefore always 32
bits wide. The index vector holds element indices, whereas vs2 of vsoxei32_v
expects byte offsets, so each index must be multiplied by the element size in
bytes.
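The index-to-offset conversion described above can be sketched in scalar Java as follows. This is a hedged model only: `GatherOffsetSketch`, `toByteOffsets`, `gather` and `demoMemory` are hypothetical names, and a `ByteBuffer` stands in for raw memory at a base address.

```java
// Scalar model of an indexed (gather) load with 32-bit indices; illustrative only.
final class GatherOffsetSketch {
    // Turn element indices into byte offsets: offset = index * element size.
    static int[] toByteOffsets(int[] indexMap, int elemBytes) {
        int[] offsets = new int[indexMap.length];
        for (int i = 0; i < indexMap.length; i++) {
            offsets[i] = indexMap[i] * elemBytes;
        }
        return offsets;
    }

    // Gather floats from mem at base + byte offset, one lane per index.
    static float[] gather(java.nio.ByteBuffer mem, int base, int[] indexMap) {
        int[] offsets = toByteOffsets(indexMap, Float.BYTES);
        float[] out = new float[indexMap.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = mem.getFloat(base + offsets[i]);
        }
        return out;
    }

    // Helper for the demo: count floats (i * 1.5f) stored starting at base.
    static java.nio.ByteBuffer demoMemory(int base, int count) {
        java.nio.ByteBuffer mem = java.nio.ByteBuffer.allocate(base + count * Float.BYTES);
        for (int i = 0; i < count; i++) {
            mem.putFloat(base + i * Float.BYTES, i * 1.5f);
        }
        return mem;
    }

    public static void main(String[] args) {
        java.nio.ByteBuffer mem = demoMemory(16, 8);
        int[] indexMap = {3, 0, 7};
        System.out.println(java.util.Arrays.toString(toByteOffsets(indexMap, Float.BYTES)));
        System.out.println(java.util.Arrays.toString(gather(mem, 16, indexMap)));
    }
}
```

The index width stays at 32 bits whatever the element type is; only the multiplier (`elemBytes`) changes with the data width.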
We can use `test/jdk/jdk/incubator/vector/Float256VectorLoadStoreTests.java` to
emit these nodes and the compilation log is as follows:
## LoadVectorGather
7ee B56: # out( B26 ) <- in( B55 ) Freq: 338.569
7ee spill [sp, #144] -> R7 # spill size = 64
7f0 spill [sp, #192] -> V3 # vector spill size = 256
7f8 gather_load V1, [R7], V3 # KILL V2
808 j B26 #@branch
## StoreVectorScatter
290 loadV V1, [R7] # vector (rvv)
298 addi R7, R8, #16 # ptr, #@addP_reg_imm
29c spill [sp, #32] -> V3 # vector spill size = 256
2a4 scatter_store [R7], V3, V1 # KILL V2
2b4 # pop frame 208
## LoadVectorGatherMasked
41a addi R30, R10, #16 # ptr, #@addP_reg_imm
41e spill [sp, #48] -> V3 # vector spill size = 256
426 gather_load_masked V1, [R7], V3, V0 # KILL V2
43a storeV [R28], V1 # vector (rvv)
442 bgeu R30, R29, B46 #@cmpP_branch P=0.000100 C=-1.000000
## StoreVectorScatterMasked
2ae vloadmask V0, V1
2b6 spill [sp, #8] -> R7 # spill size = 64
2b8 addi R7, R7, #16 # ptr, #@addP_reg_imm
2ba spill [sp, #48] -> V3 # vector spill size = 256
2c2 scatter_store_masked [R7], V3, V2, V0 # KILL V1
2d2 # pop frame 224
### Extract
Extract returns the element at the given index of a vector.
We can use `test/jdk/jdk/incubator/vector/*MaxVectorTests.java` to emit these
nodes and the compilation log is as follows:
## Extract
0fa loadV V1, [R11] # vector (rvv)
102 add R11, R19, R30 # ptr, #@addP_reg_reg
106 extract R15, V1, #0 # KILL V2
112 extract R12, V1, #1 # KILL V2
122 extract R13, V1, #2 # KILL V2
132 bgeu R14, R7, B44 #@cmpU_branch P=0.000001 C=-1.000000
## ExtractL
0fa loadV V1, [R11] # vector (rvv)
102 add R11, R19, R28 # ptr, #@addP_reg_reg
106 extractL R15, V1, #0 # KILL V2
112 extractL R13, V1, #1 # KILL V2
122 extractL R14, V1, #2 # KILL V2
132 bgeu R7, R10, B44 #@cmpU_branch P=0.000001 C=-1.000000
## ExtractF
0fa loadV V1, [R12] # vector (rvv)
102 add R12, R19, R28 # ptr, #@addP_reg_reg
106 extractF F0, V1, #0 # KILL V2
112 extractF F2, V1, #1 # KILL V2
122 extractF F1, V1, #2 # KILL V2
132 bgeu R7, R11, B44 #@cmpU_branch P=0.000001 C=-1.000000
## ExtractD
0fa loadV V1, [R13] # vector (rvv)
102 add R13, R19, R28 # ptr, #@addP_reg_reg
106 extractD F0, V1, #0 # KILL V2
112 extractD F1, V1, #1 # KILL V2
122 extractD F2, V1, #2 # KILL V2
132 bgeu R7, R12, B44 #@cmpU_branch P=0.000001 C=-1.000000
### AndV/OrV/XorV masked
We can use `Byte128VectorTests.java` to emit these nodes and the compilation
log is as follows:
## AndV masked
1d0 B30: # out( B57 B31 ) <- in( B29 ) Freq: 75.1104
1d0 loadV V3, [R15] # vector (rvv)
1d8 vloadmask V0, V1
1e0 vand_masked V2, V3, V0
1e8 spill [sp, #48] -> R14 # spill size = 64
1ea add R14, R14, R31 # ptr, #@addP_reg_reg
1ec addi R31, R14, #16 # ptr, #@addP_reg_imm
1f0 bgeu R9, R29, B57 #@cmpU_branch P=0.000001 C=-1.000000
## OrV masked
1d0 B30: # out( B57 B31 ) <- in( B29 ) Freq: 75.1104
1d0 loadV V3, [R15] # vector (rvv)
1d8 vloadmask V0, V1
1e0 vor_masked V2, V3, V0
1e8 spill [sp, #48] -> R14 # spill size = 64
1ea add R14, R14, R31 # ptr, #@addP_reg_reg
1ec addi R31, R14, #16 # ptr, #@addP_reg_imm
1f0 bgeu R9, R29, B57 #@cmpU_branch P=0.000001 C=-1.000000
## XorV masked
1d0 B30: # out( B57 B31 ) <- in( B29 ) Freq: 75.1104
1d0 loadV V3, [R15] # vector (rvv)
1d8 vloadmask V0, V1
1e0 vxor_masked V2, V3, V0
1e8 spill [sp, #48] -> R14 # spill size = 64
1ea add R14, R14, R31 # ptr, #@addP_reg_reg
1ec addi R31, R14, #16 # ptr, #@addP_reg_imm
1f0 bgeu R9, R29, B57 #@cmpU_branch P=0.000001 C=-1.000000
### VectorLongToMask/VectorMaskToLong
We can use `VectorMaskLoadStoreTest.java` and `Float256VectorTests.java` to
emit these nodes and the compilation log is as follows:
## VectorLongToMask
05e B3: # out( B29 B4 ) <- in( B22 B2 ) Freq: 1
05e vmask_fromlong V0, R30
066 vstoremask V1, V0
072 addi R7, R10, #16 # ptr, #@addP_reg_imm
076 storeV [R7], V1 # vector (rvv)
## VectorMaskToLong
064 addi R7, R7, #16 # ptr, #@addP_reg_imm
066 loadV V1, [R7] # vector (rvv)
06e vloadmask V0, V1
076 vmask_tolong R7, V0
084 li R29, #8 # int, #@loadConI
086 bgeu R12, R29, B5 #@cmpU_branch P=0.000001 C=-1.000000
### PopulateIndex
We need `PopulateIndexNode` to vectorize operations that use the loop
induction variable, by extending the current scope of C2 superword vectorizable
packs, just like [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510).
With this node we can vectorize loops whose operations take the induction
variable as an operand, such as the one below.
for (int i = 0; i < count; i++) {
b[i] = a[i] * i;
}
The final compilation log for the loop above is as follows.
add R16, R12, R15 # ptr, #@addP_reg_reg
addi R17, R16, #16 # ptr, #@addP_reg_imm
loadV V1, [R17] # vector (rvv)
add R15, R14, R15 # ptr, #@addP_reg_reg
addi R17, R15, #16 # ptr, #@addP_reg_imm
addiw R18, R30, #8 #@addI_reg_imm
populateindex V3, R30, #1 # KILL V2, R9
vmul.vv V1, V3, V1 #@vmulI
storeV [R17], V1 # vector (rvv)
HotSpot jtreg has existing tests in `compiler/c2/cr7192963/Test*Vect.java`, and
all of them pass.
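As a rough illustration of what populating an index vector buys, the sketch below models the loop above in scalar Java. It is not the C2 transformation itself: the class and method names are hypothetical, the lane count is a free parameter, and the scalar tail loop is an assumption about how leftover iterations would be handled.

```java
// Scalar model of vectorizing b[i] = a[i] * i with an index vector; illustrative only.
final class PopulateIndexSketch {
    // Model of a populated index vector: {start, start + step, start + 2 * step, ...}.
    static int[] populateIndex(int start, int step, int lanes) {
        int[] idx = new int[lanes];
        for (int l = 0; l < lanes; l++) {
            idx[l] = start + l * step;
        }
        return idx;
    }

    // Reference: the scalar loop from the text.
    static int[] scalar(int[] a) {
        int[] b = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            b[i] = a[i] * i;
        }
        return b;
    }

    // Model of the vectorized loop: each step handles `lanes` elements at once,
    // multiplying a chunk of a by the index vector {i, i + 1, ...}.
    static int[] vectorized(int[] a, int lanes) {
        int[] b = new int[a.length];
        int i = 0;
        for (; i + lanes <= a.length; i += lanes) {
            int[] idx = populateIndex(i, 1, lanes);
            for (int l = 0; l < lanes; l++) {
                b[i + l] = a[i + l] * idx[l];
            }
        }
        for (; i < a.length; i++) { // scalar tail for the leftover elements
            b[i] = a[i] * i;
        }
        return b;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5};
        System.out.println(java.util.Arrays.toString(scalar(a)));
        System.out.println(java.util.Arrays.toString(vectorized(a, 4)));
    }
}
```

Both methods produce the same array; the vectorized model only differs in computing the index values one chunk at a time.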
### VectorMaskTrueCount/VectorMaskFirstTrue
We can use `Double128VectorTests.java` to emit these nodes and the compilation
log is as follows:
## VectorMaskTrueCount
050 addi R7, R7, #16 # ptr, #@addP_reg_imm
052 loadV V1, [R7] # vector (rvv)
05a vloadmask V0, V1
062 vmask_truecount R10, V0
06a # pop frame 32
## VectorMaskFirstTrue
070 loadV V1, [R7] # vector (rvv)
078 vmv.v.i V2, #0 #@replicateL_imm5
080 spill V1 -> V3 # vector spill size = 256
084 # reinterpret V3 # do nothing
084 vmaskcmp V0, V3, V2, #4
090 vmask_firsttrue R8, V0 # KILL V30
09c li R28, #2 # int, #@loadConI
09e bge R8, R28, B42 #@cmpI_branch P=0.000000 C=5952.000000
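For reference, the results these two nodes are expected to produce can be modelled in scalar Java as below. The class and method names are hypothetical, and the model only captures the Vector API semantics (`firstTrue` yields the lane count when no lane is set). RVV's vfirst.m returns -1 in that case, so an implementation built on it would need some fix-up; how this PR handles that is not shown in the log above.

```java
// Scalar model of VectorMaskTrueCount / VectorMaskFirstTrue results; illustrative only.
final class MaskQuerySketch {
    // Number of set lanes in the mask.
    static int trueCount(boolean[] mask) {
        int count = 0;
        for (boolean lane : mask) {
            if (lane) {
                count++;
            }
        }
        return count;
    }

    // Index of the first set lane, or the lane count if no lane is set.
    static int firstTrue(boolean[] mask) {
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) {
                return i;
            }
        }
        return mask.length;
    }

    public static void main(String[] args) {
        boolean[] mask = {false, false, true, true};
        System.out.println(trueCount(mask));
        System.out.println(firstTrue(mask));
        System.out.println(firstTrue(new boolean[]{false, false}));
    }
}
```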
### VectorInsert
We can use `test/hotspot/jtreg/compiler/vectorapi/TestVectorInsertByte.java` to
emit the `insertI_index_lt32` node; the compilation log is as follows:
05e B4: # out( B13 B5 ) <- in( B3 ) Freq: 0.999997
05e loadV V1, [R30] # vector (rvv)
066 li R28, #0 # int, #@loadConI
068 lwu R29, [R7, #120] # loadN, compressed ptr, #@loadN ! Field: TestVectorInsertByte.rb
06c decode_heap_oop R29, R29 #@decodeHeapOop
06e insertI_index_lt32 V1, V1, R28, #0
082 lwu R7, [R29, #12] # range, #@loadRange
086 NullCheck R29
To cover the case where idx is greater than 31, we need to modify
`TestVectorInsertByte.java`:
diff --git a/test/hotspot/jtreg/compiler/vectorapi/TestVectorInsertByte.java b/test/hotspot/jtreg/compiler/vectorapi/TestVectorInsertByte.java
index 7969b7bea40..480d6bec074 100644
--- a/test/hotspot/jtreg/compiler/vectorapi/TestVectorInsertByte.java
+++ b/test/hotspot/jtreg/compiler/vectorapi/TestVectorInsertByte.java
@@ -51,7 +51,7 @@ public class TestVectorInsertByte {
static void testByteVectorInsert() {
ByteVector av = ByteVector.fromArray(SPECIESb, ab, 0);
- av = av.withLane(0, (byte) (0));
+ av = av.withLane(32, (byte) (0));
av.intoArray(rb, 0);
}
Then the compilation log is as follows:
060 B4: # out( B13 B5 ) <- in( B3 ) Freq: 0.999997
060 loadV V1, [R30] # vector (rvv)
068 li R28, #0 # int, #@loadConI
06a lwu R29, [R7, #120] # loadN, compressed ptr, #@loadN ! Field: TestVectorInsertByte.rb
06e decode_heap_oop R29, R29 #@decodeHeapOop
070 insertI_index V1, V1, R28, #32 # KILL R7, V2
088 lwu R28, [R29, #12] # range, #@loadRange
08c NullCheck R29
### MaskAll masked
On SVE, the `shuffleTest()` case in `Int64VectorTests.java` emits
vmaskAllI_masked: the function `vector_needs_partial_operations` decides
whether the masked vmaskAllI node is emitted. RISC-V uses vsetvl to set the
vector length, so it does not need partial operations. However, we can still
use `vector_needs_partial_operations` to cover vmaskAllI_masked for testing.
Apply the following patch:
diff --git a/src/hotspot/cpu/riscv/riscv.ad b/src/hotspot/cpu/riscv/riscv.ad
index 6c5ceb9c359..b4ef13768fc 100644
--- a/src/hotspot/cpu/riscv/riscv.ad
+++ b/src/hotspot/cpu/riscv/riscv.ad
@@ -1968,7 +1968,19 @@ const bool Matcher::match_rule_supported_vector_masked(int opcode, int vlen, Bas
}
const bool Matcher::vector_needs_partial_operations(Node* node, const TypeVect* vt) {
- return false;
+ if (UseRVV == 0) {
+ return false;
+ }
+ switch(node->Opcode()) {
+ case Op_MaskAll:
+ return !node->in(1)->is_Con();
+ default:
+ return false;
+ }
}
const bool Matcher::vector_needs_load_shuffle(BasicType elem_bt, int vlen) {
Then the compilation log is as follows:
0c8 B7: # out( B13 B8 ) <- in( B12 B6 ) Freq: 0.999999
0c8 addi R7, R30, #16 # ptr, #@addP_reg_imm
0cc vmask_gen_imm V0, #2
0d4 vmaskAllI_masked V30, R31, V0 # KILL V1
0e4 spill V30 -> V0 # vmask spill size = 32
0e8 vstoremask V1, V0 # elem size is #4 byte[s]
0f4 storeV [R7], V1 # vector (rvv)
[1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc
## Testing:
qemu with UseRVV:
- [ ] Tier1 tests (release)
- [ ] Tier2 tests (release)
- [ ] Tier3 tests (release)
- [x] test/jdk/jdk/incubator/vector (fastdebug)
- [x] test/hotspot/jtreg/compiler/c2/cr7192963/Test*Vect.java
-------------
Commit messages:
- Remove VectorTest
- Merge remote-tracking branch 'upstream/master' into JDK-8307609
- Optimize vmask_gen_imm
- Add VectorTest
- Fix some vsetvli_helper location
- Remove useless INSN and simplify gather load
- Refactor match_rule_supported_vector
- 8307609: RISC-V: Added support for Extract, Compress, Expand and other nodes for Vector API
Changes: https://git.openjdk.org/jdk/pull/13862/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13862&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8307609
Stats: 1591 lines in 6 files changed: 1425 ins; 107 del; 59 mod
Patch: https://git.openjdk.org/jdk/pull/13862.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/13862/head:pull/13862
PR: https://git.openjdk.org/jdk/pull/13862