RFR: 8307609: RISC-V: Added support for Extract, Compress, Expand and other nodes for Vector API

Fri May 12 11:17:47 UTC 2023

Hi all,

We have added support for Extract, Compress, Expand and other nodes for the
Vector API. The implementation follows RVV v1.0 [1]. Please take a look;
reviews are welcome. Thanks a lot.

This PR adds support for the following nodes:

CompressM/CompressV/ExpandV
LoadVectorGather/StoreVectorScatter/LoadVectorGatherMasked/StoreVectorScatterMasked
Extract
VectorLongToMask/VectorMaskToLong
PopulateIndex
VectorMaskTrueCount/VectorMaskFirstTrue
VectorInsert


At the same time, we refactored methods such as
`match_rule_supported_vector_masked`. All implemented vector nodes now support
masked operations by default, so we also added masked versions of all
implemented nodes.

The VectorTest node will be implemented in a follow-up PR.

Most of the new nodes can be exercised by the tests under
`test/jdk/jdk/incubator/vector`. The following command prints the compilation
log of a jtreg test case:


$ jtreg \
-v:default \
-concurrency:16 -timeout:50 \
-javaoption:-XX:+UnlockExperimentalVMOptions \
-javaoption:-XX:+UseRVV \
-javaoption:-XX:+PrintOptoAssembly \
-javaoption:-XX:LogFile=log_name.log \
-jdk:build/linux-riscv64-server-fastdebug/jdk \
-compilejdk:build/linux-x86_64-server-release/images/jdk \
<test-case>




### CompressM/CompressV/ExpandV

RVV provides no inverse vdecompress instruction, so `ExpandV` synthesizes the
operation using iota and a masked vrgather.
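
The iota-plus-masked-gather idea can be illustrated with a scalar model. This is only a sketch of the lane semantics, not the code in this PR: the class and method names are hypothetical, and zero-filling of inactive lanes is assumed from the Vector API definition of `compress`/`expand`.

```java
import java.util.Arrays;

// Hypothetical scalar model; names are illustrative and not taken from the PR.
final class ExpandSketch {
    // Models iota: out[i] = number of set mask lanes in [0, i).
    static int[] iota(boolean[] mask) {
        int[] out = new int[mask.length];
        int count = 0;
        for (int i = 0; i < mask.length; i++) {
            out[i] = count;
            if (mask[i]) {
                count++;
            }
        }
        return out;
    }

    // Models a masked gather: active lanes read src[idx[i]],
    // inactive lanes stay zero (assumed, per Vector API expand semantics).
    static float[] expand(float[] src, boolean[] mask) {
        int[] idx = iota(mask);
        float[] dst = new float[src.length];
        for (int i = 0; i < src.length; i++) {
            if (mask[i]) {
                dst[i] = src[idx[i]];
            }
        }
        return dst;
    }

    // Models compress: selected lanes are packed into the low lanes, rest zero.
    static float[] compress(float[] src, boolean[] mask) {
        float[] dst = new float[src.length];
        int j = 0;
        for (int i = 0; i < src.length; i++) {
            if (mask[i]) {
                dst[j++] = src[i];
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        float[] src = {1, 2, 3, 4, 5, 6, 7, 8};
        boolean[] mask = {false, true, false, true, true, false, false, true};
        System.out.println(Arrays.toString(compress(src, mask)));
        System.out.println(Arrays.toString(expand(src, mask)));
    }
}
```

In this model, expanding a compressed vector with the same mask restores the original values in the active lanes, which is the sense in which expand is the inverse of compress.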

We can use `test/jdk/jdk/incubator/vector/Float256VectorTests.java` to emit
these nodes and the compilation log is as follows:


## CompressM
2aa     addi  R29, R10, #16	# ptr, #@addP_reg_imm
2ae     mcompress V0, V30	# KILL R30
2c2     vstoremask V2, V0
2ce     storeV [R7], V2	# vector (rvv)
2d6     bgeu  R29, R28, B47	#@cmpP_branch  P=0.000100 C=-1.000000

## CompressV
0ee     addi  R29, R10, #16	# ptr, #@addP_reg_imm
0f2     vcompress V1, V2, V0
0fe     storeV [R7], V1	# vector (rvv)
106     bgeu  R29, R28, B10	#@cmpP_branch  P=0.000100 C=-1.000000

## ExpandV
0ee     addi  R29, R10, #16	# ptr, #@addP_reg_imm
0f2     vexpand V3, V2, V0
102     storeV [R7], V3	# vector (rvv)
10a     bgeu  R29, R28, B10	#@cmpP_branch  P=0.000100 C=-1.000000




### LoadVectorGather/StoreVectorScatter/LoadVectorGatherMasked/StoreVectorScatterMasked

We use the vsoxei32_v instruction regardless of the current SEW setting: the
indexMap passed to fromArray is an int array, so each index is always 32 bits
wide. The index vector holds element indices, whereas the vs2 operand of
vsoxei32_v expects byte offsets, so each index must be multiplied by the
element size in bytes.
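
The index-to-offset conversion can be sketched in scalar Java. This is an illustrative model under assumptions, not the PR's implementation: a `ByteBuffer` stands in for memory, and all class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical scalar model; names are illustrative and not taken from the PR.
final class GatherOffsetSketch {
    // Element indices -> byte offsets, as an indexed load/store expects.
    static int[] toByteOffsets(int[] indexMap, int elemBytes) {
        int[] offsets = new int[indexMap.length];
        for (int i = 0; i < indexMap.length; i++) {
            offsets[i] = indexMap[i] * elemBytes;
        }
        return offsets;
    }

    // Models a gather: each lane loads from base + byte offset.
    static float[] gather(ByteBuffer mem, int[] byteOffsets) {
        float[] dst = new float[byteOffsets.length];
        for (int i = 0; i < byteOffsets.length; i++) {
            dst[i] = mem.getFloat(byteOffsets[i]);
        }
        return dst;
    }

    // Models a scatter: each lane stores to base + byte offset.
    static void scatter(ByteBuffer mem, int[] byteOffsets, float[] values) {
        for (int i = 0; i < byteOffsets.length; i++) {
            mem.putFloat(byteOffsets[i], values[i]);
        }
    }

    public static void main(String[] args) {
        ByteBuffer mem = ByteBuffer.allocate(8 * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < 8; i++) {
            mem.putFloat(i * Float.BYTES, 10.0f * i);
        }
        int[] indexMap = {3, 0, 7, 1};
        int[] offsets = toByteOffsets(indexMap, Float.BYTES);
        System.out.println(Arrays.toString(offsets));
        System.out.println(Arrays.toString(gather(mem, offsets)));
    }
}
```

The 32-bit index width is independent of the element width: the same int indices would be scaled by 8 for long or double elements.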

We can use `test/jdk/jdk/incubator/vector/Float256VectorLoadStoreTests.java` to
emit these nodes and the compilation log is as follows:


## LoadVectorGather
7ee     B56: #	out( B26 ) <- in( B55 )  Freq: 338.569
7ee     spill [sp, #144] -> R7	# spill size = 64
7f0     spill [sp, #192] -> V3	# vector spill size = 256
7f8     gather_load V1, [R7], V3	# KILL V2
808     j  B26	#@branch

## StoreVectorScatter
290     loadV V1, [R7]	# vector (rvv)
298     addi  R7, R8, #16	# ptr, #@addP_reg_imm
29c     spill [sp, #32] -> V3	# vector spill size = 256
2a4     scatter_store [R7], V3, V1	# KILL V2
2b4     # pop frame 208

## LoadVectorGatherMasked
41a     addi  R30, R10, #16	# ptr, #@addP_reg_imm
41e     spill [sp, #48] -> V3	# vector spill size = 256
426     gather_load_masked V1, [R7], V3, V0	# KILL V2
43a     storeV [R28], V1	# vector (rvv)
442     bgeu  R30, R29, B46	#@cmpP_branch  P=0.000100 C=-1.000000

## StoreVectorScatterMasked
2ae     vloadmask V0, V1
2b6     spill [sp, #8] -> R7	# spill size = 64
2b8     addi  R7, R7, #16	# ptr, #@addP_reg_imm
2ba     spill [sp, #48] -> V3	# vector spill size = 256
2c2     scatter_store_masked [R7], V3, V2, V0	# KILL V1
2d2     # pop frame 224


### Extract

Extract returns the element of a vector at the given index.

We can use `test/jdk/jdk/incubator/vector/*MaxVectorTests.java` to emit these
nodes and the compilation log is as follows:


## Extract
0fa     loadV V1, [R11]	# vector (rvv)
102     add R11, R19, R30	# ptr, #@addP_reg_reg
106     extract R15, V1, #0	# KILL V2
112     extract R12, V1, #1	# KILL V2
122     extract R13, V1, #2	# KILL V2
132     bgeu  R14, R7, B44	#@cmpU_branch  P=0.000001 C=-1.000000

## ExtractL
0fa     loadV V1, [R11]	# vector (rvv)
102     add R11, R19, R28	# ptr, #@addP_reg_reg
106     extractL R15, V1, #0	# KILL V2
112     extractL R13, V1, #1	# KILL V2
122     extractL R14, V1, #2	# KILL V2
132     bgeu  R7, R10, B44	#@cmpU_branch  P=0.000001 C=-1.000000

## ExtractF
0fa     loadV V1, [R12]	# vector (rvv)
102     add R12, R19, R28	# ptr, #@addP_reg_reg
106     extractF F0, V1, #0	# KILL V2
112     extractF F2, V1, #1	# KILL V2
122     extractF F1, V1, #2	# KILL V2
132     bgeu  R7, R11, B44	#@cmpU_branch  P=0.000001 C=-1.000000

## ExtractD
0fa     loadV V1, [R13]	# vector (rvv)
102     add R13, R19, R28	# ptr, #@addP_reg_reg
106     extractD F0, V1, #0	# KILL V2
112     extractD F1, V1, #1	# KILL V2
122     extractD F2, V1, #2	# KILL V2
132     bgeu  R7, R12, B44	#@cmpU_branch  P=0.000001 C=-1.000000


### AndV/OrV/XorV masked

We can use `Byte128VectorTests.java` to emit these nodes and the compilation
log is as follows:


## AndV masked
1d0     B30: #	out( B57 B31 ) <- in( B29 )  Freq: 75.1104
1d0     loadV V3, [R15]	# vector (rvv)
1d8     vloadmask V0, V1
1e0     vand_masked V2, V3, V0
1e8     spill [sp, #48] -> R14	# spill size = 64
1ea     add R14, R14, R31	# ptr, #@addP_reg_reg
1ec     addi  R31, R14, #16	# ptr, #@addP_reg_imm
1f0     bgeu  R9, R29, B57	#@cmpU_branch  P=0.000001 C=-1.000000

## OrV masked
1d0     B30: #	out( B57 B31 ) <- in( B29 )  Freq: 75.1104
1d0     loadV V3, [R15]	# vector (rvv)
1d8     vloadmask V0, V1
1e0     vor_masked V2, V3, V0
1e8     spill [sp, #48] -> R14	# spill size = 64
1ea     add R14, R14, R31	# ptr, #@addP_reg_reg
1ec     addi  R31, R14, #16	# ptr, #@addP_reg_imm
1f0     bgeu  R9, R29, B57	#@cmpU_branch  P=0.000001 C=-1.000000

## XorV masked
1d0     B30: #	out( B57 B31 ) <- in( B29 )  Freq: 75.1104
1d0     loadV V3, [R15]	# vector (rvv)
1d8     vloadmask V0, V1
1e0     vxor_masked V2, V3, V0
1e8     spill [sp, #48] -> R14	# spill size = 64
1ea     add R14, R14, R31	# ptr, #@addP_reg_reg
1ec     addi  R31, R14, #16	# ptr, #@addP_reg_imm
1f0     bgeu  R9, R29, B57	#@cmpU_branch  P=0.000001 C=-1.000000


### VectorLongToMask/VectorMaskToLong

We can use `VectorMaskLoadStoreTest.java` and `Float256VectorTests.java` to
emit these nodes and the compilation log is as follows:


## VectorLongToMask
05e     B3: #	out( B29 B4 ) <- in( B22 B2 )  Freq: 1
05e     vmask_fromlong V0, R30
066     vstoremask V1, V0
072     addi  R7, R10, #16	# ptr, #@addP_reg_imm
076     storeV [R7], V1	# vector (rvv)

## VectorMaskToLong
064     addi  R7, R7, #16	# ptr, #@addP_reg_imm
066     loadV V1, [R7]	# vector (rvv)
06e     vloadmask V0, V1
076     vmask_tolong R7, V0
084     li R29, #8	# int, #@loadConI
086     bgeu  R12, R29, B5	#@cmpU_branch  P=0.000001 C=-1.000000
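
For reference, the lane semantics of these two conversions can be modelled in scalar Java. This is a sketch assuming the Vector API convention that bit i of the long corresponds to mask lane i; the class and method names are hypothetical and at most 64 lanes are assumed.

```java
import java.util.Arrays;

// Hypothetical scalar model; names are illustrative and not taken from the PR.
final class MaskLongSketch {
    // Models VectorMaskToLong: bit i of the result is set iff lane i is set.
    static long maskToLong(boolean[] mask) {
        long bits = 0L;
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) {
                bits |= 1L << i;
            }
        }
        return bits;
    }

    // Models VectorLongToMask: lane i is set iff bit i of the input is set.
    static boolean[] longToMask(long bits, int lanes) {
        boolean[] mask = new boolean[lanes];
        for (int i = 0; i < lanes; i++) {
            mask[i] = ((bits >>> i) & 1L) != 0;
        }
        return mask;
    }

    public static void main(String[] args) {
        boolean[] mask = {true, false, true, true};
        long bits = maskToLong(mask);
        System.out.println(Long.toBinaryString(bits));
        System.out.println(Arrays.toString(longToMask(bits, mask.length)));
    }
}
```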


### PopulateIndex

`PopulateIndexNode` is needed to vectorize operations that use the loop
induction variable. It extends the current scope of C2 superword vectorizable
packs, as in [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510).

With this, loops that use the induction variable as an operand can be
vectorized, for example:


  for (int i = 0; i < count; i++) {
    b[i] = a[i] * i;
  }


The final compilation log for the loop above is as follows:


add R16, R12, R15	# ptr, #@addP_reg_reg
addi  R17, R16, #16	# ptr, #@addP_reg_imm
loadV V1, [R17]	# vector (rvv)
add R15, R14, R15	# ptr, #@addP_reg_reg
addi  R17, R15, #16	# ptr, #@addP_reg_imm
addiw  R18, R30, #8	#@addI_reg_imm
populateindex V3, R30, #1	# KILL V2, R9
vmul.vv V1, V3, V1	#@vmulI
storeV [R17], V1	# vector (rvv)


The existing HotSpot jtreg tests in `compiler/c2/cr7192963/Test*Vect.java`
all pass.
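
What the populateindex step contributes to the vectorized loop can be sketched with a scalar model. This is an assumption-labelled illustration, not C2's actual transformation: the lane count of 8 is read off the `addiw R18, R30, #8` line in the log, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical scalar model; names are illustrative and not taken from the PR.
final class PopulateIndexSketch {
    // Models populateindex: lane l holds start + l * step.
    static int[] populateIndex(int start, int step, int lanes) {
        int[] out = new int[lanes];
        for (int l = 0; l < lanes; l++) {
            out[l] = start + l * step;
        }
        return out;
    }

    // Models the vectorized form of: b[i] = a[i] * i
    static void mulByIndex(int[] a, int[] b, int lanes) {
        int i = 0;
        for (; i + lanes <= a.length; i += lanes) {
            int[] iv = populateIndex(i, 1, lanes); // index vector for this chunk
            for (int l = 0; l < lanes; l++) {
                b[i + l] = a[i + l] * iv[l];       // lane-wise multiply
            }
        }
        for (; i < a.length; i++) {                // scalar tail
            b[i] = a[i] * i;
        }
    }

    public static void main(String[] args) {
        int[] a = new int[19];
        Arrays.fill(a, 3);
        int[] b = new int[a.length];
        mulByIndex(a, b, 8);
        System.out.println(Arrays.toString(b));
    }
}
```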


### VectorMaskTrueCount/VectorMaskFirstTrue

We can use `Double128VectorTests.java` to emit these nodes and the compilation
log is as follows:


## VectorMaskTrueCount
050     addi  R7, R7, #16	# ptr, #@addP_reg_imm
052     loadV V1, [R7]	# vector (rvv)
05a     vloadmask V0, V1
062     vmask_truecount R10, V0
06a     # pop frame 32

## VectorMaskFirstTrue
070     loadV V1, [R7]	# vector (rvv)
078     vmv.v.i  V2, #0	#@replicateL_imm5
080     spill V1 -> V3	# vector spill size = 256
084     # reinterpret V3	# do nothing
084     vmaskcmp V0, V3, V2, #4
090     vmask_firsttrue R8, V0	# KILL V30
09c     li R28, #2	# int, #@loadConI
09e     bge  R8, R28, B42	#@cmpI_branch  P=0.000000 C=5952.000000



### VectorInsert

We can use `test/hotspot/jtreg/compiler/vectorapi/TestVectorInsertByte.java` to
emit the `insertI_index_lt32` node; the compilation log is as follows:


05e     B4: # out( B13 B5 ) <- in( B3 )  Freq: 0.999997
05e     loadV V1, [R30] # vector (rvv)
066     li R28, #0 # int, #@loadConI
068     lwu  R29, [R7, #120] # loadN, compressed ptr, #@loadN ! Field: TestVectorInsertByte.rb
06c     decode_heap_oop  R29, R29 #@decodeHeapOop
06e     insertI_index_lt32 V1, V1, R28, #0
082     lwu  R7, [R29, #12] # range, #@loadRange
086     NullCheck R29


To cover the case where the index is greater than 31, we need to modify
`TestVectorInsertByte.java`:

diff --git a/test/hotspot/jtreg/compiler/vectorapi/TestVectorInsertByte.java b/test/hotspot/jtreg/compiler/vectorapi/TestVectorInsertByte.java
index 7969b7bea40..480d6bec074 100644
--- a/test/hotspot/jtreg/compiler/vectorapi/TestVectorInsertByte.java
+++ b/test/hotspot/jtreg/compiler/vectorapi/TestVectorInsertByte.java
@@ -51,7 +51,7 @@ public class TestVectorInsertByte {
 
     static void testByteVectorInsert() {
         ByteVector av = ByteVector.fromArray(SPECIESb, ab, 0);
-        av = av.withLane(0, (byte) (0));
+        av = av.withLane(32, (byte) (0));
         av.intoArray(rb, 0);
     }


Then the compilation log is as follows:


060     B4: # out( B13 B5 ) <- in( B3 )  Freq: 0.999997
060     loadV V1, [R30] # vector (rvv)
068     li R28, #0 # int, #@loadConI
06a     lwu  R29, [R7, #120] # loadN, compressed ptr, #@loadN ! Field: TestVectorInsertByte.rb
06e     decode_heap_oop  R29, R29 #@decodeHeapOop
070     insertI_index V1, V1, R28, #32 # KILL R7, V2
088     lwu  R28, [R29, #12] # range, #@loadRange
08c     NullCheck R29



### MaskAll masked

On SVE, the `shuffleTest()` case in `Int64VectorTests.java` emits
vmaskAllI_masked, because `vector_needs_partial_operations` decides that a
masked vmaskAllI node is required. RISC-V uses vsetvl to set the vector
length, so it does not need partial operations. However,
`vector_needs_partial_operations` can still be used to cover
vmaskAllI_masked by applying the following patch:


diff --git a/src/hotspot/cpu/riscv/riscv.ad b/src/hotspot/cpu/riscv/riscv.ad
index 6c5ceb9c359..b4ef13768fc 100644
--- a/src/hotspot/cpu/riscv/riscv.ad
+++ b/src/hotspot/cpu/riscv/riscv.ad
@@ -1968,7 +1968,19 @@ const bool Matcher::match_rule_supported_vector_masked(int opcode, int vlen, Bas
 }
 
 const bool Matcher::vector_needs_partial_operations(Node* node, const TypeVect* vt) {
-  return false;
+  if (UseRVV == 0) {
+      return false;
+    }
+  switch(node->Opcode()) {
+    case Op_MaskAll:
+        return !node->in(1)->is_Con();
+    default:
+      return false;
+  }
 }
 
 const bool Matcher::vector_needs_load_shuffle(BasicType elem_bt, int vlen) {


Then the compilation log is as follows:


0c8     B7: #	out( B13 B8 ) <- in( B12 B6 )  Freq: 0.999999
0c8     addi  R7, R30, #16	# ptr, #@addP_reg_imm
0cc     vmask_gen_imm V0, #2
0d4     vmaskAllI_masked V30, R31, V0	# KILL V1
0e4     spill V30 -> V0	# vmask spill size = 32
0e8     vstoremask V1, V0 # elem size is #4 byte[s]
0f4     storeV [R7], V1	# vector (rvv)



[1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc

## Testing:
qemu with UseRVV:
- [ ] Tier1 tests (release)
- [ ] Tier2 tests (release)
- [ ] Tier3 tests (release)
- [x] test/jdk/jdk/incubator/vector (fastdebug)
- [x] test/hotspot/jtreg/compiler/c2/cr7192963/Test*Vect.java

-------------

Commit messages:
 - Remove VectorTest
 - Merge remote-tracking branch 'upstream/master' into JDK-8307609
 - Optimize vmask_gen_imm
 - Add VectorTest
 - FFix some vsetvli_helper location
 - Remove useless INSN and simplify gather load
 - Refactor match_rule_supported_vector
 - 8307609: RISC-V: Added support for Extract, Compress, Expand and other nodes for Vector API

Changes: https://git.openjdk.org/jdk/pull/13862/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=13862&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8307609
  Stats: 1591 lines in 6 files changed: 1425 ins; 107 del; 59 mod
  Patch: https://git.openjdk.org/jdk/pull/13862.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/13862/head:pull/13862

PR: https://git.openjdk.org/jdk/pull/13862