[vectorIntrinsics] Vector API for RISC-V
zifeihan
caogui at iscas.ac.cn
Tue Sep 20 06:24:49 UTC 2022
Hi Paul,
Thank you for your reply and suggestions.
1. We will be submitting the code to https://github.com/openjdk/jdk next.
2. We will study and research Valhalla, and look forward to participating in it in the future.
3. We appreciate the JMH benchmark tests provided by the Panama repository, which will make it easier for us to do some performance verification and testing.
4. We have previously discussed asserting on IR node generation in tests, but we did not find a reasonable way to do it. We will try to verify it with HotSpot's IR Test Framework.
Thanks again,
zifeihan
> On Sep 20, 2022, at 02:34, Paul Sandoz <paul.sandoz at oracle.com> wrote:
>
> Hi,
>
> Thank you, this is very encouraging and looks like a reasonable plan; some suggestions are below. Support for the Vector API should more easily result in better support for the auto-vectorizer.
>
> 1. I think you can submit PRs to https://github.com/openjdk/jdk/ and then those changes can be brought into the Panama repo if need be. That assumes support for RISC-V V extension does not require substantial adjustments to C2 or the API, and from what you say RISC-V does not require such adjustments.
> Note: going forward I expect most architectural development to focus on alignment with Valhalla’s value classes/types and support for vector calling conventions. There is also work to research support for FP16 vectors, which is also connected with Valhalla, which can be considered more incremental.
>
> 2. The Panama repository also has support for generating JMH benchmarks in addition to unit tests, you may find those helpful, rather than writing your own.
> Testing-wise, I would have liked to revamp the test framework to generate Java tests from Java code and leverage HotSpot’s IR Test Framework [1]. Alas, I don’t have the time right now.
> We could do more to align with HotSpot’s IR framework to not only assert on results, but also assert that C2 IR nodes are generated. (It may be that the test generator needs to query the platform for supported vector nodes, via, say, enhancements to the WhiteBox API.)
> While JMH performance tests have their place, using the IR framework is, I think, a better approach for testing in the longer term.
>
> Paul.
>
> [1] https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/README.md
>
>
>> On Sep 19, 2022, at 7:10 AM, zifeihan <caogui at iscas.ac.cn> wrote:
>>
>> # Summary
>>
>> The implementation of vector nodes plays an important role in the implementation of the Vector API. In the current RISC-V backend of OpenJDK, some vector nodes have already been implemented using the RISC-V V extension, e.g. `LoadVector`, `StoreVector`, `AddVB`, and so on. With these vector nodes, the C2 compiler can execute some specific vector computations faster. However, the set of implemented vector nodes is still incomplete compared to AArch64's SVE/NEON and x86's AVX-512; missing nodes include, for example, `Op_LoadVectorGather`, `Op_StoreVectorScatter`, `AndReductionV`, and so on.
>> Therefore, we first want to implement more vector nodes for the RISC-V backend of OpenJDK based on the RISC-V V extension.
>>
>> # Status
>>
>> As we understand it, C2 vector nodes backed by the RISC-V V extension let a program use V-extension instructions at run time. Because each instruction processes multiple data elements (single instruction, multiple data), fewer instructions are needed and the program runs faster. The Vector API already works on the OpenJDK RISC-V platform: where a C2 vector node is not implemented, C2 falls back to normal (non-vector) C2 nodes. The missing vector nodes therefore do not make the Vector API unavailable on RISC-V; they only prevent those operations from using the V extension.
>>
>> https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests
>>
>> This test performs AndReduce operations on a set of data. By printing the C2 compilation log of the method, we can see that the method is compiled by C2, but it is implemented using normal C2 nodes and does not use the RISC-V V extension.
>>
>> # Example
>>
>> The following implementation of AndReduce for the Vector API uses the RISC-V V extension, which provides 32 vector registers and a set of instructions to manipulate them. These instructions enable vectorized operations similar to AArch64's SVE. Most of them operate on data in vector registers, but some also take a scalar (general-purpose) register operand, for example `vop.vx vd, vs2, rs1, vm # integer vector-scalar vd[i] = vs2[i] op x[rs1]`. For instructions that operate only on vector registers, the data must first be loaded into vector registers. The Vector API's AndReduce is similar to the existing AddReduce: it loads data from memory or scalar registers into vector registers, operates on the vector registers, and finally moves the result to a scalar register. Since vector loads and stores are already implemented (src/hotspot/cpu/riscv/riscv_v.ad), we follow `AddReductionVI` and implement `AndReductionV`, the main node behind AndReduce in the Vector API.
>>
>> ```
>> instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{
>>   predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT);
>>   match(Set dst (AndReductionV src1 src2));
>>   effect(TEMP tmp);
>>   ins_cost(VEC_COST);
>>   format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t"
>>             "vredand.vs $tmp, $src2, $tmp\n\t"
>>             "vmv.x.s $dst, $tmp" %}
>>   ins_encode %{
>>     __ vsetvli(t0, x0, Assembler::e32);
>>     __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register);
>>     __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg),
>>                   as_VectorRegister($tmp$$reg));
>>     __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg));
>>   %}
>>   ins_pipe(pipe_slow);
>> %}
>> ```
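
As a rough illustration of the instruction sequence above, the following plain-Java sketch models what the three instructions compute, lane by lane. It is a scalar model under stated assumptions, not HotSpot code: the class `ReduceAndSketch` and the methods `reduceAndI` and `vopVx` are hypothetical names introduced only for this example, and an `int[]` stands in for a vector register holding 32-bit elements.

```java
import java.util.function.IntBinaryOperator;

public class ReduceAndSketch {

    // Hypothetical scalar model of the reduce_andI sequence.
    // src1 is the scalar input, src2 models the lanes of a vector register.
    static int reduceAndI(int src1, int[] src2) {
        int[] tmp = new int[Math.max(1, src2.length)];

        // vmv.s.x tmp, src1 : write the scalar into element 0 of tmp
        tmp[0] = src1;

        // vredand.vs tmp, src2, tmp : tmp[0] = tmp[0] & src2[0] & src2[1] & ...
        int acc = tmp[0];
        for (int lane : src2) {
            acc &= lane;
        }
        tmp[0] = acc;

        // vmv.x.s dst, tmp : read element 0 of tmp back into a scalar
        return tmp[0];
    }

    // Hypothetical model of the vector-scalar form: vd[i] = vs2[i] op x[rs1]
    static int[] vopVx(int[] vs2, int rs1, IntBinaryOperator op) {
        int[] vd = new int[vs2.length];
        for (int i = 0; i < vs2.length; i++) {
            vd[i] = op.applyAsInt(vs2[i], rs1);
        }
        return vd;
    }

    public static void main(String[] args) {
        // Eight int lanes, as in a 256-bit vector of 32-bit elements.
        int[] lanes = {0xFF, 0x0F, 0x3C, 0x7E, 0xFF, 0xFF, 0xFF, 0xFF};
        // -1 (all bits set) is the identity for AND.
        System.out.println(reduceAndI(-1, lanes)); // prints 12
    }
}
```

Passing `-1` as `src1` leaves the result determined by the lanes alone, since all-ones is the identity value for a bitwise AND reduction.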
>>
>> Only the `T_INT` element type is handled here; `T_BYTE`, `T_SHORT`, and `T_LONG` are implemented in separate nodes. With this node in place and the RISC-V V extension enabled, we printed the compilation log of the https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests method again. It shows that the method now matches the new `AndReductionV` node.
>>
>> ```
>> 27c B21: # out( B25 B22 ) <- in( B20 ) Freq: 32.4376
>> 27c # castII of R8, #@castII
>> 27c addw R7, R8, zr #@convI2L_reg_reg
>> 280 slli R29, R7, (#2 & 0x3f) #@lShiftL_reg_imm
>> 284 spill [sp, #24] -> R7 # spill size = 64
>> 288 add R7, R7, R29 # ptr, #@addP_reg_reg
>> 28c addi R7, R7, #16 # ptr, #@addP_reg_imm
>> 290 vle V2, [R7] #@loadV
>> 298 ....
>> 2c0 vmv.s.x V1, R7 #@reduce_andI
>> vredand.vs V1, V2, V1
>> vmv.x.s R28, V1
>> ```
>>
>> # Test tips
>>
>> 1. After implementing each vector node, write test cases for that node, perform rigorous functional testing, and run the complete vector tests in jtreg.
>> 2. Print the compilation log of the Java test method that uses the vector node, and analyze it to confirm that the C2 vector node is matched as expected.
>> 3. We plan to add JMH test cases for each C2 vector node to compare performance before and after the node is added.
>> 4. Since we have not found a physical machine capable of executing the RISC-V V extension, the above tests were performed in QEMU with the RISC-V V extension v1.0 enabled.
>>
>> # Performance Test
>>
>> We again use https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests to compare performance before and after the RISC-V V extension implementation is added.
>>
>> The methods ADDReduceInt256VectorTests, ANDReduceInt256VectorTests, ORReduceInt256VectorTests, XORReduceInt256VectorTests, negInt256VectorTests and NEGInt256VectorTests under `test/jdk/jdk/incubator/vector` were tested. The total execution time is reduced by ~50.7% on average.
>>
>> # Goals and roadmap
>>
>> Considering code safety and testing, we plan to implement the Vector API step by step, following the C2 vector node types it requires. For example, we will put `AndReductionV, OrReductionV, XorReductionV` into one group, `VectorCastB2X, VectorCastS2X, VectorCastD2X` into another group, and so on, and then submit PRs upstream group by group. To keep the code safe, we will implement the simple vector nodes first, moving from simple to hard, and for the time being avoid modifying other shared code in the process.
>> After RISC-V's missing vectorization nodes are added, we will adjust and announce the next work plan in time. These are our goals and plans, and we welcome suggestions and corrections from the community.
>