RFR: 8297549: RISC-V: Add support for Vector API vector load const operation
Fei Yang
fyang at openjdk.org
Mon Nov 28 01:39:46 UTC 2022
On Fri, 25 Nov 2022 10:20:23 GMT, Vladimir Kempik <vkempik at openjdk.org> wrote:
>> The instruction matched for `VectorLoadConst` generates an index vector that starts at 0 and increases by 1. In detail, it populates the destination vector by setting the first element to 0 and incrementing the value by 1 for each subsequent element.
>>
>> We can add support for `VectorLoadConst` on RISC-V using `vid.v`. The implementation follows RVV v1.0 [1].
>>
>> Tests were performed on QEMU with the parameter `-cpu rv64,v=true,vlen=256,vext_spec=v1.0`. With the parameters `-XX:+PrintAssembly -XX:-TieredCompilation -XX:+LogCompilation -XX:LogFile=compile.log` added when running the micro-benchmark `IndexVectorBenchmark` [2], the compilation log is as follows:
>>
>>
>> 0d2 B7: # out( B12 B8 ) <- in( B11 B6 ) Freq: 1
>> .
>> .
>> .
>> 0ec vloadcon V3 # generate iota indices
>>
>>
>> At the same time, the following assembly code will be generated when running the `intIndexVector` case:
>>
>> 0x00000040144294ac: .4byte 0x10072d7
>> 0x00000040144294b0: .4byte 0x5208a1d7
>>
>> `0x10072d7/0x5208a1d7` are the machine code for `vsetvli/vid.v`. When running the `floatIndexVector` case, there will be one more instruction than `intIndexVector`:
>>
>> 0x000000401443cc9c: .4byte 0x10072d7
>> 0x000000401443cca0: .4byte 0x5208a157
>> 0x000000401443cca4: .4byte 0x4a219157
>>
>> `0x4a219157` is the machine code for `vfcvt.f.x.v`, which is emitted when `is_floating_point_type(bt)` is true:
>>
>> if (is_floating_point_type(bt)) {
>>   __ vfcvt_f_x_v(as_VectorRegister($dst$$reg), as_VectorRegister($dst$$reg));
>> }
>>
>>
>> After we implement these nodes, using `-XX:+UseRVV` reduces the number of assembly instructions by about 50%, because the execution path and the number of loop iterations differ, similar to `AddTest` [3].
>>
>> [1] https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc
>> [2] https://github.com/openjdk/jdk/blob/857b0f9b05bc711f3282a0da85fcff131fffab91/test/micro/org/openjdk/bench/jdk/incubator/vector/IndexVectorBenchmark.java
>> [3] https://github.com/zifeihan/vector-api-test-rvv/blob/master/vector-api-rvv-performance.md
>>
>> Please take a look and review. Thanks a lot.
>>
>> ## Testing:
>>
>> - hotspot and jdk tier1 without new failures (release with UseRVV on QEMU)
>> - test/jdk/jdk/incubator/vector/* (fastdebug/release with UseRVV on QEMU)
>
> src/hotspot/cpu/riscv/riscv_v.ad line 2088:
>
>> 2086: BasicType bt = Matcher::vector_element_basic_type(this);
>> 2087: Assembler::SEW sew = Assembler::elemtype_to_sew(bt);
>> 2088: __ vsetvli(t0, x0, sew);
>
> I heard this opcode (vsetvli) is pretty costly when the parameters of the vector engine get reconfigured (for example, for a different element width). Not saying anything bad here. We might need to think about some optimisations for using vsetvli in the future.
> Hi @VladimirKempik, thanks for the review! Almost every `instruct` in `riscv_v.ad` uses this opcode (vsetvli) at the beginning, and it does look like there is a need for optimization. Maybe we can discuss it more extensively and change it uniformly.
It's interesting to know how native compilers like GCC/LLVM eliminate such redundancies. I guess they face similar issues when doing auto-vectorization for different loops within the same function.
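
To make the idea concrete, here is a minimal sketch of one possible approach: a pass over a single basic block that remembers the last vector configuration and drops a `vsetvli` that requests the same one again. This is only a toy model in Java over instruction strings. The class `VsetvliDedup`, the method `dropRedundant`, and the textual instruction format are hypothetical and are not taken from HotSpot, GCC, or LLVM. It assumes that the `t0` result of each `vsetvli` is unused, that every `vsetvli` requests VLMAX, and that a call may change the configuration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: not HotSpot, GCC, or LLVM code. */
public class VsetvliDedup {

    /**
     * Removes a vsetvli that repeats the configuration already in effect
     * within one basic block. Instructions are plain strings (hypothetical
     * format); two vsetvli instructions are treated as equivalent only if
     * their text is identical.
     */
    public static List<String> dropRedundant(List<String> block) {
        List<String> out = new ArrayList<>();
        String active = null; // configuration is unknown at block entry
        for (String insn : block) {
            if (insn.startsWith("vsetvli ")) {
                if (insn.equals(active)) {
                    continue; // same configuration is already in effect
                }
                active = insn;
            } else if (insn.startsWith("call ")) {
                // Assumption: a callee may reconfigure the vector unit.
                active = null;
            }
            out.add(insn);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> block = List.of(
            "vsetvli t0, x0, e32, m1, tu, mu",
            "vid.v v2",
            "vsetvli t0, x0, e32, m1, tu, mu",
            "vfcvt.f.x.v v2, v2",
            "vsetvli t0, x0, e64, m1, tu, mu",
            "vid.v v3");
        dropRedundant(block).forEach(System.out::println);
    }
}
```

In this example the second `e32` `vsetvli` is dropped, while the `e64` one is kept because the element width changes. Doing the same across basic blocks would need a data-flow analysis over the control-flow graph, which this sketch does not attempt.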
-------------
PR: https://git.openjdk.org/jdk/pull/11344