From shade at redhat.com Thu Sep 8 10:33:50 2022
From: shade at redhat.com (Aleksey Shipilev)
Date: Thu, 8 Sep 2022 12:33:50 +0200
Subject: RVC by default?

Hi,

I was looking at some generated code on RISC-V, and realized that while we have RVC support, we don't enable it by default. On HiFive Unleashed:

$ test-jdk/bin/java -XX:+UnlockExperimentalVMOptions -XX:+PrintFlagsFinal 2>&1 | grep RVC
bool UseRVC = false {ARCH experimental} {default}

Is there a reason not to do RVC by default? Can we reliably poll the RVC capabilities in current hardware?

--
Thanks,
-Aleksey

From vladimir.kempik at gmail.com Thu Sep 8 10:36:54 2022
From: vladimir.kempik at gmail.com (Vladimir Kempik)
Date: Thu, 8 Sep 2022 13:36:54 +0300
Subject: RVC by default?
Message-ID: <0A85EFB1-73C4-46C6-A11A-9439868CB5BD@gmail.com>

Hello

When doing some benchmarks on RISC-V cores, I have observed a slight performance decrease when RVC is enabled (for example, in Renaissance philosophers).

Regards, Vladimir

From yunyao.zxl at alibaba-inc.com Thu Sep 8 12:09:59 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Thu, 08 Sep 2022 20:09:59 +0800
Subject: Re: RVC by default?
Message-ID: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com>

Hi Aleksey and Vladimir,

The current RVC support is okay but not complete: it only covers ~10% of the total instructions emitted (mostly C2 code, including some part of the stub code), and we might want to transform instructions into their compressed counterparts as much as possible, so maybe the design will change from a whitelist mode (the class CompressibleRegion) to a blacklist mode. There is one implementation on my local branch https://github.com/zhengxiaolinX/jdk/commits/REBASE-rvc-beautify (it might not be stable yet; I have not had enough time to test it sufficiently on jtreg, SPECjbb2015 and other benchmarks). There are plans to commit these changes (which cover ~20% of instructions under some tests) after review, but this is currently WIP and waiting for the Loom port to merge first.

And thank you Vladimir for your observations; I will test the Renaissance benchmark you mentioned. I tested SPECjbb2015 some months ago and found a slight performance increase there. As far as I know, compile time will increase, because the transformation logic is extra overhead during the instruction emission phase, such as the code in Assembler::add. Theoretically, when running compiled code with RVC turned on, though IPC and CPI are not changed, the code size shrinks; I think it should have the same effect as the icache becoming larger. Maybe something goes wrong? :-) I might need to look into the performance problem with high priority, so I will test Renaissance first.

Best,
Xiaolin

From vladimir.kempik at gmail.com Thu Sep 8 12:24:05 2022
From: vladimir.kempik at gmail.com (Vladimir Kempik)
Date: Thu, 8 Sep 2022 15:24:05 +0300
Subject: RVC by default?
In-Reply-To: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com>
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com>
Message-ID: <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com>

Hello

To be more specific, I saw a slight perf decrease with RVC only on a core running on FPGA. On T-Head C910 the results (-RVC and +RVC) are on par.

Regards, Vladimir

From yunyao.zxl at alibaba-inc.com Thu Sep 8 12:33:01 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Thu, 08 Sep 2022 20:33:01 +0800
Subject: Re: RVC by default?
In-Reply-To: <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com>
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com>, <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com>
Message-ID: <55ba6296-ba2a-4e38-9d08-4972a162548c.yunyao.zxl@alibaba-inc.com>

Hi Vladimir,

Thank you for the details. But well... I don't have such an FPGA environment. In my view, as a substitute, maybe JMH could help us reflect this. What I am sure of is that compile time would increase with RVC, and I remember it can be seen in SPECjbb2005's warehouse1; but as far as I remember, I didn't observe a decrease in the final score. I will run JMH and Renaissance to catch the decrease then.

Best,
Xiaolin

From yangfei at iscas.ac.cn Wed Sep 14 07:27:04 2022
From: yangfei at iscas.ac.cn (yangfei at iscas.ac.cn)
Date: Wed, 14 Sep 2022 15:27:04 +0800 (GMT+08:00)
Subject: RVC by default?
In-Reply-To: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com>
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com>
Message-ID: <4ec8f3c6.13f8.1833ae4ebc1.Coremail.yangfei@iscas.ac.cn>

Hi Xiaolin,

I am interested in your new proposal for supporting the RVC extension. Can you provide a simple description of how it works and maybe the new interfaces? I guess developers will need to be aware of this when working on this port.

Thanks,
Fei

-----Original Messages-----
From: "Xiaolin Zheng"
Sent Time: 2022-09-08 20:09:59 (Thursday)
To: riscv-port-dev , "Aleksey Shipilev" , "riscv-port-dev at openjdk.org" , "Vladimir Kempik"
Cc:
Subject: Re: RVC by default?
Hi Aleksey and Vladimir,

The current RVC support is okay but not complete: it only covers ~10% of total instructions emitted (mostly C2 code, including some part of Stub code), and we might want to transform instructions into the compressed counterparts as much as possible, so maybe the design will change from a whitelist mode (the class CompressibleRegion) to a black list mode. There is one implementation at my local branch https://github.com/zhengxiaolinX/jdk/commits/REBASE-rvc-beautify (might not be stable yet, I have not gotten enough time to give it a sufficient test on jtregs and specjbb2015/other benchmarks yet). There are plans reserved to commit them (which cover ~20% of instructions under some tests) after reviewing, but this is currently WIP and waiting loom port to merge first.

From yunyao.zxl at alibaba-inc.com Thu Sep 15 02:52:59 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Thu, 15 Sep 2022 10:52:59 +0800
Subject: Discuss the RVC implementation
Message-ID: <2d7bbad2-7ade-4b38-91b5-12c4c0a91602.yunyao.zxl@alibaba-inc.com>

Hi team,

I am going to describe a different implementation of RVC for our backend.

## Background

The RISC-V C extension, also known as RVC, can transform 4-byte instructions into 2-byte counterparts when eligible (for example, per the manual, one common requirement is that an instruction's Rd/Rs fall in the range [x8,x15]).

## The current implementation in HotSpot

The current implementation[0] is a transient one. It introduces a "CompressibleRegion", using RAII[1], to indicate that instructions inside these regions can be safely substituted by their RVC counterparts, if convertible. In other words, it is a "whitelist mode": the "CompressibleRegion" is used to manually mark out safe regions, and the instructions inside them are emitted in compressed form where possible.
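A minimal sketch of the "whitelist mode" described above, assuming a toy assembler that only counts emitted bytes. `ToyAssembler`, `is_rvc_reg` and `whitelist_demo` are hypothetical names used for illustration and are not HotSpot's code; the `CompressibleRegion` here only mimics the scoped-guard idea. The one encoding rule used, that `c.sub` needs the destination to equal the first source and both registers to be in x8..x15, comes from the RVC specification.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toy assembler: it records only how many bytes were emitted.
struct ToyAssembler {
  bool compressible = false;   // whitelist mode: off unless a region enables it
  std::size_t code_size = 0;

  // c.sub can only name registers x8..x15.
  static bool is_rvc_reg(int reg) { return reg >= 8 && reg <= 15; }

  // sub rd, rs1, rs2. The 2-byte c.sub form exists only when rd == rs1
  // and both rd and rs2 are in x8..x15.
  void sub(int rd, int rs1, int rs2) {
    bool eligible = (rd == rs1) && is_rvc_reg(rd) && is_rvc_reg(rs2);
    code_size += (compressible && eligible) ? 2 : 4;
  }
};

// Scoped guard in the spirit of "CompressibleRegion": instructions emitted
// during its lifetime may use their 2-byte forms; the old state is restored
// when the guard goes out of scope.
class CompressibleRegion {
 public:
  explicit CompressibleRegion(ToyAssembler& masm)
      : _masm(masm), _saved(masm.compressible) {
    _masm.compressible = true;
  }
  ~CompressibleRegion() { _masm.compressible = _saved; }

 private:
  ToyAssembler& _masm;
  bool _saved;
};

// Emits three subs; only the eligible one inside the region shrinks.
inline std::size_t whitelist_demo() {
  ToyAssembler masm;
  masm.sub(8, 8, 9);             // outside any region: 4 bytes
  {
    CompressibleRegion cr(masm);
    masm.sub(8, 8, 9);           // eligible and inside the region: 2 bytes
    masm.sub(5, 5, 9);           // x5 is outside x8..x15: 4 bytes
  }
  return masm.code_size;         // 4 + 2 + 4
}
```

The point of the sketch is that nothing shrinks unless a programmer has wrapped it in a region by hand, which is why coverage depends on how many regions get added.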
However, after a deeper look, the current "whitelist mode" turns out to have several shortcomings:

## Shortcomings of the current implementation

1. Coverage: The current implementation covers only some of the C2 match rules and a small part of the stub code, so there is obviously far more room to reduce the total code size. In my observations, some RISC-V instruction sequences generally occupy a bit more space than the AArch64 ones[2]. With the new implementation, we could reach a code size level similar to AArch64's generated code: some sequences better, some still worse than the AArch64 ones, in my simple observation.

2. Maintainability: Though safe, I'd say it is not at all easy to maintain. The background is that most patchable instructions cannot easily be transformed into their shorter counterparts[3], so they must be prevented from being compressed. So here comes the question: we must make sure no patchable relocation is inside the range of a "CompressibleRegion". For example, the string comparison intrinsic function[4] looks very delicious: transforming it and its siblings may result in a yummy compression rate. But programmers might have to check lots of its callees to find out whether just one patchable relocation is hidden inside, which would make the whole intrinsic incompressible. This could put an extra burden on programmers, so I bet no one would like to add a "CompressibleRegion" to his/her code :-)

3. Performance: Better performance of the generated code is a small side effect this extension gives us: the less I$ space the code occupies, the better the performance. Please see Andrew Waterman's paper[5] for more reference. Anyway, it looks like a higher overall compression rate is better for performance.

The main issue here is that the granularity of "CompressibleRegion" is a bit coarse. "Why not exclude the incompressible parts instead?" is a question that comes up naturally.
And after some digging, we find that we only need to exclude a countable number of places that would be patched later (mostly relocations), plus several code slices whose fixed length is calculated in advance, such as "emit_static_call_stub". All remaining instructions can be safely transformed into their RVC counterparts if eligible. So maybe, say, a "blacklist mode"?

## The new implementation

To implement the "blacklist mode" in the backend, we need two things:

1. an "IncompressibleRegion", indicating that instructions inside it should remain in their normal 4-byte form no matter what happens.
2. a simple strategy to exclude patchable instructions, mainly for relocations.

So the new strategy is tightly bound to the positions of relocations. The "relocate()" in the HotSpot VM is a mark that has only an explicit "start point" and no end point, and some of the marked instructions could be patched later. Therefore, we can use a simple strategy: introduce a lambda as another argument to give relocations "end point" semantics, meeting our requirements without extra cost. For example:

Originally:
```
__ relocate(safepoint_pc.rspec());
__ la(t0, safepoint_pc.target());
__ sd(t0, Address(xthread, JavaThread::saved_exception_pc_offset()));
```
After introducing a simple lambda as an extra argument:
```
__ relocate(safepoint_pc.rspec(), [&] {  // The relocate() hides an "IncompressibleRegion" in it
  __ la(t0, safepoint_pc.target());      // This patchable instruction sequence is incompressible
});
__ sd(t0, Address(xthread, JavaThread::saved_exception_pc_offset()));
```
Well, simple but effective. With such a countable number of dynamically patchable places excluded and all relocations unified, all other instructions can be safely transformed without messing up the current code style. Programmers can keep the same style; most of the time they do not need to care whether RVC exists, and things get converted automatically.
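The "blacklist mode" with a lambda-taking relocate() can be sketched as follows. This is a toy model under stated assumptions, not the code on the proposed branch: `ToyAssembler`, `emit`, `reloc_marks` and `blacklist_demo` are hypothetical names, the assembler only counts bytes, and `relocate()` here takes just the callback rather than a real relocation spec.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy assembler: compression is on by default (blacklist mode).
struct ToyAssembler {
  bool use_rvc = true;
  bool incompressible = false;            // true while inside a guarded region
  std::size_t code_size = 0;
  std::vector<std::size_t> reloc_marks;   // offsets where relocations start

  // Scoped guard in the spirit of "IncompressibleRegion": everything emitted
  // during its lifetime keeps the normal 4-byte form.
  class IncompressibleRegion {
   public:
    explicit IncompressibleRegion(ToyAssembler& masm)
        : _masm(masm), _saved(masm.incompressible) {
      _masm.incompressible = true;
    }
    ~IncompressibleRegion() { _masm.incompressible = _saved; }

   private:
    ToyAssembler& _masm;
    bool _saved;
  };

  // Emit one instruction; has_short_form says whether a 2-byte encoding exists.
  void emit(bool has_short_form) {
    bool compress = use_rvc && !incompressible && has_short_form;
    code_size += compress ? 2 : 4;
  }

  // relocate() with an "end point": record the mark, then run the callback
  // with compression disabled. The region ends when the callback returns.
  template <typename Callback>
  void relocate(Callback emit_patchable) {
    reloc_marks.push_back(code_size);
    IncompressibleRegion ir(*this);
    emit_patchable();
  }
};

// Two patchable instructions inside relocate(), two ordinary ones after it.
inline std::size_t blacklist_demo() {
  ToyAssembler masm;
  masm.relocate([&] {
    masm.emit(true);    // patchable: stays 4 bytes although a short form exists
    masm.emit(true);    // patchable: 4 bytes
  });
  masm.emit(true);      // outside the lambda: compressed to 2 bytes
  masm.emit(false);     // no short form exists: 4 bytes
  return masm.code_size;  // 4 + 4 + 2 + 4
}
```

Compared with a whitelist, the call site has to mark only the patchable sequence; everything else is compressed without the programmer asking for it.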
The proposed new sample code is, again, here[6].

## Other things worth noting

1. Instruction patching issues

With the C extension, the backend mixes 2-byte and 4-byte instructions, which makes it a little CISC-like. HotSpot patches instructions while code is running at full speed, such as call instructions and the nops used for deoptimization (the nops at the entry points, and the post-call nops after Loom). Instruction patching is delicate, so we must handle such places carefully to keep these 4-byte instructions from spanning cache lines. Though they remain in the normal 4-byte form even with RVC, they might sit at a boundary that is only 2-byte aligned. Such cases must not happen, because patching an instruction that spans cache lines would lose atomicity. So, in short, we must align them properly, as in [7][8]. This problem can exist with RVC in either "whitelist mode" or "blacklist mode"; it is a general problem of instruction patching. In future patches I will add stronger assertions at the potential places (trampoline_call might be a very good spot, for patchable "static_call", "opt_virtual" and "virtual" relocations) to check alignment.

2. MachBranch nodes

MachBranch nodes are not easy to tame, because the "fake label"[9] in PhaseOutput::scratch_emit_size() cannot tell us the real distance to the label. But we can leave them out of this discussion, for there will be patches to handle them afterward.

That's nearly all. Thanks for reading this far despite the verbosity. It would be very nice to receive any suggestions.

Best,
Xiaolin

[0] Original patch: https://github.com/openjdk/riscv-port/pull/34
[1] Of course, the "CompressibleRegion" is good, I like it; and this idea is not my own.
[2] For a simple example, the very commonly used fixed-length movptr() takes six 4-byte instructions (lui+addi+slli+addi+slli+addi, MIPS-like instructions using arithmetic with sign extension, but not anyone's fault :-) ), while the AArch64 counterpart takes only three 4-byte instructions (movz+movk+movk). Both move a 48-bit immediate. Accumulated over the whole code, the size differs quite a lot.
[3] 2-byte instructions have fewer bits, hence shorter immediate encodings etc. compared to their 4-byte counterparts. If we transform patchable instructions (ones at the marks of patchable relocations, etc.) into 2-byte ones, then when they are later patched to a larger value or a farther distance, the shorter instructions may sadly be unable to hold the newly patched value. So we need to exclude patchable instructions (at the relocation marks etc.) from being compressed.
[4] https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/riscv.ad#L10032-L10035
[5] https://digitalassets.lib.berkeley.edu/etd/ucb/text/Waterman_berkeley_0028E_15908.pdf , Page 64: "5.4 The RVC Extension, Performance Implications"
[6] https://github.com/zhengxiaolinX/jdk/tree/REBASE-rvc-beautify
[7] https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/riscv.ad#L9873
[8] https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp#L1348-L1353
[9] https://github.com/openjdk/jdk/blob/211fab8d361822bbd1a34a88626853bf4a029af5/src/hotspot/share/opto/output.cpp#L3331-L3340
From yunyao.zxl at alibaba-inc.com Thu Sep 15 02:57:19 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Thu, 15 Sep 2022 10:57:19 +0800
Subject: Re: Re: RVC by default?
In-Reply-To: <4ec8f3c6.13f8.1833ae4ebc1.Coremail.yangfei@iscas.ac.cn>
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com>, <4ec8f3c6.13f8.1833ae4ebc1.Coremail.yangfei@iscas.ac.cn>
Message-ID: <92ac9a5a-e572-434c-b0a8-e1c1460813a3.yunyao.zxl@alibaba-inc.com>

Hi Felix,

Thank you for mentioning this. In short, I have written something about it in a new thread, and we can discuss it there: https://mail.openjdk.org/pipermail/riscv-port-dev/2022-September/000615.html . The formatting generated by my mailbox seems to be a bit confused. Please forgive it.

Best,
Xiaolin

From yunyao.zxl at alibaba-inc.com Thu Sep 15 03:25:58 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Thu, 15 Sep 2022 11:25:58 +0800
Subject: Re: RVC by default?
In-Reply-To: <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com>
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com>, <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com>
Message-ID: <2f3de868-7c71-40f3-a73f-478e35eb6a68.yunyao.zxl@alibaba-inc.com>

Hi Vladimir,

There are some minor updates on the Renaissance philosophers benchmark discussed a few days ago: I have tested philosophers on my Unmatched board, and found that the test itself seems unstable, even in its JMH version. I gave the JMH version an exclusive two-day run, but the score fluctuates around 13000 ms/op (iterations = 30 by default), even when RVC is not turned on. Have you encountered the same issue?

+ /home/ubuntu/yunyao/jdk19-release/bin/java -XX:+UnlockExperimentalVMOptions -XX:-UseRVC -jar renaissance-jmh-0.14.1.jar org.renaissance.scala.stm.JmhPhilosophers.runOperation
JmhPhilosophers.runOperation ss 40 14307.472 ± 656.456 ms/op
JmhPhilosophers.runOperation ss 40 13175.640 ± 303.038 ms/op
JmhPhilosophers.runOperation ss 40 13474.124 ± 349.349 ms/op
JmhPhilosophers.runOperation ss 40 13545.786 ± 327.735 ms/op
JmhPhilosophers.runOperation ss 40 13085.097 ± 306.891 ms/op
JmhPhilosophers.runOperation ss 40 12880.270 ± 265.028 ms/op
JmhPhilosophers.runOperation ss 40 13232.006 ± 209.613 ms/op
JmhPhilosophers.runOperation ss 40 13334.098 ± 443.757 ms/op
JmhPhilosophers.runOperation ss 40 13168.990 ± 575.965 ms/op
JmhPhilosophers.runOperation ss 40 13424.250 ± 381.084 ms/op
JmhPhilosophers.runOperation ss 40 13655.426 ± 428.624 ms/op
JmhPhilosophers.runOperation ss 40 14430.485 ± 488.797 ms/op
JmhPhilosophers.runOperation ss 40 13999.061 ± 359.320 ms/op
JmhPhilosophers.runOperation ss 40 13623.308 ± 531.513 ms/op
JmhPhilosophers.runOperation ss 40 13757.331 ± 373.905 ms/op

+ /home/ubuntu/yunyao/jdk19-release/bin/java -XX:+UnlockExperimentalVMOptions -XX:+UseRVC -jar renaissance-jmh-0.14.1.jar org.renaissance.scala.stm.JmhPhilosophers.runOperation
JmhPhilosophers.runOperation ss 40 12772.517 ± 227.409 ms/op
JmhPhilosophers.runOperation ss 40 13456.228 ± 498.724 ms/op
JmhPhilosophers.runOperation ss 40 13727.211 ± 476.491 ms/op
JmhPhilosophers.runOperation ss 40 13122.838 ± 246.673 ms/op
JmhPhilosophers.runOperation ss 40 13082.768 ± 405.194 ms/op
JmhPhilosophers.runOperation ss 40 13905.753 ± 456.474 ms/op
JmhPhilosophers.runOperation ss 40 13503.479 ± 351.191 ms/op
JmhPhilosophers.runOperation ss 40 13365.138 ± 380.285 ms/op
JmhPhilosophers.runOperation ss 40 13842.509 ± 487.629 ms/op
JmhPhilosophers.runOperation ss 40 13965.286 ± 330.423 ms/op
JmhPhilosophers.runOperation ss 40 13615.975 ± 352.590 ms/op
JmhPhilosophers.runOperation ss 40 13564.777 ± 452.947 ms/op
JmhPhilosophers.runOperation ss 40 13720.022 ± 519.965 ms/op
JmhPhilosophers.runOperation ss 40 14033.287 ± 404.377 ms/op
JmhPhilosophers.runOperation ss 40 13680.432 ± 539.549 ms/op

The noise here is rather large; I was wondering if it's stable on the FPGA?

Maybe I need to find some more stable tests anyway.

Best,
Xiaolin

From vladimir.kempik at gmail.com Thu Sep 15 12:33:41 2022
From: vladimir.kempik at gmail.com (Vladimir Kempik)
Date: Thu, 15 Sep 2022 15:33:41 +0300
Subject: RVC by default?
In-Reply-To: <2f3de868-7c71-40f3-a73f-478e35eb6a68.yunyao.zxl@alibaba-inc.com>
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com> <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com> <2f3de868-7c71-40f3-a73f-478e35eb6a68.yunyao.zxl@alibaba-inc.com>
Message-ID: <2FFEBB14-77AA-41AC-89B3-89F66607B66D@gmail.com>

Hello

Yes, it's slightly unstable, even on FPGA. I have found I can compare results only from two consecutive runs (e.g. first run without RVC, second run with RVC), then take some average result from iterations 5-15, removing some too-slow results.

I think your results show no perf gain from RVC; that's expected, as RVC gives no perf improvement for opcodes, only requiring less i-cache space.

Another interesting moment with RVC: I see some JDK failure only when RVC is enabled and only on FPGA.
(on philosophers test)
It's very strange, I will try to debug it and file a bug in JBS if it turns out to be a real jdk bug (or this could easily be a fpga "core" issue)

Regards, Vladimir

# Native memory allocation (malloc) failed to allocate 4352974235792 bytes for Chunk::new
# Out of Memory Error (arena.cpp:184), pid=5722, tid=5723
Stack: [0x0000003f83111000,0x0000003f83311000], sp=0x0000003f8330e2e0, free space=2036k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0xa6c064] VMError::report_and_die(int, char const*, char const*, void*, Thread*, unsigned char*, void*, void*, char const*, int, unsigned long)+0x16a
V [libjvm.so+0xa6ca9e] VMError::report_and_die(Thread*, char const*, int, unsigned long, VMErrorType, char const*, void*)+0x28
V [libjvm.so+0x3ff306] report_vm_out_of_memory(char const*, int, unsigned long, VMErrorType, char const*, ...)+0x6a
V [libjvm.so+0x2603de] Chunk::operator new(unsigned long, AllocFailStrategy::AllocFailEnum, unsigned long)+0x108
V [libjvm.so+0x260cf2] Arena::grow(unsigned long, AllocFailStrategy::AllocFailEnum)+0x36
V [libjvm.so+0x8d7392] AdapterHandlerLibrary::create_adapter(AdapterBlob*&, int, BasicType*, bool)+0x39e
V [libjvm.so+0x8dcb7e] AdapterHandlerLibrary::get_adapter(methodHandle const&)+0x41e
J 5 c1 java.util.ImmutableCollections$SetN.probe(Ljava/lang/Object;)I java.base (56 bytes) @ 0x0000003f696e5858 [0x0000003f696e5700+0x0000000000000158]
j java.util.ImmutableCollections$SetN.<init>([Ljava/lang/Object;)V+35 java.base
j java.util.Set.of([Ljava/lang/Object;)Ljava/util/Set;+64 java.base
j jdk.internal.module.SystemModules$default.moduleDescriptors()[Ljava/lang/module/ModuleDescriptor;+3619 java.base
j jdk.internal.module.SystemModuleFinders.of(Ljdk/internal/module/SystemModules;)Ljava/lang/module/ModuleFinder;+1 java.base
j jdk.internal.module.ModuleBootstrap.boot2()Ljava/lang/ModuleLayer;+240 java.base
j jdk.internal.module.ModuleBootstrap.boot()Ljava/lang/ModuleLayer;+64 java.base
j java.lang.System.initPhase2(ZZ)I+0 java.base
v ~StubRoutines::call_stub 0x0000003f70c1c49c
V [libjvm.so+0x5b790c] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x1d6
V [libjvm.so+0x5b7b68] JavaCalls::call_static(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, JavaThread*)+0xe8
V [libjvm.so+0xa0281c] Threads::create_vm(JavaVMInitArgs*, bool*)+0x63c
V [libjvm.so+0x647656] JNI_CreateJavaVM+0x6a
C [libjli.so+0x3658] JavaMain+0x7a
C [libjli.so+0x670e] ThreadJavaMain+0xc

> On 15 Sep 2022, at 06:25, Xiaolin Zheng wrote:
>
> Hi Vladimir,
>
> There are some minor updates for the philosophers in Renaissance discussed days before: I have tested the philosophers on my Unmatched board, and found the test itself seems not stable, even if the JMH version. I gave its JMH version a two-day long run, exclusively, but the score varies in the 13000 ms/op range (iterations = 30 by default), even if RVC doesn't get turned on. Have you encountered the same issue?
>
> + /home/ubuntu/yunyao/jdk19-release/bin/java -XX:+UnlockExperimentalVMOptions -XX:-UseRVC -jar renaissance-jmh-0.14.1.jar org.renaissance.scala.stm.JmhPhilosophers.runOperation
> JmhPhilosophers.runOperation ss 40 14307.472 ± 656.456 ms/op
> JmhPhilosophers.runOperation ss 40 13175.640 ± 303.038 ms/op
> JmhPhilosophers.runOperation ss 40 13474.124 ± 349.349 ms/op
> JmhPhilosophers.runOperation ss 40 13545.786 ± 327.735 ms/op
> JmhPhilosophers.runOperation ss 40 13085.097 ± 306.891 ms/op
> JmhPhilosophers.runOperation ss 40 12880.270 ± 265.028 ms/op
> JmhPhilosophers.runOperation ss 40 13232.006 ± 209.613 ms/op
> JmhPhilosophers.runOperation ss 40 13334.098 ± 443.757 ms/op
> JmhPhilosophers.runOperation ss 40 13168.990 ± 575.965 ms/op
> JmhPhilosophers.runOperation ss 40 13424.250 ± 381.084 ms/op
> JmhPhilosophers.runOperation ss 40 13655.426 ± 428.624 ms/op
> JmhPhilosophers.runOperation ss 40 14430.485 ± 488.797 ms/op
> JmhPhilosophers.runOperation ss 40 13999.061 ± 359.320 ms/op
> JmhPhilosophers.runOperation ss 40 13623.308 ± 531.513 ms/op
> JmhPhilosophers.runOperation ss 40 13757.331 ± 373.905 ms/op
>
> + /home/ubuntu/yunyao/jdk19-release/bin/java -XX:+UnlockExperimentalVMOptions -XX:+UseRVC -jar renaissance-jmh-0.14.1.jar org.renaissance.scala.stm.JmhPhilosophers.runOperation
> JmhPhilosophers.runOperation ss 40 12772.517 ± 227.409 ms/op
> JmhPhilosophers.runOperation ss 40 13456.228 ± 498.724 ms/op
> JmhPhilosophers.runOperation ss 40 13727.211 ± 476.491 ms/op
> JmhPhilosophers.runOperation ss 40 13122.838 ± 246.673 ms/op
> JmhPhilosophers.runOperation ss 40 13082.768 ± 405.194 ms/op
> JmhPhilosophers.runOperation ss 40 13905.753 ± 456.474 ms/op
> JmhPhilosophers.runOperation ss 40 13503.479 ± 351.191 ms/op
> JmhPhilosophers.runOperation ss 40 13365.138 ± 380.285 ms/op
> JmhPhilosophers.runOperation ss 40 13842.509 ± 487.629 ms/op
> JmhPhilosophers.runOperation ss 40 13965.286 ± 330.423 ms/op
> JmhPhilosophers.runOperation ss 40 13615.975 ± 352.590 ms/op
> JmhPhilosophers.runOperation ss 40 13564.777 ± 452.947 ms/op
> JmhPhilosophers.runOperation ss 40 13720.022 ± 519.965 ms/op
> JmhPhilosophers.runOperation ss 40 14033.287 ± 404.377 ms/op
> JmhPhilosophers.runOperation ss 40 13680.432 ± 539.549 ms/op
>
> The noise here is a little big; I was wondering if it's stable on the FPGA?
>
> Maybe I need to find some more stable tests anyway.
>
> Best,
> Xiaolin
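One way to check the "no perf gain" reading of the JMH scores quoted above is to compare the mean of each group against the run-to-run spread. The following is only a rough sketch: it treats each quoted row as one independent sample (an assumption, the thread does not say how the rows were produced), and the names `rvc_off`, `rvc_on` and `summarize` are made up for illustration.

```python
from statistics import mean, stdev

# Scores in ms/op, copied from the two quoted JMH runs.
rvc_off = [14307.472, 13175.640, 13474.124, 13545.786, 13085.097,
           12880.270, 13232.006, 13334.098, 13168.990, 13424.250,
           13655.426, 14430.485, 13999.061, 13623.308, 13757.331]
rvc_on = [12772.517, 13456.228, 13727.211, 13122.838, 13082.768,
          13905.753, 13503.479, 13365.138, 13842.509, 13965.286,
          13615.975, 13564.777, 13720.022, 14033.287, 13680.432]


def summarize(scores):
    """Return (mean, sample standard deviation) of a list of scores."""
    return mean(scores), stdev(scores)


mean_off, sd_off = summarize(rvc_off)
mean_on, sd_on = summarize(rvc_on)
delta = mean_on - mean_off  # positive means +UseRVC was slower on average

print(f"-UseRVC: mean {mean_off:.1f} ms/op, spread {sd_off:.1f}")
print(f"+UseRVC: mean {mean_on:.1f} ms/op, spread {sd_on:.1f}")
print(f"difference of means: {delta:+.1f} ms/op ({100 * delta / mean_off:+.2f}%)")
```

On these numbers the difference between the two means is much smaller than the spread inside either group, so the data neither shows a gain nor a regression from RVC.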

From yunyao.zxl at alibaba-inc.com Thu Sep 15 13:01:24 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Thu, 15 Sep 2022 21:01:24 +0800
Subject: Re: RVC by default?
In-Reply-To: <2FFEBB14-77AA-41AC-89B3-89F66607B66D@gmail.com>
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com> <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com> <2f3de868-7c71-40f3-a73f-478e35eb6a68.yunyao.zxl@alibaba-inc.com>, <2FFEBB14-77AA-41AC-89B3-89F66607B66D@gmail.com>
Message-ID:

Hi Vladimir,

Thank you for the information. RVC's performance gain is more of a side effect, and it seems that the larger the icache, the smaller the gain. Besides, the current RVC implementation in the backend is only a basic one, covering some of the C2 match rules, and is far from complete. So I would not expect to observe a performance gain with the current RVC implementation, but we should also not observe regressions in the generated code. So of course I agree with your analysis.

The second one seems interesting as well. Weird: it looks like an ordinary native out-of-memory error, so shouldn't turning off RVC reveal the same issue, I guess? I will wait for the JBS issue and, in the meantime, tune some JVM options to simulate that case and see if I can reproduce it.

Best,
Xiaolin

From vladimir.kempik at gmail.com Thu Sep 15 15:25:06 2022
From: vladimir.kempik at gmail.com (Vladimir Kempik)
Date: Thu, 15 Sep 2022 18:25:06 +0300
Subject: RVC by default?
In-Reply-To:
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com> <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com> <2f3de868-7c71-40f3-a73f-478e35eb6a68.yunyao.zxl@alibaba-inc.com> <2FFEBB14-77AA-41AC-89B3-89F66607B66D@gmail.com> <30868F59-B23D-438B-BF60-9C124520BC15@gmail.com>
Message-ID: <8A383140-8BA7-41CF-9822-1A7933EA6212@gmail.com>

Hello

Looks pretty similar to me. For me it was a vanilla recent jdk19.
But later, when I backported the following patches to my jdk19 branch, the issue became different (the Arena alloc issue I reported earlier):

8290496: riscv: Fix build warnings-as-errors with GCC 11
8290280: riscv: Clean up stack and register handling in interpreter
8290137: riscv: small refactoring for add_memory_int32/64
8290164: compiler/runtime/TestConstantsInError.java fails on riscv
8291952: riscv: Remove PRAGMA_NONNULL_IGNORED
8291947: riscv: fail to build after JDK-8290840
8291893: riscv: remove fence.i used in user space Backport-of:...
8292713: Unsafe.allocateInstance should be intrinsified without UseUnalignedAccesses
8292867: RISC-V: Simplify weak CAS return value handling
8292407: Improve Weak CAS VarHandle/Unsafe tests resilience under spurious failures
8293100: RISC-V: Need to save and restore callee-saved FloatRegisters in...
8293050: RISC-V: Remove redundant non-null assertions about macro-assembler
8293474: RISC-V: Unify the way of moving function pointer
8293524: RISC-V: Use macro-assembler functions as appropriate
8293566: RISC-V: Clean up push and pop registers

I'm going to bisect this list and find what changed the behaviour.
The workaround says to update to Ubuntu 21.04, but it's not clear whether that means updating the runtime environment or the build environment.
For me the runtime is Ubuntu 22.04, but I build the JDK with a sysroot of Ubuntu 20.04 (for better compatibility) and GCC 11.2.

Regards, Vladimir

> On 15 Sep 2022, at 18:17, Xiaolin Zheng wrote:
>
> Hi Vladimir,
>
> The mailing list says my e-mail exceeds 40KB so I get rejected. But I want to send it out anyway before getting off today's work. So here is a work around:
> Please check:
> https://gist.github.com/zhengxiaolinX/25c32853690f7ac1c125d2fe1da19710
>
> Looking forward to your opinions.
>
> Best,
> Xiaolin

From yunyao.zxl at alibaba-inc.com Thu Sep 15 15:37:06 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Thu, 15 Sep 2022 23:37:06 +0800
Subject: Re: RVC by default?
In-Reply-To: <8A383140-8BA7-41CF-9822-1A7933EA6212@gmail.com>
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com> <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com> <2f3de868-7c71-40f3-a73f-478e35eb6a68.yunyao.zxl@alibaba-inc.com> <2FFEBB14-77AA-41AC-89B3-89F66607B66D@gmail.com> <30868F59-B23D-438B-BF60-9C124520BC15@gmail.com> , <8A383140-8BA7-41CF-9822-1A7933EA6212@gmail.com>
Message-ID: <4c5c9eca-cc5a-44a5-b9d2-087f51355cc0.yunyao.zxl@alibaba-inc.com>

Hi Vladimir,

Haha, thanks for the confirmation. Though I didn't manage to send it to riscv-port-dev in the end, I began to realize that I may have sent several e-mails to you and Aleksey. Very sorry for bothering you; please forgive me.

At that time, all I knew was that they updated the OS to 21.04 and the problem went away. I will confirm with them tomorrow.

Best,
Xiaolin

From vladimir.kempik at gmail.com Thu Sep 15 14:17:41 2022
From: vladimir.kempik at gmail.com (Vladimir Kempik)
Date: Thu, 15 Sep 2022 17:17:41 +0300
Subject: RVC by default?
In-Reply-To:
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com> <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com> <2f3de868-7c71-40f3-a73f-478e35eb6a68.yunyao.zxl@alibaba-inc.com> <2FFEBB14-77AA-41AC-89B3-89F66607B66D@gmail.com>
Message-ID: <30868F59-B23D-438B-BF60-9C124520BC15@gmail.com>

The main thing there is the amount of memory to be allocated; clearly it's not normal.
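For a sense of scale, the size in the Chunk::new report can be converted to binary units and compared with an address from the same log. A minimal sketch; the constant and function names are illustrative, and only the two numbers come from the hs_err output quoted earlier in the thread.

```python
# Byte count copied from the hs_err report:
# "failed to allocate 4352974235792 bytes for Chunk::new"
REQUESTED_BYTES = 4352974235792

# Upper bound of the crashing thread's stack from the same report, used only
# as a rough yardstick for how large addresses are in this process.
STACK_TOP = 0x0000003f83311000

TIB = 1024 ** 4


def to_tib(n_bytes: int) -> float:
    """Convert a byte count to tebibytes."""
    return n_bytes / TIB


print(f"requested: {to_tib(REQUESTED_BYTES):.2f} TiB ({hex(REQUESTED_BYTES)})")
print(f"ratio to stack top address: {REQUESTED_BYTES / STACK_TOP:.1f}x")
```

The request comes out at roughly 4 TiB, which is larger than any address that appears in the log, so it looks like a corrupted size rather than a genuine allocation.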
Here is another bug with RVC, which again happens only on FPGA, but it is stable and I can see wrong code generation; it could be interesting to you:

# SIGILL (0x4) at pc=0x0000003f94507700, pid=363, tid=364
Stack: [0x0000003f939b0000,0x0000003f93bb0000], sp=0x0000003f93bad9a0, free space=2038k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x8dc700] SharedRuntime::look_for_reserved_stack_annotated_method(JavaThread*, frame)+0x37c
J 2 c1 java.lang.String.hashCode()I java.base (60 bytes) @ 0x0000003f7d000874 [0x0000003f7d000580+0x00000000000002f4]
j java.util.ImmutableCollections$SetN.probe(Ljava/lang/Object;)I+1 java.base
j java.util.ImmutableCollections$SetN.<init>([Ljava/lang/Object;)V+35 java.base
j java.util.Set.of([Ljava/lang/Object;)Ljava/util/Set;+64 java.base
j jdk.internal.module.SystemModules$default.moduleDescriptors()[Ljava/lang/module/ModuleDescriptor;+1890 java.base
j jdk.internal.module.SystemModuleFinders.of(Ljdk/internal/module/SystemModules;)Ljava/lang/module/ModuleFinder;+1 java.base
j jdk.internal.module.ModuleBootstrap.boot2()Ljava/lang/ModuleLayer;+240 java.base
j jdk.internal.module.ModuleBootstrap.boot()Ljava/lang/ModuleLayer;+64 java.base
j java.lang.System.initPhase2(ZZ)I+0 java.base
v ~StubRoutines::call_stub 0x0000003f8453849c
V [libjvm.so+0x5b8018] JavaCalls::call_helper(JavaValue*, methodHandle const&, JavaCallArguments*, JavaThread*)+0x1d6
V [libjvm.so+0x5b8274] JavaCalls::call_static(JavaValue*, Klass*, Symbol*, Symbol*, JavaCallArguments*, JavaThread*)+0xe8
V [libjvm.so+0xa0320c] Threads::create_vm(JavaVMInitArgs*, bool*)+0x63c
V [libjvm.so+0x647d6c] JNI_CreateJavaVM+0x6a
C [libjli.so+0x3658] JavaMain+0x7a
C [libjli.so+0x670e] ThreadJavaMain+0xc
C [libc.so.6+0x675a6]
C [libc.so.6+0xb3a02]

pc =0x0000003f94507700
x1(ra) =0x0000003f8458af2a is at code_begin+170 in
[CodeBlob (0x0000003f8458ae10)]
Framesize: 62
Runtime Stub (0x0000003f8458ae10): resolve_static_call

--------------------------------------------------------------------------------
Decoding CodeBlob, name: resolve_static_call, at [0x0000003f8458ae80, 0x0000003f8458b078] 504 bytes
0x0000003f8458ae80: addi sp,sp,-16
0x0000003f8458ae84: sd ra,8(sp)
0x0000003f8458ae88: sd s0,0(sp)
0x0000003f8458ae8c: addi s0,sp,16
0x0000003f8458ae90: addi sp,sp,-224
0x0000003f8458ae92: sd t0,8(sp)
...
0x0000003f8458af08: fsd ft11,248(sp)
0x0000003f8458af0a: auipc t0,0x0
0x0000003f8458af0e: addi t0,t0,32 # 0x0000003f8458af2a
0x0000003f8458af12: sd t0,712(s7)
0x0000003f8458af16: mv t0,sp
0x0000003f8458af1a: sd t0,704(s7)
0x0000003f8458af1e: mv a0,s7
0x0000003f8458af22: auipc t0,0xff7c
0x0000003f8458af26: jalr 2014(t0) # 0x0000003f94507700 = SharedRuntime::look_for_reserved_stack_annotated_method(JavaThread*, frame)+892
0x0000003f8458af2a: sd zero,704(s7)
0x0000003f8458af2e: sd zero,712(s7)
0x0000003f8458af32: ld t0,8(s7)
0x0000003f8458af36: bnez t0,0x0000003f8458afd8
0x0000003f8458af3a: ld t6,800(s7)
0x0000003f8458af3e: sd zero,800(s7)
0x0000003f8458af42: sd t6,472(sp)
0x0000003f8458af46: sd a0,264(sp)

The interesting line here is

0x0000003f8458af26: jalr 2014(t0) # 0x0000003f94507700 = SharedRuntime::look_for_reserved_stack_annotated_method(JavaThread*, frame)+892

The +892 offset is 0x37c in hex; it's exactly our crash site as the backtrace says (aka pc = 0x0000003f94507700). Let's check what's there:

0x0000003f945076e0: c9 02 23 34 d9 02 23 38 f9 02 ef 60 30 ff 7d 77
0x0000003f945076f0: 93 07 87 70 a2 97 90 63 aa 85 17 65 28 00 13 05
0x0000003f94507700: 65 9e 97 30 b2 ff e7 80 20 05 7d 77 13 07 87 70
0x0000003f94507710: 17 ac 36 00 03 3c 0c 5e 22 97 18 63 83 47 1c 1f

0x3f945076e0: 02c9 addi t0,t0,18
0x3f945076e2: 02d93423 sd a3,40(s2)
0x3f945076e6: 02f93823 sd a5,48(s2)
0x3f945076ea: ff3060ef jal ra,-1019918 # 0x3f9440e6dc
0x3f945076ee: 777d lui a4,-4096
0x3f945076f0: 70870793 addi a5,a4,1800
0x3f945076f4: 97a2 add a5,a5,s0
0x3f945076f6: 6390 ld a2,0(a5)
0x3f945076f8: 85aa mv a1,a0
0x3f945076fa: 00286517 auipc a0,2646016 # 0x3f9478d6fa
0x3f945076fe: 9e650513 addi a0,a0,-1562
0x3f94507702: ffb23097 auipc ra,-5099520 # 0x3f9402a702
0x3f94507706: 052080e7 jalr ra,ra,82
0x3f9450770a: 777d lui a4,-4096
0x3f9450770c: 70870713 addi a4,a4,1800
0x3f94507710: 0036ac17 auipc s8,3579904 # 0x3f94871710
0x3f94507714: 5e0c3c03 ld s8,1504(s8)
0x3f94507718: 9722 add a4,a4,s0
0x3f9450771a: 6318 ld a4,0(a4)
0x3f9450771c: 1f1c4783 lbu a5,497(s8)

So jalr jumped into the middle of the opcode

0x3f945076fe: 9e650513 addi a0,a0,-1562

So this could be an issue with runtime blob generation.

Regards, Vladimir
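The two observations in the message above (where the auipc/jalr pair lands, and whether that address starts an instruction) can be rechecked mechanically. This is a sketch under stated assumptions: the helper names are hypothetical, the walk assumes the dump address 0x3f945076e0 is itself an instruction boundary (as in the listing above), and encodings longer than 32 bits are ignored.

```python
def auipc_jalr_target(auipc_pc, imm20, jalr_offset):
    """Target of an `auipc rd, imm20` followed by `jalr offset(rd)`."""
    if imm20 & 0x80000:                 # sign-extend the 20-bit immediate
        imm20 -= 1 << 20
    base = auipc_pc + (imm20 << 12)     # auipc: pc + (imm20 << 12)
    return (base + jalr_offset) & ~1    # jalr clears bit 0 of the target


def instruction_starts(dump_addr, code):
    """List instruction start addresses, assuming dump_addr is a boundary."""
    starts, off = [], 0
    while off < len(code):
        starts.append(dump_addr + off)
        # Low two bits 0b11 mean a 32-bit instruction, otherwise 16-bit (RVC).
        off += 4 if code[off] & 0b11 == 0b11 else 2
    return starts


# Values copied from the stub disassembly and the hex dump in the message.
target = auipc_jalr_target(0x3f8458af22, 0xff7c, 2014)
dump = bytes.fromhex(
    "c9 02 23 34 d9 02 23 38 f9 02 ef 60 30 ff 7d 77"
    "93 07 87 70 a2 97 90 63 aa 85 17 65 28 00 13 05"
    "65 9e 97 30 b2 ff e7 80 20 05 7d 77 13 07 87 70"
    "17 ac 36 00 03 3c 0c 5e 22 97 18 63 83 47 1c 1f"
)
starts = instruction_starts(0x3f945076e0, dump)

print(hex(target))            # 0x3f94507700
print(target in starts)       # False: not an instruction boundary
```

The nearest boundaries are 0x3f945076fe and 0x3f94507702, so the computed target falls two bytes into the 32-bit `addi a0,a0,-1562`, matching the conclusion above.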

From yunyao.zxl at alibaba-inc.com Thu Sep 15 15:21:43 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Thu, 15 Sep 2022 23:21:43 +0800
Subject: Re: RVC by default?
In-Reply-To: <30868F59-B23D-438B-BF60-9C124520BC15@gmail.com>
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com> <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com> <2f3de868-7c71-40f3-a73f-478e35eb6a68.yunyao.zxl@alibaba-inc.com> <2FFEBB14-77AA-41AC-89B3-89F66607B66D@gmail.com> , <30868F59-B23D-438B-BF60-9C124520BC15@gmail.com>
Message-ID:

Hi Vladimir,

Extremely interesting; this reminds me of the https://github.com/riscv-collab/riscv-openjdk/issues/23 issue from a year ago: the "hs_err_pid4188454.log". At that time I was developing an RVC prototype; after merging it, we found this issue. We can see nearly the same interesting thing here, in a similar form. In that hs_err,
```
> x1(ra) =0x0000003fa9732c66 is at code_begin+470 in ...
```
ra is 0x0000003fa9732c66, so if we search for that ra:
```
DeoptimizationBlob
--------------------------------------------------------------------------------
Decoding CodeBlob, name: DeoptimizationBlob, at [0x0000003fa9732a90, 0x0000003fa9732d38] 680 bytes
0x0000003fa9732a90: addi sp,sp,-16
0x0000003fa9732a92: sd ra,8(sp)
0x0000003fa9732a94: sd s0,0(sp)
0x0000003fa9732a96: mv s0,sp
0x0000003fa9732a98: addi sp,sp,-240
0x0000003fa9732a9a: sd zero,0(sp)
0x0000003fa9732a9c: sd gp,8(sp)
0x0000003fa9732a9e: sd tp,16(sp)
0x0000003fa9732aa0: sd t0,24(sp)
...
...
0x0000003fa9732c36: fsd ft9,232(sp)
0x0000003fa9732c38: fsd ft10,240(sp)
0x0000003fa9732c3a: fsd ft11,248(sp)
0x0000003fa9732c3c: li s10,1
0x0000003fa9732c3e: ld a3,936(s7)
0x0000003fa9732c42: sd a3,8(s0)
0x0000003fa9732c44: sd zero,936(s7)
0x0000003fa9732c48: auipc t0,0x0
0x0000003fa9732c4c: addi t0,t0,30 # 0x0000003fa9732c66
0x0000003fa9732c50: sd t0,664(s7)
0x0000003fa9732c54: mv t0,sp
0x0000003fa9732c56: sd t0,656(s7)
0x0000003fa9732c5a: mv a0,s7
0x0000003fa9732c5c: mv a1,s10
0x0000003fa9732c5e: auipc t0,0xf2c4
0x0000003fa9732c62: jalr -1722(t0) # 0x0000003fb89f65a4 = DeoptReasonSerializer::serialize(JfrCheckpointWriter&)+220 <- here, very much like your hs_err, right? :-) It's flying away. Of course we are not calling JFR here.
0x0000003fa9732c66: sd zero,656(s7)
0x0000003fa9732c6a: sd zero,664(s7)
0x0000003fa9732c6e: mv a5,a0
0x0000003fa9732c70: lwu s10,60(a5)
```
In the hs_err you presented, the flying pc is:
```
0x0000003f8458af26: jalr 2014(t0) # 0x0000003f94507700 = SharedRuntime::look_for_reserved_stack_annotated_method(JavaThread*, frame)+892
```
Of course, resolve_static_call should definitely not be jumping into a function (SharedRuntime::look_for_reserved_stack_annotated_method) that only the signal handler calls.
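As a cross-check of the two listings above, the target of each `auipc`/`jalr` pair can be recomputed from the immediates. The following is only a minimal sketch in Python, not HotSpot code; the helper names (`sign_extend`, `auipc_jalr_target`) are made up for illustration. It assumes the standard RV64 semantics: `auipc` adds its sign-extended 20-bit immediate, shifted left by 12, to its own pc, and `jalr` adds a signed 12-bit offset and clears bit 0 of the result.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def auipc_jalr_target(auipc_pc: int, imm20: int, jalr_offset: int) -> int:
    """Jump target of `auipc rd, imm20` located at auipc_pc, followed by `jalr jalr_offset(rd)`."""
    upper = sign_extend(imm20, 20) << 12          # auipc: pc + (imm20 << 12)
    target = (auipc_pc + upper + jalr_offset) & MASK64
    return target & ~1                            # jalr clears the lowest bit


# Values copied from the two disassembly listings in this thread.
deopt_blob_target = auipc_jalr_target(0x0000003FA9732C5E, 0xF2C4, -1722)
resolve_static_target = auipc_jalr_target(0x0000003F8458AF22, 0xFF7C, 2014)

print(hex(deopt_blob_target))
print(hex(resolve_static_target))
```

Both results agree with the addresses the disassembler printed next to the `jalr` lines, which suggests (under the assumptions above) that the unexpected target is what the instruction pair actually encodes, rather than a decoding artifact.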
(Special thanks to Feilong for preserving the original issue so that I could find it.)

I have no proof that these two are the same issue, but they look so similar that I suspect they are related. The solution at that time was https://github.com/riscv-collab/riscv-openjdk/issues/23#issuecomment-939247481 , from Yadong. I would like to hear your opinion on this, since you have the reproducible environment. This is indeed an interesting issue.

Best,
Xiaolin

From yunyao.zxl at alibaba-inc.com Fri Sep 16 08:45:05 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Fri, 16 Sep 2022 16:45:05 +0800
Subject: Re: RVC by default?
In-Reply-To: <8A383140-8BA7-41CF-9822-1A7933EA6212@gmail.com>
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com> <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com> <2f3de868-7c71-40f3-a73f-478e35eb6a68.yunyao.zxl@alibaba-inc.com> <2FFEBB14-77AA-41AC-89B3-89F66607B66D@gmail.com> <30868F59-B23D-438B-BF60-9C124520BC15@gmail.com> , <8A383140-8BA7-41CF-9822-1A7933EA6212@gmail.com>
Message-ID:

Hi Vladimir,

Thank you for the newly added hs_err and the further tests. I guess it might be related to the [misaligned issue](https://mail.openjdk.org/pipermail/riscv-port-dev/2022-July/000559.html ) discussed earlier, but I am just guessing. One interesting thing I have noticed is that the crashing hs_err files often look like this:

```
...
0x0000003f7cc2f91e: mv a0,s7
0x0000003f7cc2f922: auipc t0,0x10494
0x0000003f7cc2f926: jalr 664(t0) # 0x0000003f8d0c3bba = AdapterHandlerEntry::print_adapter_on(outputStream*) const+470 <--- ??? Misaligned address: 0x0000003f7cc2f926
0x0000003f7cc2f92a: sd zero,704(s7)
0x0000003f7cc2f92e: sd zero,712(s7)
...
```

With RVC, it is certainly legal for an instruction to be located at an address that is only 2-byte aligned. But this location is relocatable, which means it will be patched.

So I suspect that something weird happened while the patching was performed.
The patching logic that touches the instruction segment, such as Assembler::patch(), does not care about alignment and simply performs 4-byte memory load operations. On machines that support misaligned accesses this would of course not go wrong, but it seems that the FPGA board we are discussing lacks this support. So I guess there might be an interesting chemical reaction happening here. I have written two patches to debug this issue and released one fastdebug build, for debugging purposes only, at https://github.com/zhengxiaolinX/jdk/releases/tag/test-unaligned . The two patches fix most of the misaligned accesses to the instruction segment. Could you give it a simple test when you are available, to see whether the issue still exists?

I am only guessing at the problem here; I hope it reveals something to us.

Thank you very much.

Best,
Xiaolin

------------------------------------------------------------------
From: Vladimir Kempik
Send Time: 2022-09-15 (Thursday) 23:25
Cc: riscv-port-dev; Aleksey Shipilev; riscv-port-dev at openjdk.org
Subject: Re: RVC by default?

Hello
Looks pretty similar to me.
For me it was a vanilla recent jdk19.
But later, when I backported the following patches to my jdk19 branch, the issue changed (to the Arena alloc issue I reported earlier):

8290496: riscv: Fix build warnings-as-errors with GCC 11
8290280: riscv: Clean up stack and register handling in interpreter
8290137: riscv: small refactoring for add_memory_int32/64
8290164: compiler/runtime/TestConstantsInError.java fails on riscv
8291952: riscv: Remove PRAGMA_NONNULL_IGNORED
8291947: riscv: fail to build after JDK-8290840
8291893: riscv: remove fence.i used in user space Backport-of:...
8292713: Unsafe.allocateInstance should be intrinsified without UseUnalignedAccesses
8292867: RISC-V: Simplify weak CAS return value handling
8292407: Improve Weak CAS VarHandle/Unsafe tests resilience under spurious failures
8293100: RISC-V: Need to save and restore callee-saved FloatRegisters in...
8293050: RISC-V: Remove redundant non-null assertions about macro-assembler
8293474: RISC-V: Unify the way of moving function pointer
8293524: RISC-V: Use macro-assembler functions as appropriate
8293566: RISC-V: Clean up push and pop registers

I'm going to bisect this list and find out what changed the behaviour.

The workaround says to update to Ubuntu 21.04, but it is not clear whether that means the runtime environment or the build environment.
For me the runtime is Ubuntu 22.04, but I build the JDK with a sysroot of Ubuntu 20.04 (for better compatibility) and GCC 11.2.

Regards, Vladimir

On 15 Sep 2022, at 18:17, Xiaolin Zheng wrote:

Hi Vladimir,

The mailing list says my e-mail exceeds 40KB, so it was rejected. But I want to send it out anyway before I finish work today, so here is a workaround. Please check:
https://gist.github.com/zhengxiaolinX/25c32853690f7ac1c125d2fe1da19710

Looking forward to your opinions.

Best,
Xiaolin

From vladimir.kempik at gmail.com Fri Sep 16 11:49:03 2022
From: vladimir.kempik at gmail.com (Vladimir Kempik)
Date: Fri, 16 Sep 2022 14:49:03 +0300
Subject: RVC by default?
In-Reply-To:
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com> <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com> <2f3de868-7c71-40f3-a73f-478e35eb6a68.yunyao.zxl@alibaba-inc.com> <2FFEBB14-77AA-41AC-89B3-89F66607B66D@gmail.com> <30868F59-B23D-438B-BF60-9C124520BC15@gmail.com> <8A383140-8BA7-41CF-9822-1A7933EA6212@gmail.com>
Message-ID: <0CA85602-5A58-4B6B-9959-474E6D2F09A4@gmail.com>

Hello.
I have applied your two fixes on top of jdk19 and tested: all the issues are gone, thanks.
Looks like the source of the issues is in my M-mode.

Kind regards, Vladimir
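The misaligned-patching hypothesis from this exchange can be illustrated with a small model. This is a hedged sketch in Python, not the code of the two patches in the test-unaligned build and not HotSpot's `Assembler::patch()`: the helper names, the 8-byte buffer, and the two instruction words are all placeholders chosen for illustration. It assumes only what the thread states: with RVC a 4-byte instruction may start at an address that is 2-byte but not 4-byte aligned, and a core without misaligned-access support cannot do a single 4-byte access there, whereas two 2-byte accesses are always aligned.

```python
import struct


def is_aligned(addr: int, size: int) -> bool:
    """True if addr is a multiple of size (in bytes)."""
    return addr % size == 0


def load_u32_strict(mem: bytearray, offset: int) -> int:
    """Model of a core without misaligned support: a 4-byte load must be 4-byte aligned."""
    if not is_aligned(offset, 4):
        raise ValueError(f"misaligned 4-byte load at offset {offset:#x}")
    return struct.unpack_from("<I", mem, offset)[0]


def load_u32_by_halfwords(mem: bytearray, offset: int) -> int:
    """Read a 32-bit instruction as two little-endian 16-bit halves (2-byte alignment suffices)."""
    if not is_aligned(offset, 2):
        raise ValueError(f"instruction offset {offset:#x} is not 2-byte aligned")
    low, high = struct.unpack_from("<HH", mem, offset)
    return low | (high << 16)


def store_u32_by_halfwords(mem: bytearray, offset: int, insn: int) -> None:
    """Write a 32-bit instruction as two little-endian 16-bit halves."""
    if not is_aligned(offset, 2):
        raise ValueError(f"instruction offset {offset:#x} is not 2-byte aligned")
    struct.pack_into("<HH", mem, offset, insn & 0xFFFF, (insn >> 16) & 0xFFFF)


# The jalr from the hs_err listing: 2-byte aligned, but not 4-byte aligned.
JALR_ADDR = 0x0000003F7CC2F926
print(is_aligned(JALR_ADDR, 2), is_aligned(JALR_ADDR, 4))

# Hypothetical code buffer: a 2-byte (compressed) slot, then a 4-byte instruction at offset 2.
code = bytearray(8)
OLD_INSN, NEW_INSN = 0x298280E7, 0x052080E7  # placeholder 32-bit instruction words
store_u32_by_halfwords(code, 2, OLD_INSN)

try:
    load_u32_strict(code, 2)          # what a 4-byte-only patcher would attempt
    strict_load_faulted = False
except ValueError:
    strict_load_faulted = True

store_u32_by_halfwords(code, 2, NEW_INSN)   # patch using halfword accesses instead
patched = load_u32_by_halfwords(code, 2)
print(strict_load_faulted, hex(patched))
```

On hardware the misaligned access would typically trap and be emulated (or mishandled) by lower-level software rather than raise an exception, so the `ValueError` here only marks the point where such a core needs help.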
> > Best, > Xiaolin > > ------------------------------------------------------------------ > From:Vladimir Kempik > Send Time:2022?9?15?(???) 23:25 > To:???(??) > Cc:riscv-port-dev ; Aleksey Shipilev ; riscv-port-dev at openjdk.org > Subject:Re: RVC by default? > > Hello > Looks pretty similar to me. > for me it was vanilla recent jdk19 > But later, when I backported next patches to my jdk19 branch, the issue became different ( Arena alloc issue I have reported earlier): > > 8290496: riscv: Fix build warnings-as-errors with GCC 11 > 8290280: riscv: Clean up stack and register handling in interpreter > 8290137: riscv: small refactoring for add_memory_int32/64 > 8290164: compiler/runtime/TestConstantsInError.java fails on riscv > 8291952: riscv: Remove PRAGMA_NONNULL_IGNORED > 8291947: riscv: fail to build after JDK-8290840 > 8291893: riscv: remove fence.i used in user space Backport-of:... > 8292713: Unsafe.allocateInstance should be intrinsified without UseUnalignedAccesses > 8292867: RISC-V: Simplify weak CAS return value handling > 8292407: Improve Weak CAS VarHandle/Unsafe tests resilience under spurious failures > 8293100: RISC-V: Need to save and restore callee-saved FloatRegisters in... > 8293050: RISC-V: Remove redundant non-null assertions about macro-assembler > 8293474: RISC-V: Unify the way of moving function pointer > 8293524: RISC-V: Use macro-assembler functions as appropriate > 8293566: RISC-V: Clean up push and pop registers > > I?m gonna bisect this list and find what changed the behaviour. > > The workaround says - update to ubuntu 21.04, but its not clear - update runtime environment or build environment. > For me the runtime is ubuntu 22.04, but I build the jdk with sysroot of ubuntu 20.04 ( for better compatibility) and gcc 11.2 > > Regards, Vladimir > 15 ????. 2022 ?., ? 18:17, Xiaolin Zheng > ???????(?): > > Hi Vladimir, > > The mailing list says my e-mail exceeds 40KB so I get rejected. 
But I want to send it out anyway before getting off today's work. So here is a work around:
> Please check:
> https://gist.github.com/zhengxiaolinX/25c32853690f7ac1c125d2fe1da19710
>
> Looking forward to your opinions.
>
> Best,
> Xiaolin
>

From yunyao.zxl at alibaba-inc.com Fri Sep 16 12:07:00 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Fri, 16 Sep 2022 20:07:00 +0800
Subject: Re: RVC by default?
In-Reply-To: <0CA85602-5A58-4B6B-9959-474E6D2F09A4@gmail.com>
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com> <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com> <2f3de868-7c71-40f3-a73f-478e35eb6a68.yunyao.zxl@alibaba-inc.com> <2FFEBB14-77AA-41AC-89B3-89F66607B66D@gmail.com> <30868F59-B23D-438B-BF60-9C124520BC15@gmail.com> <8A383140-8BA7-41CF-9822-1A7933EA6212@gmail.com> , <0CA85602-5A58-4B6B-9959-474E6D2F09A4@gmail.com>
Message-ID:

Hi Vladimir,

Thank you for the update, and good to know all's right at last.

Maybe we need support for unaligned addresses in the backend, and it seems RVC could trigger it without any effort, given its 2-byte nature. As discussed in https://mail.openjdk.org/pipermail/riscv-port-dev/2022-July/000563.html , the alignment issue is claimed by Yadong's team, so we can wait for their fixes. In the meantime, our fixes could help to work around this issue (but may harm performance a little).

Again, good to know the root cause!

Best,
Xiaolin

------------------------------------------------------------------
From: Vladimir Kempik
Send Time: 2022-09-16 (Friday) 19:49
To: ???(??)
Cc: riscv-port-dev; Aleksey Shipilev; riscv-port-dev at openjdk.org
Subject: Re: RVC by default?

Hello.
I have applied your two fixes on top of jdk19 and tested - all issues have gone, thanks.
Looks like the source of the issues is in my m-mode.

Kind Regards, Vladimir

On 16 Sep 2022, at 11:45, Xiaolin Zheng wrote:

Hi Vladimir,

Thank you for the newly-added hs_err and further tests. I guess it might be related to the [misaligned issue](https://mail.openjdk.org/pipermail/riscv-port-dev/2022-July/000559.html ) discussed here. But just guessing. I have noticed one interesting thing: the crashed hs_err files often look like this:

```
...
0x0000003f7cc2f91e: mv a0,s7
0x0000003f7cc2f922: auipc t0,0x10494
0x0000003f7cc2f926: jalr 664(t0) # 0x0000003f8d0c3bba = AdapterHandlerEntry::print_adapter_on(outputStream*) const+470 <--- ??? Misaligned address: 0x0000003f7cc2f926
0x0000003f7cc2f92a: sd zero,704(s7)
0x0000003f7cc2f92e: sd zero,712(s7)
...
```

With RVC, it is certainly legal for an instruction to sit at a 2-byte aligned address. But this location is relocatable, which means it will be patched.

So I suspect something weird happened while patching. The patching logic that touches the instruction segment, such as Assembler::patch(), does not care about alignment and simply performs 4-byte memory loads. On machines that support misaligned accesses this would not go wrong, of course; but it seems that the FPGA board we discussed lacks this support. So I guess there might be an interesting chemical reaction happening here. I have written two patches to debug this issue and released one fastdebug build, for debugging purposes only, at https://github.com/zhengxiaolinX/jdk/releases/tag/test-unaligned . The two patches fix most of the misaligned accesses to the instruction segment. I was wondering if you could give it a simple test when you are available, to see whether this issue still exists?

I am just guessing at the problem here; I hope it can reveal something to us.

Thank you very much.

Best,
Xiaolin

------------------------------------------------------------------
From: Vladimir Kempik
Send Time: 2022-09-15 (Thursday) 23:25
To: ???(??)
Cc: riscv-port-dev; Aleksey Shipilev; riscv-port-dev at openjdk.org
Subject: Re: RVC by default?

Hello
Looks pretty similar to me.
For me it was vanilla recent jdk19.
But later, when I backported the following patches to my jdk19 branch, the issue became different (the Arena alloc issue I have reported earlier):

8290496: riscv: Fix build warnings-as-errors with GCC 11
8290280: riscv: Clean up stack and register handling in interpreter
8290137: riscv: small refactoring for add_memory_int32/64
8290164: compiler/runtime/TestConstantsInError.java fails on riscv
8291952: riscv: Remove PRAGMA_NONNULL_IGNORED
8291947: riscv: fail to build after JDK-8290840
8291893: riscv: remove fence.i used in user space Backport-of:...
8292713: Unsafe.allocateInstance should be intrinsified without UseUnalignedAccesses
8292867: RISC-V: Simplify weak CAS return value handling
8292407: Improve Weak CAS VarHandle/Unsafe tests resilience under spurious failures
8293100: RISC-V: Need to save and restore callee-saved FloatRegisters in...
8293050: RISC-V: Remove redundant non-null assertions about macro-assembler
8293474: RISC-V: Unify the way of moving function pointer
8293524: RISC-V: Use macro-assembler functions as appropriate
8293566: RISC-V: Clean up push and pop registers

I'm gonna bisect this list and find what changed the behaviour.

The workaround says to update to ubuntu 21.04, but it's not clear whether that means the runtime environment or the build environment.
For me the runtime is ubuntu 22.04, but I build the jdk with a sysroot of ubuntu 20.04 (for better compatibility) and gcc 11.2.

Regards, Vladimir

On 15 Sep 2022, at 18:17, Xiaolin Zheng wrote:

Hi Vladimir,

The mailing list says my e-mail exceeds 40KB so I get rejected. But I want to send it out anyway before getting off today's work. So here is a work around:
Please check:
https://gist.github.com/zhengxiaolinX/25c32853690f7ac1c125d2fe1da19710

Looking forward to your opinions.
Best,
Xiaolin

From yangfei at iscas.ac.cn Sat Sep 17 13:12:14 2022
From: yangfei at iscas.ac.cn (yangfei at iscas.ac.cn)
Date: Sat, 17 Sep 2022 21:12:14 +0800 (GMT+08:00)
Subject: Discuss the RVC implementation
In-Reply-To: <2d7bbad2-7ade-4b38-91b5-12c4c0a91602.yunyao.zxl@alibaba-inc.com>
References: <2d7bbad2-7ade-4b38-91b5-12c4c0a91602.yunyao.zxl@alibaba-inc.com>
Message-ID: <42bdf74a.1322.1834b9400ed.Coremail.yangfei@iscas.ac.cn>

Hi Xiaolin,

Your new proposal for supporting the RVC extension looks interesting. May I ask if you have any performance data, including measured code size? I would also appreciate more details about the issue with MachBranch nodes.

Thanks,
Fei

-----Original Messages-----
From: "Xiaolin Zheng"
Sent Time: 2022-09-15 10:52:59 (Thursday)
To: riscv-port-dev
Cc:
Subject: Discuss the RVC implementation

Hi team,

I am going to describe a different implementation of RVC for our backend.

## Background

The RISC-V C extension, also known as RVC, can transform 4-byte instructions into 2-byte counterparts when they are eligible (for example, per the manual, one common requirement is that the Rd/Rs of an instruction fall within [x8,x15]).

## The current implementation in the Hotspot

The current implementation[0] is a transient one. It introduces a "CompressibleRegion", built on RAII[1], to indicate that instructions inside such a region can be safely substituted by their RVC counterparts if convertible. In other words, it works in a, say, "whitelist mode": the "CompressibleRegion" is used to manually mark out safe regions, whose instructions are then emitted in compressed form where possible. However, after a deeper look, the current "whitelist mode" turns out to have several shortcomings:

## Shortcomings of the current implementation

1. Coverage: The current implementation covers only some of the C2 match rules and a small part of the stub code, so there is obviously far more room to reduce the total code size. In my observations, some RISC-V instruction sequences generally occupy a bit more space than the AArch64 ones[2]. With the new implementation, we could reach a code size level similar to that of AArch64's generated code: in my simple observation, some cases are better and some are still worse than the AArch64 ones.

2. Maintainability: Though safe, I'd say it is not at all easy to maintain. The background is that most patchable instructions cannot easily be transformed into their shorter counterparts[3], so they must be prevented from being compressed. Hence the question: we must make sure that no patchable relocation lies inside the range of a "CompressibleRegion". For example, the string comparison intrinsic function[4] looks very delicious: transforming it and its siblings may result in a yummy compression rate. But programmers might have to check many of its callees to find out whether a single patchable relocation is hidden inside, which would make the whole intrinsic incompressible. This is an extra burden for programmers, so I bet no one would like to add a "CompressibleRegion" to his/her code :-)

3. Performance: Better performance of the generated code is a small side effect of this extension: the less I$ space the code occupies, the better the performance. Please see Andrew Waterman's paper[5] for more reference. Anyway, it looks like a higher overall compression rate is better for performance. The main issue here is that the granularity of "CompressibleRegion" is a bit coarse.

The question "why not exclude the incompressible parts instead?" comes up naturally. After some digging, we find that we only need to exclude a countable number of places that will be patched later (mostly relocations), plus several code slices whose fixed length is calculated in advance, such as "emit_static_call_stub".
All remaining instructions can be safely transformed into their RVC counterparts if eligible. So maybe, say, a "blacklist mode"?

## The new implementation

To implement the "blacklist mode" in the backend, we need two things:

1. an "IncompressibleRegion", indicating that instructions inside it must remain in their normal 4-byte form no matter what happens.
2. a simple strategy to exclude patchable instructions, mainly for relocations.

So the new strategy is tightly bound to the positions of relocations. We all know that "relocate()" in the Hotspot VM is a mark that has an explicit "start point" but no end point, and that some of the marked instructions may be patched later. Therefore, we can use a simple strategy: pass a lambda as an extra argument to give relocations "end point" semantics, which meets our requirements without extra cost. For example:

Originally:

```
__ relocate(safepoint_pc.rspec());
__ la(t0, safepoint_pc.target());
__ sd(t0, Address(xthread, JavaThread::saved_exception_pc_offset()));
```

After introducing a simple lambda as an extra argument:

```
__ relocate(safepoint_pc.rspec(), [&] {  // The relocate() hides an "IncompressibleRegion" in it
  __ la(t0, safepoint_pc.target());      // This patchable instruction sequence is incompressible
});
__ sd(t0, Address(xthread, JavaThread::saved_exception_pc_offset()));
```

Well, simple but effective. Once such countable, dynamically patchable places are excluded and all relocations are unified, all other instructions can be safely transformed without messing up the current code style. Programmers can just keep to the same style; most of the time they do not need to care whether RVC exists or not, and things get converted automatically. The proposed new sample code is, again, here[6].

## Other things worth being noticed

1. Instruction patching issues

With the C extension, the backend emits a mix of 2-byte and 4-byte instructions. It gets a little CISC-like.
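One practical consequence of mixing 2-byte and 4-byte instructions is that a 4-byte instruction may begin at an address that is only 2-byte aligned, and can then straddle a 4-byte word or a cache-line boundary. The arithmetic can be sketched as below. This is an illustration only: the helper names are hypothetical and are not HotSpot APIs, and the 64-byte cache-line size is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration; these are not HotSpot functions.

// True if the `size`-byte instruction starting at `addr` crosses a
// `boundary`-byte boundary.
constexpr bool crosses_boundary(std::uint64_t addr, std::size_t size,
                                std::size_t boundary) {
  return (addr / boundary) != ((addr + size - 1) / boundary);
}

// Rounds `addr` up to the next multiple of `alignment` (must be a power of two).
constexpr std::uint64_t align_up(std::uint64_t addr, std::size_t alignment) {
  return (addr + alignment - 1) & ~static_cast<std::uint64_t>(alignment - 1);
}

void demo() {
  // The address flagged as misaligned in the hs_err excerpt of the
  // "RVC by default?" thread: 2-byte aligned, but it crosses a 4-byte word.
  assert(crosses_boundary(0x3f7cc2f926ULL, 4, 4));
  // A 4-byte instruction at offset 62 of an assumed 64-byte line spans two lines.
  assert(crosses_boundary(0x3e, 4, 64));
  // Padding the start up to a 4-byte boundary keeps it inside one line.
  assert(!crosses_boundary(align_up(0x3e, 4), 4, 64));
}
```

A 4-byte instruction that starts on a 4-byte boundary can never cross a cache line whose size is a multiple of 4, which is why aligning the patchable sites is sufficient in this model.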
We know that Hotspot patches instructions while code is running at full speed: call instructions, for instance, and the nops used for deoptimization (the nops at the entry points, and the post-call nops after loom). Instruction patching is delicate, so we must handle such places carefully and keep these 4-byte instructions from spanning cachelines. Although they remain in the normal 4-byte form even with RVC, they might sit at an address that is only 2-byte aligned. Such cases must definitely not happen, because patching an instruction that spans cachelines loses atomicity. In short, we must align them properly, as in [7][8]. This problem can exist with RVC in either "whitelist mode" or "blacklist mode"; it is a general problem of instruction patching. In future patches I will add stronger assertions at the potential places to check alignment (trampoline_call might be a very good spot, for the patchable "static_call", "opt_virtual" and "virtual" relocations).

2. MachBranch Nodes

And MachBranch nodes: they are not easy to tame, because the "fake label"[9] in PhaseOutput::scratch_emit_size() cannot tell us the real distance to the label. But we can leave them alone in this discussion, for there will be patches to handle those afterward.

That's nearly all. Thanks for reaching here despite the verbosity.

It would be very nice to receive any suggestions.

Best,
Xiaolin

[0] Original patch: https://github.com/openjdk/riscv-port/pull/34
[1] Of course, the "CompressibleRegion" is good, I like it; and this idea is not from myself.
[2] For a simple example, the very commonly used fixed-length movptr() takes six 4-byte instructions (lui+addi+slli+addi+slli+addi, MIPS-like instructions using arithmetic with sign extension, but not anyone's fault :-) ), while the AArch64 counterpart takes only three 4-byte instructions (movz+movk+movk). Both move a 48-bit immediate. Accumulated over a whole program, the sizes differ quite a lot.
[3] 2-byte instructions have fewer bits, hence shorter immediate encodings etc. compared to their 4-byte counterparts. If we transform patchable instructions (the ones at the marks of patchable relocations, etc.) into 2-byte ones and they are later patched to a larger value or a farther distance, the shorter instructions may sadly be unable to encode the newly patched value. So we need to exclude patchable instructions (at the relocation marks etc.) from being compressed.
[4] https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/riscv.ad#L10032-L10035
[5] https://digitalassets.lib.berkeley.edu/etd/ucb/text/Waterman_berkeley_0028E_15908.pdf, Page 64: "5.4 The RVC Extension, Performance Implications"
[6] https://github.com/zhengxiaolinX/jdk/tree/REBASE-rvc-beautify
[7] https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/riscv.ad#L9873
[8] https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp#L1348-L1353
[9] https://github.com/openjdk/jdk/blob/211fab8d361822bbd1a34a88626853bf4a029af5/src/hotspot/share/opto/output.cpp#L3331-L3340

From caogui at iscas.ac.cn Mon Sep 19 14:10:12 2022
From: caogui at iscas.ac.cn (zifeihan)
Date: Mon, 19 Sep 2022 22:10:12 +0800
Subject: [vectorIntrinsics] Vector API for RISC-V
Message-ID: <09D7332A-09AA-46DF-8EBD-19CA5A484A6D@iscas.ac.cn>

# Summary

The implementation of vector nodes plays an important role in the implementation of the Vector API. In the current RISC-V backend of OpenJDK, some vector nodes have already been implemented using the RISC-V V extension, e.g. `LoadVector,StoreVector,AddVB` and so on. With these vector node implementations, the C2 compiler is able to handle some specific vector computations faster and with better performance.
However, the current vector node implementations are still incomplete compared to AArch64's SVE/NEON and x86's AVX-512; for example, `Op_LoadVectorGather,Op_StoreVectorScatter,AndReductionV` and so on are missing.
Therefore, we first want to implement more vector nodes based on the RISC-V V extension for the RISC-V backend of OpenJDK.

# Status

As we understand it, C2 vector nodes backed by the RISC-V V extension let a program use more V-extension instructions at run time. Because a single instruction then processes multiple data elements, fewer instructions are executed and the program runs faster. Currently, the Vector API works fine on the OpenJDK RISC-V platform: where a C2 vector node is not implemented, C2 falls back to the normal C2 nodes, so the lack of vectorized nodes does not make the Vector API unavailable on the OpenJDK RISC-V platform.

https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests

This test performs AndReduce operations on a set of data. By printing the C2 compilation log of the method, we can see that the method is compiled by C2, but it is implemented using normal C2 nodes and does not use the RISC-V V extension.

# Example

The following implementation of AndReduce for the Vector API uses the RISC-V V extension, which provides 32 vector registers and a set of instructions to manipulate them. These instructions enable vectorized operations similar to AArch64's SVE. They operate primarily on data in vector registers, but some of them also take scalar (general-purpose) register operands, for example `vop.vx vd, vs2, rs1, vm # integer vector-scalar vd[i] = vs2[i] op x[rs1]` .
In the more common case, where the instructions operate on vector registers, the data must first be loaded into vector registers and is then processed there. The Vector API's AndReduce is similar to the existing AddReduce: it loads data from memory/scalar registers into vector registers, operates on the vector registers, and finally moves the result to a scalar register. Since the loading and storing of vector data has already been implemented (src/hotspot/cpu/riscv/riscv_v.ad), we follow `AddReductionVI` and implement `AndReductionV`, the main node behind AndReduce in the Vector API.

```
instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{
  predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT);
  match(Set dst (AndReductionV src1 src2));
  effect(TEMP tmp);
  ins_cost(VEC_COST);
  format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t"
            "vredand.vs $tmp, $src2, $tmp\n\t"
            "vmv.x.s $dst, $tmp" %}
  ins_encode %{
    __ vsetvli(t0, x0, Assembler::e32);
    __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register);
    __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg),
                  as_VectorRegister($tmp$$reg));
    __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg));
  %}
  ins_pipe(pipe_slow);
%}
```

The `T_INT` data type is implemented here; `T_BYTE, T_SHORT, T_LONG` are implemented in separate nodes. With this node implemented and the RISC-V V extension enabled, the compilation log of the https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests method shows that the method now matches the new AndReductionV node.
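As a cross-check of what the three instructions in `reduce_andI` compute, the reduction can be modelled in scalar code. The sketch below is a reference model only, not HotSpot code: the function name is hypothetical, and it assumes that every lane of the vector is active (no mask, no tail elements).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference model of the reduce_andI node for T_INT lanes.
// Assumption: every lane of src2 is active (no mask, no tail elements).
std::int32_t reduce_and_reference(std::int32_t src1,
                                  const std::vector<std::int32_t>& src2) {
  std::int32_t tmp0 = src1;         // vmv.s.x    tmp, src1      : tmp[0] = src1
  for (std::int32_t lane : src2) {  // vredand.vs tmp, src2, tmp : tmp[0] &= each lane
    tmp0 &= lane;
  }
  return tmp0;                      // vmv.x.s    dst, tmp       : dst = tmp[0]
}

void demo() {
  // Eight 32-bit lanes, as in a 256-bit int vector.
  const std::vector<std::int32_t> lanes = {0xF0F0, 0xFF00, 0xFFFF, 0xF000,
                                           0xF00F, 0xFFF0, 0xF0FF, 0xFF0F};
  // -1 has all bits set, so it acts as the identity for AND.
  assert(reduce_and_reference(-1, lanes) == 0xF000);
}
```

The compilation log that follows shows the corresponding three-instruction sequence as emitted by C2.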
```
27c B21: # out( B25 B22 ) <- in( B20 ) Freq: 32.4376
27c # castII of R8, #@castII
27c addw R7, R8, zr #@convI2L_reg_reg
280 slli R29, R7, (#2 & 0x3f) #@lShiftL_reg_imm
284 spill [sp, #24] -> R7 # spill size = 64
288 add R7, R7, R29 # ptr, #@addP_reg_reg
28c addi R7, R7, #16 # ptr, #@addP_reg_imm
290 vle V2, [R7] #@loadV
298 ....
2c0 vmv.s.x V1, R7 #@reduce_andI
    vredand.vs V1, V2, V1
    vmv.x.s R28, V1
```

# Test tips

1. After implementing each vector node, write test cases for that node, perform rigorous functional testing, and run the complete set of vector tests in jtreg.
2. Print the compilation log of the Java test method that uses the vector node, and analyze it to confirm that the C2 vector node optimization is occurring correctly.
3. We plan to add JMH test cases for each C2 vector node to compare performance before and after the node is added.
4. Since we have not found a physical machine that can execute the RISC-V V extension, the above tests were run in QEMU with the RISC-V V extension v1.0 enabled.

# Performance Test

We continue to use https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests to compare performance before and after the RISC-V V extension nodes are added.

The methods ADDReduceInt256VectorTests, ANDReduceInt256VectorTests, ORReduceInt256VectorTests, XORReduceInt256VectorTests, negInt256VectorTests and NEGInt256VectorTests under `test/jdk/jdk/incubator/vector` were tested. The sum of the execution times shows a ~50.7% reduction on average.

# Goals and roadmap

Considering code safety and testing, we plan to implement the Vector API step by step, according to the C2 vector node types it requires. For example, we will group `AndReductionV, OrReductionV, XorReductionV` into one category, `VectorCastB2X, VectorCastS2X, VectorCastD2X` into another, and so on, and then we will submit PRs upstream per C2 vector node type.
In order to keep the code safe, we will implement the simple vector nodes first, from simple to hard, and avoid modifying other public code in the process for the time being.
After RISC-V's missing vectorization nodes are added, we will adjust and announce the next work plan in time. These are our goals and plans, and we welcome suggestions and corrections from the community.

From vladimir.kempik at gmail.com Mon Sep 19 21:27:02 2022
From: vladimir.kempik at gmail.com (Vladimir Kempik)
Date: Tue, 20 Sep 2022 00:27:02 +0300
Subject: RVC by default?
In-Reply-To: <0CA85602-5A58-4B6B-9959-474E6D2F09A4@gmail.com>
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com> <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com> <2f3de868-7c71-40f3-a73f-478e35eb6a68.yunyao.zxl@alibaba-inc.com> <2FFEBB14-77AA-41AC-89B3-89F66607B66D@gmail.com> <30868F59-B23D-438B-BF60-9C124520BC15@gmail.com> <8A383140-8BA7-41CF-9822-1A7933EA6212@gmail.com> <0CA85602-5A58-4B6B-9959-474E6D2F09A4@gmail.com>
Message-ID:

Hello

Some numbers on performance.
100% - time of the average philosophers run on repetitions 4-9 (filtering out some too slow results), on default jdk19 without RVC.

To make results more or less repeatable, I had to "isolate" 2 of 4 harts from the rest of the system. These two harts can only be assigned to the process explicitly.
Basically, two harts are running only java threads. Java threads are running only on these two harts. This minimizes the effect of other processes on the result (especially on low-clocked fpga cores).

jdk19 - 100% +- 4%
jdk19 + JDK-8294012 - 100% +- 4%
jdk19 + unaligned access patch from Xiaolin + JDK-8294012 - 102% +- 2.5%
jdk19 + unaligned access patch from Xiaolin + JDK-8294012 + UseRVC - 95% +- 1%

Regards, Vladimir

From paul.sandoz at oracle.com Mon Sep 19 18:34:49 2022
From: paul.sandoz at oracle.com (Paul Sandoz)
Date: Mon, 19 Sep 2022 18:34:49 +0000
Subject: [vectorIntrinsics] Vector API for RISC-V
In-Reply-To: <09D7332A-09AA-46DF-8EBD-19CA5A484A6D@iscas.ac.cn>
References: <09D7332A-09AA-46DF-8EBD-19CA5A484A6D@iscas.ac.cn>
Message-ID: <7027E6B5-79A9-4701-888A-9C2891845D99@oracle.com>

Hi,

Thank you, very encouraging, and looks a reasonable plan, some suggestions below. Support for the Vector API should more easily result in better support for the auto-vectorizer.

1. I think you can submit PRs to https://github.com/openjdk/jdk/ and then those changes can be brought into the Panama repo if need be. That assumes support for RISC-V V extension does not require substantial adjustments to C2 or the API, and from what you say RISC-V does not require such adjustments.
Note: going forward I expect most architectural development to focus on alignment with Valhalla's value classes/types and support for vector calling conventions. There is also work to research support for FP16 vectors, which is also connected with Valhalla, which can be considered more incremental.

2. The Panama repository also has support for generating JMH benchmarks in addition to unit tests, you may find those helpful, rather than writing your own.
Testing-wise I would have liked to revamp the test framework to generate Java tests from a Java code and leverage HotSpot's IR Test Framework [1]. Alas, I don't have the time right now.
We could do more to align with HotSpot's IR framework to not only assert on results, but also assert that C2 IR nodes are generated.
(It may be the test generator needs to query the platform for supported vector nodes, via say enhancements to the WhiteBox API).
While JMH performance tests have their place, using the IR framework is, I think, a better approach longer term for testing.

Paul.

[1] https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/README.md
From caogui at iscas.ac.cn Tue Sep 20 06:24:49 2022 From: caogui at iscas.ac.cn (zifeihan) Date: Tue, 20 Sep 2022 14:24:49 +0800 Subject: [vectorIntrinsics] Vector API for RISC-V In-Reply-To: <7027E6B5-79A9-4701-888A-9C2891845D99@oracle.com> References: <09D7332A-09AA-46DF-8EBD-19CA5A484A6D@iscas.ac.cn> <7027E6B5-79A9-4701-888A-9C2891845D99@oracle.com> Message-ID: <661876D7-DA27-4FB4-B8AC-E7DA622A926F@iscas.ac.cn> Hi Paul, Thank you for your reply and suggestions. 1. We will be submitting the code to https://github.com/openjdk/jdk next. 2. We will study and research Valhalla, and look forward to participating in it in the future. 3. We appreciate the JMH benchmark tests provided by the Panama repository, which will make it easier for us to do some performance verification and testing. 4. We have discussed testing IR node generation by assertion before, but we have not found a reasonable way to do it, and we will try to verify it by HotSpot's IR testing framework. Thanks again, zifeihan > On Sep 20, 2022, at 02:34, Paul Sandoz wrote: > > Hi, > > Thank you, very encouraging, and looks a reasonable plan, some suggestions below. Support for the Vector API should more easily result in better support for the auto-vectorizer. > > 1. I think you can submit PRs to https://github.com/openjdk/jdk/ and then those changes can be brought into the Panama repo if need be. That assumes support for RISC-V V extension does not require substantial adjustments to C2 or the API, and from what you say RISC-V does not require such adjustments. > Note: going forward I expect most architectural development to focus on alignment with Valhalla?s value classes/types and support for vector calling conventions. There is also work to research support for FP16 vectors, which is also connected with Valhalla, which can be considered more incremental. > > 2. 
The Panama repository also has support for generating JMH benchmarks in addition to unit tests, you may find those helpful, rather than writing your own.
> Testing-wise I would have liked to revamp the test framework to generate Java tests from a Java code and leverage HotSpot's IR Test Framework [1]. Alas, I don't have the time right now.
> We could do more to align with HotSpot's IR framework to not only assert on results, but also assert that C2 IR nodes are generated. (It may be the test generator needs to query the platform for supported vector nodes, via say enhancements to the WhiteBox API).
> While JMH performance tests have their place using the IR framework is I think better approach longer term for testing.
>
> Paul.
>
> [1] https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/README.md
>
>
>> On Sep 19, 2022, at 7:10 AM, zifeihan wrote:
>>
>> # Summary
>>
>> The implementation of vector nodes plays an important role in the implementation of the Vector-API. In the current RISC-V backend implementation of the OpenJDK, some vector nodes have been implemented using the RISC-V V extensions, e.g. `LoadVector,StoreVector,AddVB` and so on. With these vector node implementations, the C2 compiler is able to handle some specific vector computations faster and with better performance. However, the current vector node implementations are still lacking compared to AARCH64's SVE/NEON and X86's avx512, for example: `Op_LoadVectorGather,Op_StoreVectorScatter,AndReductionV` and so on.
>> Therefore, we currently want to make more vector node implementations based on RISC-V V extensions for the RISC-V backend of OpenJDK first.
>>
>> # Status
>>
>> According to our understanding, the C2 vector node of the RISC-V V extension currently exists to allow the program to use more of the RISC-V V extension during runtime, thus reducing the number of assembly instructions (using a single instruction, multiple data mode), thus allowing for faster execution of the program. Currently, the Vector API works fine on the OpenJDK RISC-V platform, but because some vector nodes are missing, the Vector API C2 mode uses the normal C2 nodes for the unimplemented C2 vector nodes, so the lack of vectorized nodes does not make the Vector API unavailable on the OpenJDK RISC-V platform.
>>
>> https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests
>>
>> This test performs AndReduce operations on a set of data. By printing the C2 execution log of the method, we can see that the method also performs C2 compilation, but it is implemented using normal C2 nodes and does not use the RISC-V V extensions.
>>
>> # Example
>>
>> The following implementation of AndReduce for the Vector API uses the RISC-V V extension, which provides 32 vector registers and an instruction set to manipulate them. These instructions enable vectorization operations similar to AARCH64's SVE, where the RISC-V V extension instruction set primarily operates on vector register data. Some RISC-V V extension instructions can also take scalar (normal) registers as operands, for example `vop.vx vd, vs2, rs1, vm # integer vector-scalar vd[i] = vs2[i] op x[rs1]`. For the more common case where RISC-V V extension instructions operate on vector registers, the data needs to be loaded into the vector registers first, and then the RISC-V V extension instructions operate on the vector registers.
The Vector API's AndReduce is similar to the existing AddReduce in that it loads data from memory/scalar registers into vector registers, then operates on the vector registers, and finally moves the data to the scalar registers. Since the loading and storage of vector data has already been implemented (src/hotspot/cpu/riscv/riscv_v.ad), we refer to `AddReductionVI` and implement `AndReductionV`, the main implementation node of AndReduce for the Vector API.
>>
>> ```
>> instruct reduce_andI(iRegINoSp dst, iRegIorL2I src1, vReg src2, vReg tmp) %{
>>   predicate(n->in(2)->bottom_type()->is_vect()->element_basic_type() == T_INT);
>>   match(Set dst (AndReductionV src1 src2));
>>   effect(TEMP tmp);
>>   ins_cost(VEC_COST);
>>   format %{ "vmv.s.x $tmp, $src1\t#@reduce_andI\n\t"
>>             "vredand.vs $tmp, $src2, $tmp\n\t"
>>             "vmv.x.s $dst, $tmp" %}
>>   ins_encode %{
>>     __ vsetvli(t0, x0, Assembler::e32);
>>     __ vmv_s_x(as_VectorRegister($tmp$$reg), $src1$$Register);
>>     __ vredand_vs(as_VectorRegister($tmp$$reg), as_VectorRegister($src2$$reg),
>>                   as_VectorRegister($tmp$$reg));
>>     __ vmv_x_s($dst$$Register, as_VectorRegister($tmp$$reg));
>>   %}
>>   ins_pipe(pipe_slow);
>> %}
>> ```
>>
>> The `T_INT` data type is implemented here, and the implementation is given in a different node for `T_BYTE, T_SHORT, T_LONG`. After implementation, with the RISC-V V extension enabled, the compilation log of the https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests method is printed; it shows that the method now matches the new AndReductionV node.
>>
>> ```
>> 27c B21: # out( B25 B22 ) <- in( B20 ) Freq: 32.4376
>> 27c # castII of R8, #@castII
>> 27c addw R7, R8, zr #@convI2L_reg_reg
>> 280 slli R29, R7, (#2 & 0x3f) #@lShiftL_reg_imm
>> 284 spill [sp, #24] -> R7 # spill size = 64
>> 288 add R7, R7, R29 # ptr, #@addP_reg_reg
>> 28c addi R7, R7, #16 # ptr, #@addP_reg_imm
>> 290 vle V2, [R7] #@loadV
>> 298 ....
>> 2c0 vmv.s.x V1, R7 #@reduce_andI
>>     vredand.vs V1, V2, V1
>>     vmv.x.s R28, V1
>> ```
>>
>> # Test tips
>>
>> 1. After implementing each vector node, write test cases for that node, perform rigorous functional testing, and run the complete vector tests in jtreg.
>> 2. Print the compilation log of the Java test case method that uses the vector node, and analyze it to confirm that the optimization of the C2 Vector Node is occurring correctly.
>> 3. We plan to add JMH test cases for each C2 vector node to compare the performance before and after adding it.
>> 4. Since no physical machine capable of executing RISC-V V extensions has been found, the above tests were performed with the RISC-V V extensions v1.0 enabled in QEMU.
>>
>> # Performance Test
>>
>> We continue to use https://github.com/openjdk/jdk/blob/master/test/jdk/jdk/incubator/vector/Int256VectorTests.java#ANDReduceInt256VectorTests to test the performance before and after implementing the added RISC-V V extension nodes.
>>
>> The methods ADDReduceInt256VectorTests, ANDReduceInt256VectorTests, ORReduceInt256VectorTests, XORReduceInt256VectorTests, negInt256VectorTests and NEGInt256VectorTests under `test/jdk/jdk/incubator/vector` are tested. The sum of execution time shows ~50.7% reduction on average.
>>
>> # Goals and roadmap
>>
>> Considering code safety and testing, we plan to implement the Vector API step by step according to the C2 Vector Node types required by the Vector API.
For example, we will separate `AndReductionV, OrReductionV, XorReductionV` into one class, `VectorCastB2X, VectorCastS2X, VectorCastD2X` into one class, and so on, and then we will submit PRs upstream according to the C2 Vector Node type. In order to keep the code safe, we will implement the simple vector nodes first, from simple to hard, and avoid modifying other public code in the process for the time being.
>> After RISC-V's missing vectorization nodes are added, we will adjust and announce the next work plan in time. These are our goals and plans, and we welcome suggestions and corrections from the community.
>

From yunyao.zxl at alibaba-inc.com Tue Sep 20 10:39:37 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Tue, 20 Sep 2022 18:39:37 +0800
Subject: Re: RVC by default?
In-Reply-To:
References: <4d02fa41-4c35-4186-bb14-8eca06f33d12.yunyao.zxl@alibaba-inc.com> <1E6F3C09-4F32-4C75-B445-D16ED06B568E@gmail.com> <2f3de868-7c71-40f3-a73f-478e35eb6a68.yunyao.zxl@alibaba-inc.com> <2FFEBB14-77AA-41AC-89B3-89F66607B66D@gmail.com> <30868F59-B23D-438B-BF60-9C124520BC15@gmail.com> <8A383140-8BA7-41CF-9822-1A7933EA6212@gmail.com> <0CA85602-5A58-4B6B-9959-474E6D2F09A4@gmail.com>,
Message-ID: <1843ad60-a812-43e6-8848-bf016b17a249.yunyao.zxl@alibaba-inc.com>

Hi Vladimir,

The result looks good: the unit is time, so lower is better. Thank you for the effort you put into these evaluations!

Best,
Xiaolin

------------------------------------------------------------------
From: Vladimir Kempik
Send Time: 2022-09-20 (Tue) 05:27
To: Xiaolin Zheng
Cc: riscv-port-dev ; Aleksey Shipilev ; riscv-port-dev at openjdk.org
Subject: Re: RVC by default?

Hello. Some numbers on performance: 100% is the average time of a philosophers run over repetitions 4-9 (filtering out some too-slow results) on default jdk19 without RVC. To make results more or less repeatable, I had to "isolate" 2 of 4 harts from the rest of the system. These two harts can only be assigned to the process explicitly.
Basically, two harts are running only Java threads, and Java threads are running only on these two harts. This minimizes the effect of other processes on the result (especially on low-clocked FPGA cores).

jdk19 - 100% +- 4%
jdk19 + JDK-8294012 - 100% +- 4%
jdk19 + unaligned access patch from Xiaolin + JDK-8294012 - 102% +- 2.5%
jdk19 + unaligned access patch from Xiaolin + JDK-8294012 + UseRVC - 95% +- 1%

Regards, Vladimir

On 16 Sep 2022, at 14:49, Vladimir Kempik wrote:

Hello. I have applied your two fixes on top of jdk19 and tested - all issues have gone, thanks. Looks like the source of the issues is in my m-mode.

Kind Regards, Vladimir

On 16 Sep 2022, at 11:45, Xiaolin Zheng wrote:

Hi Vladimir,

Thank you for the newly-added hs_err and further tests. I guess it might have some relationship with the discussed [misaligned issue](https://mail.openjdk.org/pipermail/riscv-port-dev/2022-July/000559.html) here. But just guessing.

I have noticed one interesting thing: the crashed hs_err files often look like:

```
...
0x0000003f7cc2f91e: mv a0,s7
0x0000003f7cc2f922: auipc t0,0x10494
0x0000003f7cc2f926: jalr 664(t0) # 0x0000003f8d0c3bba = AdapterHandlerEntry::print_adapter_on(outputStream*) const+470 <--- ??? Misaligned address: 0x0000003f7cc2f926
0x0000003f7cc2f92a: sd zero,704(s7)
0x0000003f7cc2f92e: sd zero,712(s7)
...
```

With RVC, it is certainly legal for an instruction to be located at a 2-byte aligned address. But this location is relocatable, which means it will be patched. So I suspect something weird happened when performing the patching. The patching logic, when referencing the instruction segment, does not care about the alignment but only performs 4-byte memory load operations, such as in Assembler::patch(). On machines with misaligned address support, it would not go wrong of course; but it seems that the FPGA board we discussed lacks this support. So I guess there might be an interesting chemical reaction happening here.
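To make the alignment concern above concrete, here is a minimal sketch (not HotSpot code, and not the contents of the two patches) of reading and writing a 4-byte instruction word through two 16-bit parcels, so that a 2-byte aligned address never needs a misaligned 4-byte access. The helper names `load_insn32` and `store_insn32` are hypothetical, and the sketch assumes a little-endian host, as RISC-V is.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration only; these are not HotSpot APIs.
// A 4-byte instruction that sits at a 2-byte aligned address is accessed
// as two 16-bit parcels instead of one (possibly misaligned) 32-bit word.
// Assumes a little-endian host, so the low parcel comes first in memory.

static inline uint32_t load_insn32(const uint8_t* addr) {
  uint16_t lo, hi;
  std::memcpy(&lo, addr, sizeof(lo));       // low 16 bits of the instruction
  std::memcpy(&hi, addr + 2, sizeof(hi));   // high 16 bits of the instruction
  return static_cast<uint32_t>(lo) | (static_cast<uint32_t>(hi) << 16);
}

static inline void store_insn32(uint8_t* addr, uint32_t insn) {
  const uint16_t lo = static_cast<uint16_t>(insn & 0xffffu);
  const uint16_t hi = static_cast<uint16_t>(insn >> 16);
  std::memcpy(addr, &lo, sizeof(lo));
  std::memcpy(addr + 2, &hi, sizeof(hi));
}
```

Note that splitting the store this way is not atomic, so a sketch like this would only be suitable for code that is not being executed concurrently; it says nothing about how live patching has to be done.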
I have written two patches to debug this issue and released one fastdebug build only for debugging purposes, and I put it at https://github.com/zhengxiaolinX/jdk/releases/tag/test-unaligned. The two patches fix most of the misaligned accesses to the instruction segment. I was wondering if you could have a simple test of it when you are available, to see whether this issue still exists? I am just guessing at the problem here; I hope it can reveal something to us.

Thank you very much.

Best,
Xiaolin

------------------------------------------------------------------
From: Vladimir Kempik
Send Time: 2022-09-15 (Thu) 23:25
To: Xiaolin Zheng
Cc: riscv-port-dev ; Aleksey Shipilev ; riscv-port-dev at openjdk.org
Subject: Re: RVC by default?

Hello. Looks pretty similar to me. For me it was a vanilla recent jdk19. But later, when I backported the following patches to my jdk19 branch, the issue became different (the Arena alloc issue I reported earlier):

8290496: riscv: Fix build warnings-as-errors with GCC 11
8290280: riscv: Clean up stack and register handling in interpreter
8290137: riscv: small refactoring for add_memory_int32/64
8290164: compiler/runtime/TestConstantsInError.java fails on riscv
8291952: riscv: Remove PRAGMA_NONNULL_IGNORED
8291947: riscv: fail to build after JDK-8290840
8291893: riscv: remove fence.i used in user space
Backport-of:...
8292713: Unsafe.allocateInstance should be intrinsified without UseUnalignedAccesses
8292867: RISC-V: Simplify weak CAS return value handling
8292407: Improve Weak CAS VarHandle/Unsafe tests resilience under spurious failures
8293100: RISC-V: Need to save and restore callee-saved FloatRegisters in...
8293050: RISC-V: Remove redundant non-null assertions about macro-assembler
8293474: RISC-V: Unify the way of moving function pointer
8293524: RISC-V: Use macro-assembler functions as appropriate
8293566: RISC-V: Clean up push and pop registers

I'm going to bisect this list and find what changed the behaviour.
The workaround says to update to Ubuntu 21.04, but it's not clear whether that means the runtime environment or the build environment. For me the runtime is Ubuntu 22.04, but I build the jdk with a sysroot of Ubuntu 20.04 (for better compatibility) and gcc 11.2.

Regards, Vladimir

On 15 Sep 2022, at 18:17, Xiaolin Zheng wrote:

Hi Vladimir,

The mailing list says my e-mail exceeds 40KB, so it got rejected. But I want to send it out anyway before getting off today's work. So here is a workaround:

Please check: https://gist.github.com/zhengxiaolinX/25c32853690f7ac1c125d2fe1da19710

Looking forward to your opinions.

Best,
Xiaolin

From yunyao.zxl at alibaba-inc.com Tue Sep 20 10:44:21 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Tue, 20 Sep 2022 18:44:21 +0800
Subject: Re: Discuss the RVC implementation
In-Reply-To: <42bdf74a.1322.1834b9400ed.Coremail.yangfei@iscas.ac.cn>
References: <2d7bbad2-7ade-4b38-91b5-12c4c0a91602.yunyao.zxl@alibaba-inc.com>, <42bdf74a.1322.1834b9400ed.Coremail.yangfei@iscas.ac.cn>
Message-ID: <8b9f82ef-6bca-4165-840c-3c06173d2207.yunyao.zxl@alibaba-inc.com>

Hi Felix,

TL;DR of code size evaluations, stably reproduced:

If a piece of code is 100 bytes full of 4-byte instructions:
1. In the current master branch with RVC, it may shrink to 95 bytes. (compression rate is 5%)
2. With the new implementation at [1], it may shrink to 84 bytes. (compression rate is 16%; ~11% more than master)
3. With the special patch at [2] (a special optimization of compressing two "slli"s in the movptr), it may shrink to 79 bytes. (compression rate is 21%; ~5% in addition to the previous one, because movptr() is used in quite a large quantity.
But this patch might need further beautification for the hard-coded enumeration and will cause complexity for reviewing, so we'd postpone that temporarily) These are evaluated by a hand-written toy histogram[3], excluding the scratch_emit, and tested with release build (for fastdebug build, the compression rate is far more than release mode; but we may not care about that), only for evaluation purposes. About the performance, I need more time to make some more evaluations. Due to the patch of the new implementation should wait for the loom port merging first, we have plenty of time then. I am going to make a long run of specjbb2015 to measure it on average. Will update the result in the same thread. --------------- Precisely, here are some detailed data about the code size. This histogram mentioned above presents all the instructions emitted in a JVM process, shown when exiting. For example, the picture in [4]. The second row (RVC instructions) + The third row (4-byte normal instructions) = The fourth row (total instructions); sorted by the fourth row. If RVC is not enabled, the second row is always 0 and the third row is always equal to the fourth row. Tested with the new RVC [1] branch with springboot / springboot-petclinic / SPECjvm2008 / SPECjbb2015(when exiting), the results are all a stable ~84%. The SPECjvm2008 results are at [5]. Please search the keywords "Ideally Code Size Could Shrink to" in the files in the browser for more details. P.S.: the result with the special patch [2] is about ~79% at [6] for future references, but might be reserved for now. 
Best, Xiaolin [1] https://github.com/zhengxiaolinX/jdk/commits/REBASE-rvc-beautify [2] https://github.com/zhengxiaolinX/jdk/commit/3a4d80197da0c497c844016b9a9fbae541eca9c8 [3] https://github.com/zhengxiaolinX/jdk/commit/5312cbd8ac860f47b109ab2a99750041865c018d [4] https://github.com/openjdk/riscv-port/pull/34 [5] http://cr.openjdk.java.net/~xlinzheng/rvc-size/size/ [6] http://cr.openjdk.java.net/~xlinzheng/rvc-size/size-full/ ------------------------------------------------------------------ From:yangfei Send Time:2022?9?17?(???) 21:12 To:???(??) Cc:riscv-port-dev Subject:Re: Discuss the RVC implementation Hi Xiaolin, Your new proposal for supporting the RVC extension looks interesting. May I ask if you have any performance data including code size measured? Also it's appreciated if you have more details about the issue with MachBranch nodes. Thanks, Fei -----Original Messages----- From:"Xiaolin Zheng" Sent Time:2022-09-15 10:52:59 (Thursday) To: riscv-port-dev Cc: Subject: Discuss the RVC implementation Hi team, I am going to describe a different implementation of RVC for our backend. ## Background The RISC-V C extension, also known as RVC, could transform 4-byte instructions to 2-byte counterparts when eligible (for example, as the manual, Rd/Rs of instruction ranges from [x8,x15] might be one common requirement, etc.). ## The current implementation in the Hotspot The current implementation[0] is a transient one, introducing a "CompressibleRegion" by using RTTI[1] to indicate that instructions inside these regions can be safely substituted by the RVC counterparts, if convertible; and the implementation also uses a, say, "whitelist mode" by using the "CompressibleRegion" mentioned above to "manually mark out safe regions", then batch emit them if could. However, after a deeper look, we might discover the current "whitelist mode" has several shortages: ## Shortages of the current implementation 1. 
Coverages: The current implementation only covers some of C2 match rules, and only some small part of stub code, so there is obviously far more space to reduce the total code size. In my observations, some RISC-V instruction sequences generally occupy a bit more space than AArch64 ones[2]. With the new implementations, we could achieve a code size level alike AArch64's generated code. Some better, some still worse than AArch64 one in my simple observation. 2. Though safe, I'd say it's very much not easy to maintain. The background is, most of the patchable instructions cannot be easily transformed into their shorter counterparts[3], and they need to be prevented from being compressed. So comes the question: we must make sure no patchable relocation is inside the range of a "CompressibleRegion". For example, the string comparison intrinsic function[4] looks very delicious: transforming it and its siblings may result in a yummy compression rate. But programmers might have to check lots of its callees to find if there is just one patchable relocation hidden inside that causes the whole intrinsic incompressible. This could cause extra burden for programmers, so I bet no one would like to add "CompressibleRegion" for his/her code :-) 3. Performance: Better performance of generated code is a little side effect this extension gives us, the smaller the I$ size, the better performance though - please see Andrew Waterman's paper[5] for more reference there. Anyway, it looks like a higher general compression rate is better for performance. The main issue here is the granularity of "CompressibleRegion" is a bit coarse. "Why not exclude the incompressible parts" may come up to us naturally. And after some diggings, we may find: we just need to exclude countable places that would be patched back (mostly relocations), and several code slices with a fixed length, which will be calculated, such as "emit_static_call_stub". 
All remaining instructions could be safely transformed into RVC counterparts if eligible. So maybe, say, the "blacklist mode"?

## The new implementation

To implement the "blacklist mode" in the backend, we need two things:
1. an "IncompressibleRegion", indicating instructions inside it should remain in their normal 4-byte form no matter what happens.
2. a simple strategy to exclude patchable instructions, mainly for relocations.

So we can see the new strategy is tightly bound to relocations' positions: we all know the "relocate()" in the Hotspot VM is a mark that only has an explicit "start point" without an end point, and some of them could be patched back. Therefore, we can use a simple strategy: introduce a lambda as another argument to assign "end point" semantics to the relocations, to meet our requirements without extra costs.

For example:

Originally:
```
__ relocate(safepoint_pc.rspec());
__ la(t0, safepoint_pc.target());
__ sd(t0, Address(xthread, JavaThread::saved_exception_pc_offset()));
```

After introducing a simple lambda as an extra argument:
```
__ relocate(safepoint_pc.rspec(), [&] {   // The relocate() hides an "IncompressibleRegion" in it
  __ la(t0, safepoint_pc.target());       // This patchable instruction sequence is incompressible
});
__ sd(t0, Address(xthread, JavaThread::saved_exception_pc_offset()));
```

Well, simple but effective. Excluding such countable dynamically patchable places and unifying all relocations, all other instructions can be safely transformed, without messing up the current code style. Programmers could just keep following the same style; most of the time they have no need to care about whether the RVC exists or not, and things get converted automatically.

The proposed new sample code is, again, here[6].

## Other things worth being noticed

1. Instruction patching issues

With the C extension, the backend mixes both 2-byte and 4-byte instructions. It gets a little CISC alike.
We know the Hotspot would patch instructions when code is running at full speed, such as call instructions, nops used for deoptimizations (the nops at the entry points, and post-call nops after loom). Instruction patching is delicate so we must carefully handle such places, to keep these 4-byte instructions from spanning cachelines. Though remaining a 4-byte normal form even with RVC, they might sit at a 2-byte aligned boundary. Such cases should definitely not happen, for patching such places spanning cachelines would lose the atomicity. So shortly, we must properly align them, such as [7][8]. Such a problem could exist with RVC, no matter "whitelist mode" or "blacklist mode". It is a general problem for instruction patching. I will add more strong assertions to the potential places (trampoline_call might be a very good spot, for patchable "static_call", "opt_virtual" and "virtual" relocations) to check alignment in the future patches. 2. MachBranch Nodes And MachBranch nodes: they are not easy to be tamed because the "fake label"[9] in PhaseOutput::scratch_emit_size() cannot tell us the real distance of the label. But we can leave them alone in this discussion, for there will be patches to handle those afterward. That's nearly all. Thanks for reaching here despite the verbosity. It would be very nice to receive any suggestions. Best, Xiaolin [0] Original patch: https://github.com/openjdk/riscv-port/pull/34 [1] Of course, the "CompressibleRegion" is good, I like it; and this idea is not from myself. [2] For a simple example, a much commonly used fixed-length movptr() uses up six 4-byte instructions (lui+addi+slli+addi+slli+addi, MIPS alike instructions using arithmetical calculations with signed extensions, but not anyone's fault :-) ), while the AArch64 counterpart only takes three 4-byte instructions (movz+movk+movk). They are both going to mov a 48-bit immediate. After accumulation, the size differs quite a lot. 
[3] 2-byte instructions have fewer bits, so comes shorter immediate encoding etc. compared to the 4-byte counterparts. After we transform patchable instructions (ones at marks of patchable relocations, etc.) to 2-byte ones, when they are patched to a larger value or farther distances afterward, it is possible that they sadly find themselves, the shorter instructions, cannot cover the newly patched value. So we need to exclude patchable instructions (at the relocation marks etc.) from being compressed. [4] https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/riscv.ad#L10032-L10035 [5] https://digitalassets.lib.berkeley.edu/etd/ucb/text/Waterman_berkeley_0028E_15908.pdf, Page 64: "5.4 The RVC Extension, Performance Implications" [6] https://github.com/zhengxiaolinX/jdk/tree/REBASE-rvc-beautify [7] https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/riscv.ad#L9873 [8] https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp#L1348-L1353 [9] https://github.com/openjdk/jdk/blob/211fab8d361822bbd1a34a88626853bf4a029af5/src/hotspot/share/opto/output.cpp#L3331-L3340 From yangfei at iscas.ac.cn Tue Sep 20 11:48:35 2022 From: yangfei at iscas.ac.cn (yangfei at iscas.ac.cn) Date: Tue, 20 Sep 2022 19:48:35 +0800 (GMT+08:00) Subject: Discuss the RVC implementation In-Reply-To: <8b9f82ef-6bca-4165-840c-3c06173d2207.yunyao.zxl@alibaba-inc.com> References: <2d7bbad2-7ade-4b38-91b5-12c4c0a91602.yunyao.zxl@alibaba-inc.com>, <42bdf74a.1322.1834b9400ed.Coremail.yangfei@iscas.ac.cn> <8b9f82ef-6bca-4165-840c-3c06173d2207.yunyao.zxl@alibaba-inc.com> Message-ID: Hi Xiaolin, > -----Original Messages----- > From: "Xiaolin Zheng" > Sent Time: 2022-09-20 18:44:21 (Tuesday) > To: yangfei > Cc: riscv-port-dev > Subject: Re: Discuss the RVC implementation > > Hi Felix, > > TL;DR of code size evaluations, stably reproduced: > > If a piece 
of code is 100 bytes full of 4-byte instructions: > 1. In the current master branch with RVC, it may shrink to 95 bytes. (compression rate is %5) > 2. With the new implementation at [1], it may shrink to 84 bytes. (compression rate is 16%; ~11% more than master) > 3. With the special patch at [2] (a special optimization of compressing two "slli"s in the movptr), it may shrink to 79 bytes. (compression rate is 21%; ~%5 addition to the previous one, because movptr() is used in a quite big quantity. But this patch might need further beautification for the hard-coded enumeration and will cause complexity for reviewing, so we'd postpone that temporarily) > These are evaluated by a hand-written toy histogram[3], excluding the scratch_emit, and tested with release build (for fastdebug build, the compression rate is far more than release mode; but we may not care about that), only for evaluation purposes. > > About the performance, I need more time to make some more evaluations. Due to the patch of the new implementation should wait for the loom port merging first, we have plenty of time then. I am going to make a long run of specjbb2015 to measure it on average. Will update the result in the same thread. > > --------------- > > Precisely, here are some detailed data about the code size. > > This histogram mentioned above presents all the instructions emitted in a JVM process, shown when exiting. For example, the picture in [4]. > > The second row (RVC instructions) + The third row (4-byte normal instructions) = The fourth row (total instructions); sorted by the fourth row. > If RVC is not enabled, the second row is always 0 and the third row is always equal to the fourth row. > > Tested with the new RVC [1] branch with springboot / springboot-petclinic / SPECjvm2008 / SPECjbb2015(when exiting), the results are all a stable ~84%. The SPECjvm2008 results are at [5]. Please search the keywords "Ideally Code Size Could Shrink to" in the files in the browser for more details. 
>
> P.S.: the result with the special patch [2] is about ~79% at [6] for future references, but might be reserved for now.
>
> Best,
> Xiaolin
>
> [1] https://github.com/zhengxiaolinX/jdk/commits/REBASE-rvc-beautify
> [2] https://github.com/zhengxiaolinX/jdk/commit/3a4d80197da0c497c844016b9a9fbae541eca9c8
> [3] https://github.com/zhengxiaolinX/jdk/commit/5312cbd8ac860f47b109ab2a99750041865c018d
> [4] https://github.com/openjdk/riscv-port/pull/34
> [5] http://cr.openjdk.java.net/~xlinzheng/rvc-size/size/
> [6] http://cr.openjdk.java.net/~xlinzheng/rvc-size/size-full/

Thanks for taking the time to measure those figures :-)
It's great to know that your new proposal for supporting RVC works better with respect to the code size metric.
I am currently looking at the details of your code changes at [1].
I just realized that your work includes some code cleanup in the first two commits.
I would suggest we upstream those code cleanups first if possible.

Regards,
Fei

[1] https://github.com/zhengxiaolinX/jdk/commits/REBASE-rvc-beautify

From yunyao.zxl at alibaba-inc.com Tue Sep 20 12:00:03 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Tue, 20 Sep 2022 20:00:03 +0800
Subject: Re: Re: Discuss the RVC implementation
In-Reply-To:
References: <2d7bbad2-7ade-4b38-91b5-12c4c0a91602.yunyao.zxl@alibaba-inc.com>, <42bdf74a.1322.1834b9400ed.Coremail.yangfei@iscas.ac.cn> <8b9f82ef-6bca-4165-840c-3c06173d2207.yunyao.zxl@alibaba-inc.com>,
Message-ID: <63c8d22d-5703-413a-bb8d-6378fd3da69b.yunyao.zxl@alibaba-inc.com>

Hi Felix,

Thanks for the advice. The main patch is indeed only the one marked "[3]". The "[1]" and "[2]" are not really related to it. So I will do that.

Best,
Xiaolin

From zixian.cai at anu.edu.au Fri Sep 23 08:38:48 2022
From: zixian.cai at anu.edu.au (Zixian Cai)
Date: Fri, 23 Sep 2022 08:38:48 +0000
Subject: Non-zero build crash on kernel 5.17+?
Message-ID:

Hi all,

I found that a non-zero build of jdk-20+16 crashes on Ubuntu 22.10 (kernel 5.19) running on QEMU.
The same build works on Ubuntu 22.04 (kernel 5.15) running on QEMU.
The error message is as follows.
# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc: SuppressErrorAt=/assembler_riscv.cpp:285
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/home/buildbot/worker/build-jdkX-debian10/build/src/hotspot/cpu/riscv/assembler_riscv.cpp:285), pid=907, tid=908
# assert(is_unsigned_imm_in_range(imm64, 47, 0) || (imm64 == (int64_t)-1)) failed: bit 47 overflows in address constant
#
# JRE version: (20.0) (slowdebug build )
# Java VM: OpenJDK 64-Bit Server VM (slowdebug 20-testing-builds.shipilev.net-openjdk-jdk-b212-20220922, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-riscv64)
# Problematic frame:
# V [libjvm.so+0x39f41c] Assembler::movptr_with_offset(Register, unsigned char*, int&)+0x96
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/ubuntu/core.907)
#
# An error report file with more information is saved as:
# /home/ubuntu/hs_err_pid907.log
#
#

Here is the backtrace and local variables seen in gdb.

(gdb) bt
#0 0x00fffffff674941c in Assembler::movptr_with_offset (this=0xfffffff0000e30, Rd=..., addr=0xfffffff71136b8 "9q\006\374\"\370", , offset=@0xfffffff632f00c: 0)
    at /home/buildbot/worker/build-jdkX-debian10/build/src/hotspot/cpu/riscv/assembler_riscv.cpp:284
#1 0x00fffffff6f17c5c in MacroAssembler::call_VM_leaf_base (this=0xfffffff0000e30, entry_point=0xfffffff71136b8 "9q\006\374\"\370", , number_of_arguments=2, retaddr=0x0)
    at /home/buildbot/worker/build-jdkX-debian10/build/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp:568
#2 0x00fffffff6f17da2 in MacroAssembler::call_VM_leaf (this=0xfffffff0000e30, entry_point=0xfffffff71136b8 "9q\006\374\"\370", , arg_0=..., arg_1=...)
    at /home/buildbot/worker/build-jdkX-debian10/build/src/hotspot/cpu/riscv/macroAssembler_riscv.cpp:588
#3 0x00fffffff7222308 in StubGenerator::generate_forward_exception (this=0xfffffff632f1e8)
    at /home/buildbot/worker/build-jdkX-debian10/build/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:546
#4 0x00fffffff7231506 in StubGenerator::generate_initial (this=0xfffffff632f1e8)
    at /home/buildbot/worker/build-jdkX-debian10/build/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3870
#5 0x00fffffff7231956 in StubGenerator::StubGenerator (this=0xfffffff632f1e8, code=0xfffffff632f3c8, phase=0)
    at /home/buildbot/worker/build-jdkX-debian10/build/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3971
#6 0x00fffffff721faa0 in StubGenerator_generate (code=0xfffffff632f3c8, phase=0)
    at /home/buildbot/worker/build-jdkX-debian10/build/src/hotspot/cpu/riscv/stubGenerator_riscv.cpp:3988
#7 0x00fffffff72322c8 in StubRoutines::initialize1 ()
    at /home/buildbot/worker/build-jdkX-debian10/build/src/hotspot/share/runtime/stubRoutines.cpp:228
#8 0x00fffffff72330d2 in stubRoutines_init1 ()
    at /home/buildbot/worker/build-jdkX-debian10/build/src/hotspot/share/runtime/stubRoutines.cpp:389
#9 0x00fffffff6c7823a in init_globals ()
    at /home/buildbot/worker/build-jdkX-debian10/build/src/hotspot/share/runtime/init.cpp:123
#10 0x00fffffff72bcc34 in Threads::create_vm (args=0xfffffff632f7e0, canTryAgain=0xfffffff632f70b)
    at /home/buildbot/worker/build-jdkX-debian10/build/src/hotspot/share/runtime/threads.cpp:570
#11 0x00fffffff6d891ae in JNI_CreateJavaVM_inner (vm=0xfffffff632f838, penv=0xfffffff632f840, args=0xfffffff632f7e0)
    at /home/buildbot/worker/build-jdkX-debian10/build/src/hotspot/share/prims/jni.cpp:3628
#12 0x00fffffff6d893a8 in JNI_CreateJavaVM (vm=0xfffffff632f838, penv=0xfffffff632f840, args=0xfffffff632f7e0)
    at /home/buildbot/worker/build-jdkX-debian10/build/src/hotspot/share/prims/jni.cpp:3714
#13 0x00fffffff7fb1a44 in InitializeJVM (pvm=0xfffffff632f838, penv=0xfffffff632f840, ifn=0xfffffff632f890)
    at /home/buildbot/worker/build-jdkX-debian10/build/src/java.base/share/native/libjli/java.c:1457
#14 0x00fffffff7faef16 in JavaMain (_args=0xffffffffffc0d8)
    at /home/buildbot/worker/build-jdkX-debian10/build/src/java.base/share/native/libjli/java.c:413
#15 0x00fffffff7fb50ea in ThreadJavaMain (args=0xffffffffffc0d8)
    at /home/buildbot/worker/build-jdkX-debian10/build/src/java.base/unix/native/libjli/java_md.c:650
#16 0x00fffffff7ed7450 in start_thread (arg=) at ./nptl/pthread_create.c:442
#17 0x00fffffff7f24ed2 in __thread_start () at ../sysdeps/unix/sysv/linux/riscv/clone.S:85
(gdb) info locals
imm64 = 0xfffffff71136b8
imm = 0xfffffff632efb0
upper = 0xfffffff632efb0
lower = 0xffffff80000000

I suspect the issue is that newer kernels (5.17+) support sv48, which increases the number of address bits that the assembler needs to handle. See the kernel changelog: https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.17

To reproduce the issue, I use the following.

Guest Ubuntu 22.10: https://cdimage.ubuntu.com/ubuntu-server/daily-preinstalled/current/kinetic-preinstalled-server-riscv64+unmatched.img.xz
Guest Ubuntu 22.04: https://cdimage.ubuntu.com/releases/22.04.1/release/ubuntu-22.04.1-preinstalled-server-riscv64+unmatched.img.xz
JDK slowdebug build: https://builds.shipilev.net/openjdk-jdk/openjdk-jdk-linux-riscv64-server-slowdebug-gcc8-glibc2.28.tar.xz (OpenJDK 64-Bit Server VM (slowdebug build 20-testing-builds.shipilev.net-openjdk-jdk-b212-20220922, mixed mode))
QEMU: installed via apt on Ubuntu 22.04 host
QEMU setup: https://wiki.ubuntu.com/RISC-V

Sincerely,
Zixian
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From yunyao.zxl at alibaba-inc.com Fri Sep 23 11:16:28 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Fri, 23 Sep 2022 19:16:28 +0800
Subject: Re: Non-zero build crash on kernel 5.17+?
In-Reply-To:
References:
Message-ID:

Hi Zixian,

The current backend supports sv48 and below only. Please see [1] for more details.

Kernel 5.17 supports sv48 and 5.18 supports sv57. Your address `0xfffffff71136b8` is a 56-bit address, which is not supported by the backend currently.

To bypass this issue, you can try to use kernel 5.17 directly, or check whether QEMU has an option to limit the address space to sv48.

I am not sure whether the backend will support a larger address space soon, since there seems to be no hardware supporting even sv48 yet.

Thanks,
Xiaolin

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L910-L914
-------------- next part --------------
An HTML attachment was scrubbed...
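A minimal sketch of the range check discussed in this thread, for readers who want to reproduce the arithmetic. It assumes, based only on the assert message ("bit 47 overflows in address constant"), that the check requires the address to fit in 47 unsigned bits. The helper names `significant_bits` and `fits_in_unsigned_bits` are hypothetical and are not HotSpot functions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not HotSpot code): number of significant bits in
// `value`, i.e. the position of the highest set bit plus one (0 for 0).
inline int significant_bits(uint64_t value) {
  int bits = 0;
  while (value != 0) {
    ++bits;
    value >>= 1;
  }
  return bits;
}

// Hypothetical helper: true if `value` is representable as an unsigned
// integer of `bits` bits, i.e. no bit at position `bits` or above is set.
inline bool fits_in_unsigned_bits(uint64_t value, int bits) {
  assert(bits > 0 && bits < 64);
  return (value >> bits) == 0;
}
```

With the address from the report, `significant_bits(0xfffffff71136b8ULL)` gives 56, so `fits_in_unsigned_bits(0xfffffff71136b8ULL, 47)` is false, which matches the failing assert.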
URL:

From yunyao.zxl at alibaba-inc.com Fri Sep 23 13:11:57 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Fri, 23 Sep 2022 21:11:57 +0800
Subject: Re: Re: Discuss the RVC implementation
In-Reply-To:
References: <2d7bbad2-7ade-4b38-91b5-12c4c0a91602.yunyao.zxl@alibaba-inc.com>, <42bdf74a.1322.1834b9400ed.Coremail.yangfei@iscas.ac.cn> <8b9f82ef-6bca-4165-840c-3c06173d2207.yunyao.zxl@alibaba-inc.com>,
Message-ID: <9541f786-83a9-4f76-8921-a20a56c6b932.yunyao.zxl@alibaba-inc.com>

I forgot to describe something about MachBranchNodes.

The thing is, C2 needs to calculate node sizes to allocate buffers, so it has a scratch_emit phase to estimate node size first. It uses a clever strategy to measure the size of MachBranchNodes. When estimating the size, only the MachBranchNode itself matters, not the Label. The labels are just tools for generating branch instructions. So there is a "fake label"[1] instead, placed directly at the same pc as the MachBranchNode to simplify the code logic. On other platforms like x86 and aarch64, the size of branch instructions does not change, and these platforms don't have a code size reduction extension like RISC-V's. For example, on other platforms, the jcc is jcc, and the bl is bl.

In our implementation, we have:
```
#define INSN(NAME)                                   \
  void NAME(Register Rd, const int32_t offset) {     \
    /* jal -> c.j */                                 \
    if (do_compress() ...) {                         \
      c_j(offset);                                   \
      return;                                        \
    }                                                \
    _jal(Rd, offset);                                \
  }

  INSN(jal);

#undef INSN
```
The size of an emitted instruction is determined by the `offset`. Though reasonable, it is not compatible with the "fake label" strategy. For example, with the "fake label", the offset is always 0 when scratch-emitting a MachBranchNode. The offset does not match the real offset. Therefore, the size of a MachBranchNode might differ between scratch_emit and the real emission, which breaks the assumption behind C2's strategy.

To emit the code that we want, a basic approach is to pass the real offset into the MachBranchNode and read it instead of the "0" every time.

So currently in these patches, all MachBranchNodes are temporarily incompressible in C2 when RVC is enabled.

Thanks,
Xiaolin
-------------- next part --------------
An HTML attachment was scrubbed...
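The size mismatch described in the MachBranchNode message above can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_jump_size` and `estimate_mismatches` are hypothetical names, the model covers a plain jump with link register x0, and the range test stands in for the condition that the macro elides as "...". The only hard fact used is that c.j encodes a signed, even offset in the range [-2048, 2046].

```cpp
#include <cassert>
#include <cstdint>

// Toy model, not the HotSpot assembler: size in bytes of a jump to
// `offset` with link register x0. The range test is an assumption that
// stands in for the condition elided as "..." in the macro.
inline int toy_jump_size(int32_t offset, bool rvc_enabled) {
  assert(offset % 2 == 0);  // jump offsets are multiples of 2 bytes
  // c.j encodes a signed, even offset in the range [-2048, 2046].
  const bool fits_c_j = offset >= -2048 && offset <= 2046;
  return (rvc_enabled && fits_c_j) ? 2 : 4;
}

// True if the size estimated with the "fake label" (offset 0) differs
// from the size produced at real emission with `real_offset`.
inline bool estimate_mismatches(int32_t real_offset, bool rvc_enabled) {
  return toy_jump_size(0, rvc_enabled) !=
         toy_jump_size(real_offset, rvc_enabled);
}
```

In this model, a scratch emission with offset 0 yields 2 bytes when RVC is on, while a real offset of 4096 yields 4 bytes, so `estimate_mismatches(4096, true)` is true. With RVC off, or with branches kept incompressible as in the patches, both sizes are 4 bytes and the estimate stays valid.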
URL:

From ludovic at rivosinc.com Fri Sep 23 14:05:07 2022
From: ludovic at rivosinc.com (Ludovic Henry)
Date: Fri, 23 Sep 2022 16:05:07 +0200
Subject: Non-zero build crash on kernel 5.17+?
In-Reply-To:
References:
Message-ID:

Hi,

I did run into the same issue locally. Unfortunately, there doesn't seem to be an option to disable sv57 support in QEMU (I couldn't find anything in the sources either). Using an older kernel (5.17) seems to be the only solution for now.

Thanks,
Ludovic
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From yangfei at iscas.ac.cn Sat Sep 24 08:06:46 2022
From: yangfei at iscas.ac.cn (yangfei at iscas.ac.cn)
Date: Sat, 24 Sep 2022 16:06:46 +0800 (GMT+08:00)
Subject: Discuss the RVC implementation
In-Reply-To: <9541f786-83a9-4f76-8921-a20a56c6b932.yunyao.zxl@alibaba-inc.com>
References: <2d7bbad2-7ade-4b38-91b5-12c4c0a91602.yunyao.zxl@alibaba-inc.com>, <42bdf74a.1322.1834b9400ed.Coremail.yangfei@iscas.ac.cn> <8b9f82ef-6bca-4165-840c-3c06173d2207.yunyao.zxl@alibaba-inc.com>, <9541f786-83a9-4f76-8921-a20a56c6b932.yunyao.zxl@alibaba-inc.com>
Message-ID: <351e081f.18355.1836e88da4c.Coremail.yangfei@iscas.ac.cn>

Hi Xiaolin,

Thanks for the explanation. From your code size metrics, I see a very low possibility of compressing those branch instructions (beq/bne). So it looks to me that another option to consider here would be to not compress this sort of instruction at all. Then the case will be simplified and we won't lose much here.

Thanks,
Fei
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From yunyao.zxl at alibaba-inc.com Mon Sep 26 02:54:15 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Mon, 26 Sep 2022 10:54:15 +0800
Subject: Re: Re: Re: Discuss the RVC implementation
In-Reply-To: <351e081f.18355.1836e88da4c.Coremail.yangfei@iscas.ac.cn>
References: <2d7bbad2-7ade-4b38-91b5-12c4c0a91602.yunyao.zxl@alibaba-inc.com>, <42bdf74a.1322.1834b9400ed.Coremail.yangfei@iscas.ac.cn> <8b9f82ef-6bca-4165-840c-3c06173d2207.yunyao.zxl@alibaba-inc.com>, <9541f786-83a9-4f76-8921-a20a56c6b932.yunyao.zxl@alibaba-inc.com>, <351e081f.18355.1836e88da4c.Coremail.yangfei@iscas.ac.cn>
Message-ID:

Hi Felix,

Thank you for the reasonable advice and the time taken for the pre-reviews. I consider this a very nice approach to removing some of the complexity around MachBranchNodes in the first patches. We can certainly revisit them afterward to investigate the effectiveness of compressing MachBranchNodes. So I will do that.

Thanks,
Xiaolin
-------------- next part --------------
An HTML attachment was scrubbed...
From vladimir.kempik at gmail.com Wed Sep 28 14:37:23 2022
From: vladimir.kempik at gmail.com (Vladimir Kempik)
Date: Wed, 28 Sep 2022 17:37:23 +0300
Subject: Issue with llvm compiled jvm
Message-ID: <4795B212-813F-4D48-AEC8-3EC6740F55D0@gmail.com>

Hello

I was playing with clang-compiled hotspot and found an issue in one configuration: clang + sysroot from gcc (i.e. linking with libgcc_s.so.1).

In this combination the __builtin___clear_cache() function calls __clear_cache from libgcc_s.so, which is basically a dummy function doing nothing.

It doesn't happen when using gcc, and it shouldn't happen if clang is used with the compiler-rt libs (where __clear_cache is properly implemented).

It's a compiler bug, but we may want to make a workaround:

#IFDEF llvm THEN (use old style direct call of syscall OR __riscv_flush_icache(..)) ELSE __builtin___clear_cache(..)

Looking for opinions: should we implement a workaround in openjdk or just ignore it?

Regards, Vladimir

From yangfei at iscas.ac.cn Thu Sep 29 09:02:51 2022
From: yangfei at iscas.ac.cn (yangfei at iscas.ac.cn)
Date: Thu, 29 Sep 2022 17:02:51 +0800 (GMT+08:00)
Subject: Discuss the RVC implementation
In-Reply-To: <2d7bbad2-7ade-4b38-91b5-12c4c0a91602.yunyao.zxl@alibaba-inc.com>
References: <2d7bbad2-7ade-4b38-91b5-12c4c0a91602.yunyao.zxl@alibaba-inc.com>
Message-ID: <77e347f0.29ad8.183887bfeb5.Coremail.yangfei@iscas.ac.cn>

Hi Xiaolin,

I happen to have another possible proposal; please consider it. Instead of planting an IncompressibleRegion variable in a code block, we can explicitly choose the normal 4-byte encoding for fixed-length code snippets and for places where code patching could happen.

For example, we have three versions for adding an immediate:
1. '_addi' - 4-byte encoding;
2. 'c_addi' - 2-byte encoding;
3. 'addi' - calls '_addi' or 'c_addi' based on the compress condition;

Then for incompressible code we would use '_addi', which ensures the patching logic receives the 4-byte encoding for adding an immediate. For the other, compressible code we would use 'addi' to benefit from the RVC extension when available.

Then we could eliminate the use of both IncompressibleRegion and CompressibleRegion. It looks to me that this way is fairly straightforward and more readable compared with your current proposal. But I guess we might need some small refactoring of the assembler functions if we go this way.

Thanks,
Fei

-----Original Messages-----
From: "Xiaolin Zheng"
Sent Time: 2022-09-15 10:52:59 (Thursday)
To: riscv-port-dev
Cc:
Subject: Discuss the RVC implementation

Hi team,

I am going to describe a different implementation of RVC for our backend.

## Background

The RISC-V C extension, also known as RVC, can transform 4-byte instructions into 2-byte counterparts when eligible (for example, per the manual, one common requirement is that the Rd/Rs of an instruction fall within [x8, x15]).

## The current implementation in the Hotspot

The current implementation[0] is a transient one. It introduces a "CompressibleRegion", using RAII[1], to indicate that instructions inside these regions can be safely substituted by their RVC counterparts if convertible. In other words, it works in a, say, "whitelist mode": the "CompressibleRegion" is used to manually mark out safe regions, and the instructions inside are then emitted in compressed form where possible.

However, after a deeper look, we find that the current "whitelist mode" has several shortcomings:

## Shortcomings of the current implementation

1. Coverage: The current implementation covers only some of the C2 match rules and only a small part of the stub code, so there is obviously far more room to reduce the total code size. In my observations, some RISC-V instruction sequences generally occupy a bit more space than the AArch64 ones[2].
With the new implementation, we could reach a code size close to that of AArch64's generated code: some cases better, some still worse than AArch64 in my simple observation.

2. Maintainability: Though safe, I'd say it is not at all easy to maintain. The background is that most patchable instructions cannot be easily transformed into their shorter counterparts[3], so they need to be prevented from being compressed. So here comes the problem: we must make sure no patchable relocation is inside the range of a "CompressibleRegion". For example, the string comparison intrinsic function[4] looks very delicious: transforming it and its siblings may result in a yummy compression rate. But programmers might have to check lots of its callees to find out whether there is just one patchable relocation hidden inside that makes the whole intrinsic incompressible. This is an extra burden for programmers, so I bet no one would like to add a "CompressibleRegion" to his/her code :-)

3. Performance: Better performance of generated code is a little side effect this extension gives us: the smaller the code footprint in the I$, the better the performance. Please see Andrew Waterman's paper[5] for more on this. Anyway, it looks like a higher overall compression rate is better for performance.

The main issue here is that the granularity of "CompressibleRegion" is a bit coarse. "Why not exclude the incompressible parts instead?" may come to mind naturally. And after some digging, we find that we just need to exclude a countable number of places that would be patched back (mostly relocations), plus several code slices with a fixed, pre-calculated length, such as "emit_static_call_stub". All remaining instructions can be safely transformed into their RVC counterparts if eligible.

So maybe, say, a "blacklist mode"?

## The new implementation

To implement the "blacklist mode" in the backend, we need two things:
1. an "IncompressibleRegion", indicating that instructions inside it should remain in their normal 4-byte form no matter what happens.
2. a simple strategy to exclude patchable instructions, mainly for relocations.

So the new strategy is tightly bound to the positions of relocations. We all know that "relocate()" in the Hotspot VM is a mark that has an explicit "start point" but no end point, and that some relocated instructions can be patched back. Therefore, we can use a simple strategy: introduce a lambda as another argument to give relocations "end point" semantics, which meets our requirements without extra cost.

For example:

Originally:
```
__ relocate(safepoint_pc.rspec());
__ la(t0, safepoint_pc.target());
__ sd(t0, Address(xthread, JavaThread::saved_exception_pc_offset()));
```

After introducing a simple lambda as an extra argument:
```
__ relocate(safepoint_pc.rspec(), [&] {  // The relocate() hides an "IncompressibleRegion" in it
  __ la(t0, safepoint_pc.target());      // This patchable instruction sequence is incompressible
});
__ sd(t0, Address(xthread, JavaThread::saved_exception_pc_offset()));
```

Well, simple but effective.

Once such countable, dynamically patchable places are excluded and all relocations are unified, all other instructions can be safely transformed without messing up the current code style. Programmers can keep writing in the same style; most of the time they need not care whether RVC exists, and things get converted automatically.

The proposed new sample code is, again, here[6].

## Other things worth noting

1. Instruction patching issues

With the C extension, the backend mixes 2-byte and 4-byte instructions; it gets a little CISC-like. We know that Hotspot patches instructions while code is running at full speed, such as call instructions and the nops used for deoptimization (the nops at the entry points, and the post-call nops after loom).
Instruction patching is delicate, so we must handle such places carefully to keep these 4-byte instructions from spanning cache lines. Though they remain in the normal 4-byte form even with RVC, they might sit at an address that is only 2-byte aligned. Such cases must not happen, because patching an instruction that spans cache lines would lose atomicity. So, in short, we must properly align them, as in [7][8]. This problem can exist with RVC in either "whitelist mode" or "blacklist mode"; it is a general problem for instruction patching. I will add stronger assertions at the potential places (trampoline_call might be a very good spot, for patchable "static_call", "opt_virtual" and "virtual" relocations) to check alignment in future patches.

2. MachBranch nodes

MachBranch nodes are not easy to tame because the "fake label"[9] in PhaseOutput::scratch_emit_size() cannot tell us the real distance to the label. But we can leave them out of this discussion, for there will be patches to handle them afterward.

That's nearly all. Thanks for reading this far despite the verbosity. It would be very nice to receive any suggestions.

Best,
Xiaolin

[0] Original patch: https://github.com/openjdk/riscv-port/pull/34
[1] Of course, the "CompressibleRegion" is good, I like it; and this idea is not from myself.
[2] For a simple example, the very commonly used fixed-length movptr() takes six 4-byte instructions (lui+addi+slli+addi+slli+addi, MIPS-like instructions using arithmetic with sign extension, but not anyone's fault :-) ), while the AArch64 counterpart takes only three 4-byte instructions (movz+movk+movk). Both move a 48-bit immediate. Accumulated over a whole program, the size differs quite a lot.
[3] 2-byte instructions have fewer bits, and therefore shorter immediate encodings etc. than their 4-byte counterparts. If we transform patchable instructions (the ones at the marks of patchable relocations, etc.) into 2-byte ones and they are later patched to a larger value or a farther distance, the shorter instructions may sadly be unable to hold the newly patched value. So we need to exclude patchable instructions (at the relocation marks etc.) from being compressed.
[4] https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/riscv.ad#L10032-L10035
[5] https://digitalassets.lib.berkeley.edu/etd/ucb/text/Waterman_berkeley_0028E_15908.pdf, Page 64: "5.4 The RVC Extension, Performance Implications"
[6] https://github.com/zhengxiaolinX/jdk/tree/REBASE-rvc-beautify
[7] https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/riscv.ad#L9873
[8] https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp#L1348-L1353
[9] https://github.com/openjdk/jdk/blob/211fab8d361822bbd1a34a88626853bf4a029af5/src/hotspot/share/opto/output.cpp#L3331-L3340

From yunyao.zxl at alibaba-inc.com Fri Sep 30 10:25:12 2022
From: yunyao.zxl at alibaba-inc.com (Xiaolin Zheng)
Date: Fri, 30 Sep 2022 18:25:12 +0800
Subject: Re: Discuss the RVC implementation
In-Reply-To: <77e347f0.29ad8.183887bfeb5.Coremail.yangfei@iscas.ac.cn>
References: <2d7bbad2-7ade-4b38-91b5-12c4c0a91602.yunyao.zxl@alibaba-inc.com>, <77e347f0.29ad8.183887bfeb5.Coremail.yangfei@iscas.ac.cn>
Message-ID: <3d27d06d-6cac-44d2-90b7-15b4ebb07ddd.yunyao.zxl@alibaba-inc.com>

Hi Felix,

Thank you for taking the time to consider this, and for the discussion.

I think it is certainly a fairly good observation: the three versions can, in combination, theoretically cover any case at instruction-level granularity.
But in practice, I have some personal experience to share: this might be too fine-grained to implement high-level control. Please let me explain. Correctness aside, there are also code style and maintenance that we have to take care of.

For example, if we rewrite one piece of fixed-length code[1] at instruction-level granularity by removing the `IncompressibleRegion`, it might become [2]. Please see my comments in that gist.

1. From the code style aspect: We can see it does not look so promising. In fact, my RVC prototype worked exactly the way you suggest (so I guess it is an intuitive and general thought :-) ), at instruction-level granularity, and I sadly found the code style messy even to myself. We have to overload lots of things, such as _ld(Register, Address), _ld(Register, address) (see my comments) and so on, to cover every usage in an incompressible piece of code: the overall API changes (like _ld in every form) do not converge. In the comments in the gist, we can see that we would have to make all the callees incompressible, and the callees of the callees, and so on, transitively. For example, the 'la(Register, Address)' API itself must be incompressible if we work at instruction granularity, so we have to make its callee, the 'la(Register, address)' API, incompressible as well, and so on. It might indeed be an inferno...

2. From the compression rate aspect: Besides, we are only talking about la() here. If we directly mark la() as incompressible, then the la() calls made by code that is actually safe and compressible are left incompressible forever, and the compression rate is definitely lower. The main issue is, of course, the granularity: instruction-level granularity is too fine-grained to allow high-level control.
The current `CompressibleRegion` combined with `IncompressibleRegion` gives function-level granularity (neither too fine nor too coarse), which I think suits the current backend very well: used together, they can mark everything with little effort and in one place (as in the current implementation: the unified relocate() with a lambda[3] and an IncompressibleRegion hidden inside). With both, we avoid the above problems with no effort. Please see the first line of [1]: the incompressible region directly controls the current function, making THE 'la' it currently uses incompressible without affecting the 'la' definitions themselves (likewise for movptr, ld, ...). So we avoid lots of invasive changes to the current backend code base. Nice, right?

3. From the maintenance aspect: Explicitly adding '_' to every incompressible instruction might be a burden for developers and porters. One may say: it is just adding some '_'s, why a burden? Well, suppose we are again porting code like [1] from the AArch64 port. We not only have to change the instructions to RISC-V's, but also have to consider RVC... does an instruction need the '_' or not? Do its callees, and its callees' callees, have an incompressible version? Even for me that would be a heavy burden :-) I would find it very troublesome; I just want to ctrl+c and ctrl+v some code without further confusion. So, why not simply throw an `IncompressibleRegion` into that fixed-length stub, so that programmers can write their code normally with the usual "ld", "la" and "addi"? Everything is easily solved without caring about the trifles :-)

I am just sharing experience from having followed the same thought, and I might be verbose again: some things are not easy to foresee at a glance, and the pitfalls only become obvious when implementing.

From my personal perspective, the CompressibleRegion/IncompressibleRegion plan looks better, and I am looking forward to your views and suggestions.
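To make the region-based control above a bit more concrete, here is a minimal sketch of a scoped "incompressible" guard and a relocate() that takes a callback. It is a toy model under stated assumptions, not the HotSpot API: `ToyAssembler`, its `addi`, `relocate`, `code_size` and the helper `demo_size` are hypothetical, and only instruction sizes are recorded. The short-form condition follows the c.addi rules (rd == rs1, rd != x0, non-zero 6-bit signed immediate).

```cpp
#include <cstdint>
#include <vector>

// Toy assembler (hypothetical): records only the size of each instruction.
class ToyAssembler {
 public:
  // Scoped guard: while it is alive, instructions keep their 4-byte form.
  class IncompressibleRegion {
   public:
    explicit IncompressibleRegion(ToyAssembler& a)
        : _a(a), _saved(a._compressible) {
      _a._compressible = false;
    }
    ~IncompressibleRegion() { _a._compressible = _saved; }
    IncompressibleRegion(const IncompressibleRegion&) = delete;
    IncompressibleRegion& operator=(const IncompressibleRegion&) = delete;

   private:
    ToyAssembler& _a;
    bool _saved;
  };

  // One 'addi' entry point: picks the 2-byte form only when allowed.
  void addi(int rd, int rs, int32_t imm) {
    bool short_form = _compressible && rd == rs && rd != 0 &&
                      imm != 0 && imm >= -32 && imm <= 31;
    _sizes.push_back(short_form ? 2 : 4);
  }

  // relocate() with a callback: the callback body is the patchable part
  // and is emitted inside an IncompressibleRegion.
  template <typename F>
  void relocate(F emit_patchable) {
    IncompressibleRegion ir(*this);
    emit_patchable();
  }

  int code_size() const {
    int total = 0;
    for (int s : _sizes) total += s;
    return total;
  }

 private:
  std::vector<int> _sizes;
  bool _compressible = true;
};

// The same addi call compresses outside the region and not inside it.
int demo_size() {
  ToyAssembler masm;
  masm.addi(10, 10, 8);                          // 2 bytes
  masm.relocate([&] { masm.addi(10, 10, 8); });  // 4 bytes
  masm.addi(10, 10, 8);                          // 2 bytes
  return masm.code_size();
}
```

The point of the sketch is that callers keep using the plain `addi`; the guard, rather than a separate `_addi` spelling at each call site, decides which encoding is produced.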
Best,
Xiaolin

[1] https://github.com/zhengxiaolinX/jdk/blob/2ee3204ace5a7767482819be2240982cc0744f8c/src/hotspot/cpu/riscv/gc/shared/barrierSetAssembler_riscv.cpp#L196-L275
[2] https://gist.github.com/zhengxiaolinX/3151db356a9001f58827d272c8330bb7
[3] https://github.com/zhengxiaolinX/jdk/blob/2ee3204ace5a7767482819be2240982cc0744f8c/src/hotspot/cpu/riscv/assembler_riscv.hpp#L2167-L2178
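The alignment concern raised in the original proposal (a patchable 4-byte instruction that ends up on a 2-byte boundary once RVC is in use) can be sketched with two small helpers. This is only an illustration: `spans_cache_line` and `padding_for_patching` are hypothetical names, and the 64-byte line size used in the checks is an assumption, not a property of any particular core.

```cpp
#include <cstdint>

// Size of a normal (uncompressed) RISC-V instruction in bytes.
constexpr uint64_t kInsnSize = 4;

// True if a 4-byte instruction starting at pc crosses a cache-line boundary.
constexpr bool spans_cache_line(uint64_t pc, uint64_t line_size) {
  return pc / line_size != (pc + kInsnSize - 1) / line_size;
}

// Bytes of padding needed before pc so that a patchable instruction is
// 4-byte aligned. With RVC, pc is 2-byte aligned, so this is 0 or 2
// (one 2-byte nop).
constexpr uint64_t padding_for_patching(uint64_t pc) {
  return (kInsnSize - pc % kInsnSize) % kInsnSize;
}

// Assumed 64-byte line: an instruction at offset 62 would straddle two
// lines; after 2 bytes of padding it starts at 64 and no longer does.
static_assert(spans_cache_line(62, 64), "62..65 crosses the boundary");
static_assert(padding_for_patching(62) == 2, "one 2-byte nop is enough");
static_assert(!spans_cache_line(62 + padding_for_patching(62), 64),
              "aligned instruction stays within one line");
```

A 4-byte-aligned instruction never crosses a boundary as long as the line size is a multiple of 4, which is why an alignment assertion at the patch sites is a sufficient check in this model.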