Re: Discuss the RVC implementation

Fri Sep 30 10:25:12 UTC 2022

Hi Felix,
Thank you for taking the time to consider this, and the discussions.
I think it is certainly a good observation that the three versions can, in combination, theoretically cover any case at instruction-level granularity. In practice, though, I have some experience to share: that granularity might be too fine to support high-level control. Please let me explain.
Correctness aside, we also have to pay attention to code style and maintenance. For example, if we rewrite one piece of fixed-length code[1] at instruction-level granularity by removing the `IncompressibleRegion`, it might become [2]. Please see my comments in that gist.
1. From the code style aspect:
We can see it does not look so promising. In fact, my RVC prototype worked in exactly the way you suggest (so I guess it is an intuitive and general idea :-) ), at instruction-level granularity. Sadly, I found the code style messy even to myself. We have to overload lots of things such as _ld(Register, Address), _ld(Register, address) (see my comments) and so on to cover every usage in an incompressible piece of code: the overall API changes (like _ld in any form) do not converge.
From the comments in the gist, we can see that we have to make all the callees incompressible, and the callees of the callees, and so on transitively. For example, the 'la(Register, Address)' API itself must be incompressible if we work at instruction granularity, so its callee, the 'la(Register, address)' API, must be incompressible as well, and so on. It might indeed be an inferno...
2. From the compression rate aspect:
Besides, we are only talking about la() here. If we directly mark la() as incompressible, then the la() calls made from safe, compressible code will stay incompressible forever, and the compression rate will definitely be lower. The main issue is, of course, granularity: instruction-level granularity is too fine-grained to allow high-level control.
The current `CompressibleRegion` combined with `IncompressibleRegion` gives function-level granularity (neither too fine nor too coarse), which I think suits the current backend well: we can use the two together to mark everything without much effort and in one place (like the current implementation: the unified relocate() with a lambda[3] and an IncompressibleRegion hidden inside). With both, we avoid the above problems with little effort. Please see the first line of [1]: the incompressible region directly controls the current function, marking THE 'la' it currently uses as incompressible without affecting the 'la' definitions themselves (the same goes for movptr, ld, ...). So we avoid lots of intrusive changes to the current backend code base. Nice, right?
3. From the maintenance aspect:
Explicitly adding '_' to every instruction that must not be compressed might be a burden for developers and porters. One may ask: it is just a few '_'s, so why a burden? Consider porting code like [1] from the AArch64 port again. We not only have to change the instructions to RISC-V ones, but also have to think about RVC: does an instruction need a '_' or not? Do its callees, and even its callees' callees, have an incompressible version? Even for me it might be a heavy burden :-) I would find it very troublesome; I may just want to ctrl+c and ctrl+v some code without further confusion. So, why not simply put an `IncompressibleRegion` in that fixed-length stub, so that programmers can write their code as usual with the normal "ld", "la" and "addi"? Everything is solved without caring about the trifles :-)
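To make the function-level idea concrete, here is a minimal sketch, assuming a toy assembler that only counts emitted bytes. `ToyAssembler`, `bump` and `fixed_length_stub` are made-up names for illustration, and the guard below only imitates what an `IncompressibleRegion` is described as doing; it is not the code from [1] or [3].

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy stand-in for an assembler; NOT HotSpot's real Assembler class.
class ToyAssembler {
 public:
  bool compressible() const { return _compressible; }
  void set_compressible(bool v) { _compressible = v; }
  std::size_t size() const { return _size; }

  // "addi rd, rs, imm": counts 2 bytes when compression is allowed and the
  // operands fit c.addi (rd == rs, rd != x0, imm != 0, imm in [-32, 31]),
  // otherwise the normal 4 bytes.
  void addi(int rd, int rs, std::int32_t imm) {
    assert(imm >= -2048 && imm <= 2047);  // 12-bit signed immediate
    bool fits_c_addi =
        rd == rs && rd != 0 && imm != 0 && imm >= -32 && imm <= 31;
    _size += (_compressible && fits_c_addi) ? 2 : 4;
  }

 private:
  bool _compressible = true;  // compress unless told otherwise
  std::size_t _size = 0;      // only the byte count is tracked
};

// Scope guard: everything emitted while it is alive stays 4 bytes wide.
class IncompressibleRegion {
 public:
  explicit IncompressibleRegion(ToyAssembler& masm)
      : _masm(masm), _saved(masm.compressible()) {
    _masm.set_compressible(false);
  }
  ~IncompressibleRegion() { _masm.set_compressible(_saved); }
  IncompressibleRegion(const IncompressibleRegion&) = delete;
  IncompressibleRegion& operator=(const IncompressibleRegion&) = delete;

 private:
  ToyAssembler& _masm;
  bool _saved;
};

// A shared helper, written once with the plain "addi".
inline void bump(ToyAssembler& masm) { masm.addi(10, 10, 8); }

// A fixed-length stub: the guard pins the helper's instructions to 4 bytes
// without changing the helper itself.
inline void fixed_length_stub(ToyAssembler& masm) {
  IncompressibleRegion ir(masm);
  bump(masm);
  bump(masm);
}
```

In this sketch `bump()` costs 2 bytes when called from ordinary code and 4 bytes when called from `fixed_length_stub()`, so the helper needs neither a '_' variant nor an incompressible twin.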
I am just sharing some experience from having followed the same idea, and I might be verbose again; there are things that are not easy to foresee at a glance, and the pitfalls only become obvious when implementing. From my personal perspective, the CompressibleRegion plan looks better, and I am looking forward to your views and suggestions.
Best,
Xiaolin
[1] https://github.com/zhengxiaolinX/jdk/blob/2ee3204ace5a7767482819be2240982cc0744f8c/src/hotspot/cpu/riscv/gc/shared/barrierSetAssembler_riscv.cpp#L196-L275
[2] https://gist.github.com/zhengxiaolinX/3151db356a9001f58827d272c8330bb7
[3] https://github.com/zhengxiaolinX/jdk/blob/2ee3204ace5a7767482819be2240982cc0744f8c/src/hotspot/cpu/riscv/assembler_riscv.hpp#L2167-L2178
------------------------------------------------------------------
From:yangfei <yangfei at iscas.ac.cn>
Send Time:2022-09-29 (Thursday) 17:02
To:郑孝林(云矅) <yunyao.zxl at alibaba-inc.com>
Cc:riscv-port-dev <riscv-port-dev at openjdk.org>
Subject:Re: Discuss the RVC implementation
Hi Xiaolin,
I happened to have another possible proposal, please consider.
Instead of planting an IncompressibleRegion variable in a code block, we can explicitly
choose to use the normal 4-byte encoding instructions for fixed-length code snippet or
in places where code patching could happen.
For example, we have three versions for adding immediate:
 1. '_addi' - 4-byte encoding;
 2. 'c_addi' - 2-byte encoding;
 3. 'addi' - Call '_addi' or 'c_addi' based on compress condition;
Then for the incompressible code, we would use '_addi' so we are ensuring the patching
logic will receive 4-byte encoding for adding immediate. But for the other compressible
code, we would use 'addi' to benefit from RVC extension when available. Then we could
eliminate use of both IncompressibleRegion and CompressibleRegion.
It looks to me that this way will be fairly straightforward and more readable compared
with your current proposal. But I guess we might need some small refactoring for the
assembler functions if we go this way. 
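
For readers following the thread, the three-variant scheme could be sketched roughly as below. This is only an illustration under assumptions: `SketchAssembler`, the `use_rvc` switch and `can_compress_addi` are hypothetical names, and only byte counts are modelled rather than real encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of the three-variant scheme; not HotSpot code.
class SketchAssembler {
 public:
  explicit SketchAssembler(bool use_rvc) : _use_rvc(use_rvc) {}
  std::size_t size() const { return _size; }

  // c.addi needs rd == rs, rd != x0, and a non-zero 6-bit signed immediate.
  static bool can_compress_addi(int rd, int rs, std::int32_t imm) {
    return rd == rs && rd != 0 && imm != 0 && imm >= -32 && imm <= 31;
  }

  // 1. Always the 4-byte encoding; meant for patchable or fixed-length code.
  void _addi(int rd, int rs, std::int32_t imm) {
    assert(imm >= -2048 && imm <= 2047);  // 12-bit signed immediate
    (void)rd;
    (void)rs;
    _size += 4;
  }

  // 2. Always the 2-byte encoding; the caller must have checked eligibility.
  void c_addi(int rd, std::int32_t imm) {
    assert(can_compress_addi(rd, rd, imm));
    _size += 2;
  }

  // 3. Picks the short form when RVC is on and the operands allow it.
  void addi(int rd, int rs, std::int32_t imm) {
    if (_use_rvc && can_compress_addi(rd, rs, imm)) {
      c_addi(rd, imm);
    } else {
      _addi(rd, rs, imm);
    }
  }

 private:
  bool _use_rvc;
  std::size_t _size = 0;
};
```

With this shape, patchable code would call `_addi` directly and always get 4 bytes, while ordinary code would call `addi` and get 2 bytes whenever the operands qualify.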
Thanks,
Fei
 -----Original Messages-----
From:"Xiaolin Zheng" <yunyao.zxl at alibaba-inc.com>
Sent Time:2022-09-15 10:52:59 (Thursday)
To: riscv-port-dev <riscv-port-dev at openjdk.org>
Cc: 
Subject: Discuss the RVC implementation
 Hi team, 
 I am going to describe a different implementation of RVC for our backend. 
 ## Background
 The RISC-V C extension, also known as RVC, can replace 4-byte instructions with 2-byte counterparts when eligible (for example, according to the manual, one common requirement is that the Rd/Rs of the instruction fall in the range [x8,x15]).
 ## The current implementation in the Hotspot 
 The current implementation[0] is a transitional one. It introduces a "CompressibleRegion", an RAII[1] scope object, to indicate that instructions inside such regions can safely be replaced by their RVC counterparts when convertible. In other words it works in a, say, "whitelist mode": the "CompressibleRegion" is used to manually mark out safe regions, whose instructions are then emitted in compressed form in batch where possible. However, after a deeper look, we find that the current "whitelist mode" has several shortcomings: 
 ## Shortcomings of the current implementation 
 1. Coverage: 
 The current implementation only covers some of the C2 match rules and a small part of the stub code, so there is obviously much more room to reduce the total code size. In my observation, some RISC-V instruction sequences generally occupy a bit more space than their AArch64 equivalents[2]. With the new implementation, we could reach a code size similar to that of AArch64's generated code: in my simple observation, some cases come out better and some still worse than AArch64. 
 2. Maintenance: 
 Though safe, I'd say it is not easy to maintain. The background is that most patchable instructions cannot easily be transformed into their shorter counterparts[3], so they must be prevented from being compressed. So here is the problem: we must make sure no patchable relocation sits inside the range of a "CompressibleRegion". For example, the string comparison intrinsic function[4] looks very delicious: transforming it and its siblings may yield a yummy compression rate. But programmers might have to check many of its callees to find out whether just one patchable relocation is hidden inside, which would make the whole intrinsic incompressible. This puts an extra burden on programmers, so I bet no one would like to add a "CompressibleRegion" to their code :-) 
 3. Performance: 
 Better performance of the generated code is a small side effect of this extension: the smaller the code footprint in the I$, the better the performance. Please see Andrew Waterman's paper[5] for more details. In any case, a higher overall compression rate looks better for performance. 
 The main issue here is that the granularity of "CompressibleRegion" is a bit coarse. The question "why not exclude the incompressible parts instead?" comes up naturally. After some digging, we find that we only need to exclude a limited number of places that will be patched later (mostly relocations), plus several code snippets with a fixed, calculated length, such as "emit_static_call_stub". All remaining instructions can safely be transformed into their RVC counterparts if eligible. So maybe, say, a "blacklist mode"? 
 ## The new implementation 
 To implement the "blacklist mode" in the backend, we need two things: 
 1. an "IncompressibleRegion", indicating instructions inside it should remain in their normal 4-byte form no matter what happens. 
 2. a simple strategy to exclude patchable instructions, mainly for relocations. So the new strategy is closely tied to the positions of relocations: 
 We all know that "relocate()" in the Hotspot VM is a mark that has an explicit "start point" but no end point, and that some relocations can be patched later. Therefore, we can use a simple strategy: pass a lambda as an extra argument to give the relocation "end point" semantics, which meets our requirements at no extra cost. For example: 
 Originally: 
 ``` 
 __ relocate(safepoint_pc.rspec()); 
 __ la(t0, safepoint_pc.target()); 
 __ sd(t0, Address(xthread, JavaThread::saved_exception_pc_offset())); 
 ``` 
 After introducing a simple lambda as an extra argument: 
 ``` 
 __ relocate(safepoint_pc.rspec(), [&] { // The relocate() hides an "IncompressibleRegion" in it 
   __ la(t0, safepoint_pc.target()); // This patchable instruction sequence is incompressible 
 }); 
 __ sd(t0, Address(xthread, JavaThread::saved_exception_pc_offset())); 
 ``` 
 Well, simple but effective. By excluding this limited number of dynamically patchable places and unifying all relocations, all other instructions can be safely transformed without messing up the current code style. Programmers can keep writing in the same style; most of the time they do not need to care whether RVC exists, and things get converted automatically. The proposed new sample code is, again, here[6]. 
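
As a rough illustration of how a relocate() overload taking a lambda could give the mark an end point, here is a minimal toy model. `ToyMasm`, `RelocSpec` and `emit_compressible_insn` are hypothetical placeholders, and the real implementation in [6] may well differ; only byte offsets are tracked.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: these types are placeholders, not HotSpot's real classes.
struct RelocSpec {
  int type;
};

class ToyMasm {
 public:
  std::size_t offset() const { return _offset; }
  const std::vector<std::size_t>& reloc_marks() const { return _marks; }

  // Original form: records a start mark only.
  void relocate(const RelocSpec& rspec) {
    (void)rspec;
    _marks.push_back(_offset);
  }

  // Extended form: the callable delimits the patchable sequence, and
  // compression is switched off for exactly that extent.
  template <typename Emit>
  void relocate(const RelocSpec& rspec, Emit emit) {
    relocate(rspec);
    // Local guard playing the role of the hidden IncompressibleRegion.
    struct Restore {
      bool& flag;
      bool saved;
      ~Restore() { flag = saved; }
    } restore{_compressible, _compressible};
    _compressible = false;
    emit();
  }

  // Stand-in for any instruction that has a 2-byte counterpart.
  void emit_compressible_insn() { _offset += _compressible ? 2 : 4; }

 private:
  bool _compressible = true;
  std::size_t _offset = 0;
  std::vector<std::size_t> _marks;
};
```

In this model, instructions emitted inside the lambda always take 4 bytes, and compression resumes as soon as the lambda returns, so the code after the relocation needs no extra marking.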
 ## Other things worth noting 
 1. Instruction patching issues
 With the C extension, the backend mixes 2-byte and 4-byte instructions, so it becomes a little CISC-like. We know that Hotspot patches instructions while code is running at full speed, such as call instructions and the nops used for deoptimization (the nops at entry points, and the post-call nops after Loom). Instruction patching is delicate, so we must handle such places carefully and keep these 4-byte instructions from spanning cache lines. Although they remain in the normal 4-byte form even with RVC, they might sit at an address that is only 2-byte aligned. That must not happen, because patching an instruction that spans cache lines would lose atomicity. In short, we must align them properly, as in [7][8]. This problem exists with RVC in either "whitelist mode" or "blacklist mode"; it is a general problem of instruction patching. In future patches I will add stronger assertions at the potential places (trampoline_call might be a very good spot, for the patchable "static_call", "opt_virtual" and "virtual" relocations) to check alignment. 
 2. MachBranch Nodes 
 As for MachBranch nodes: they are not easy to tame, because the "fake label"[9] in PhaseOutput::scratch_emit_size() cannot tell us the real distance to the label. But we can leave them out of this discussion, since later patches will handle them. 
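
Going back to the alignment concern in item 1, the check itself is simple arithmetic. The helpers below are a sketch with made-up names, and the 64-byte cache line used in the usage note is an assumption for illustration, not a value taken from the port.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Size of a normal (uncompressed) RISC-V instruction in bytes.
constexpr std::size_t kInsnSize = 4;

// True when a 4-byte instruction starting at `addr` lies entirely within
// one cache line of `line_size` bytes.
constexpr bool within_one_line(std::uintptr_t addr, std::size_t line_size) {
  return addr / line_size == (addr + kInsnSize - 1) / line_size;
}

// Bytes of padding needed before `addr` to reach 4-byte alignment. With RVC
// the offset is always 2-byte aligned, so the result is then 0 or 2.
constexpr std::size_t padding_to_align4(std::uintptr_t addr) {
  return (kInsnSize - addr % kInsnSize) % kInsnSize;
}
```

For example, with a 64-byte line a 4-byte instruction at offset 62 would straddle two lines; `padding_to_align4(62)` asks for 2 bytes of padding, which moves it to offset 64. A 4-byte-aligned 4-byte instruction cannot cross a line whose size is a multiple of 4.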
 That's nearly all. Thanks for reading this far despite the verbosity. It would be very nice to receive any suggestions. 
 Best, 
 Xiaolin 
 [0] Original patch: https://github.com/openjdk/riscv-port/pull/34
 [1] Of course, the "CompressibleRegion" is good and I like it; the idea is not my own.
 [2] As a simple example, the very commonly used fixed-length movptr() takes six 4-byte instructions (lui+addi+slli+addi+slli+addi, MIPS-like instructions that use arithmetic with sign extension, but that is not anyone's fault :-) ), while the AArch64 counterpart takes only three 4-byte instructions (movz+movk+movk). Both load a 48-bit immediate. Added up, the sizes differ quite a lot.
 [3] 2-byte instructions have fewer bits, and therefore shorter immediate encodings etc., than their 4-byte counterparts. If we transform patchable instructions (those at the marks of patchable relocations, etc.) into 2-byte ones and they are later patched to a larger value or a farther distance, the shorter instructions may sadly find that they cannot hold the newly patched value. So we need to exclude patchable instructions (at the relocation marks etc.) from being compressed.
 [4] https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/riscv.ad#L10032-L10035
 [5] https://digitalassets.lib.berkeley.edu/etd/ucb/text/Waterman_berkeley_0028E_15908.pdf, Page 64: "5.4 The RVC Extension, Performance Implications"
 [6] https://github.com/zhengxiaolinX/jdk/tree/REBASE-rvc-beautify
 [7] https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/riscv.ad#L9873
 [8] https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp#L1348-L1353
 [9] https://github.com/openjdk/jdk/blob/211fab8d361822bbd1a34a88626853bf4a029af5/src/hotspot/share/opto/output.cpp#L3331-L3340