<div class="__aliyun_email_body_block"><div  style="line-height:1.7;font-family:Tahoma,Arial,STHeiti,SimSun;font-size:14.0px;color:#000000;"><div  style="clear:both;"><span >Hi team,</span></div><div  style="clear:both;"><br ><div  style="clear:both;">Following up on the performance data on SPECjbb2015 (composite mode) for RVC.</div><div  style="clear:both;"><br ></div><div  style="clear:both;">Over the past few days I have been collecting performance data on SPECjbb2015.</div><div  style="clear:both;"><br ></div><div  style="clear:both;">In short, there seems to be a (maybe) 1.5%~2.5% performance gain for mutators in some runs, which seems plausible because it is consistent with the results I have observed. Results are at [0].</div><div  style="clear:both;"><br ></div><div  style="clear:both;">I say "maybe" because the SPECjbb2015 results on my HiFive Unmatched board fluctuate by about ±5%, which I think is normal too. So the question is whether the apparent performance gain is statistically valid or not.</div><div  style="clear:both;"><br ></div><div  style="clear:both;">For convenience, I wrote a simple program to calculate the average max-JOPS results.</div><div  style="clear:both;"><br ></div><div  style="clear:both;">(There is an earlier result from predecessors who evaluated the "whitelist mode" implementation of RVC, see [1]; I follow a similar style.)</div><div  style="clear:both;"><br ></div><div  style="clear:both;">Let us define:</div><div  style="clear:both;">[A] The RVC branch at [2], but without the histogram patch</div><div  style="clear:both;">[B] The simple unaligned access patch at [3], included because I am interested in the effect of unaligned access (though judging from the results below, it behaves normally and shows nothing special)</div><div  style="clear:both;"><br ></div><div  style="clear:both;">1. 
[A] + [B] + g1</div><div  style="clear:both;"><a  href="http://cr.openjdk.java.net/~xlinzheng/rvc-size/performance-specjbb2015/g1.1.jpg" target="_blank">http://cr.openjdk.java.net/~xlinzheng/rvc-size/performance-specjbb2015/g1.1.jpg</a></div><div  style="clear:both;">Mutators seem to have a 1.69% gain; the two-tailed T.TEST result shows a confidence level of only 62%.</div><div  style="clear:both;"><br ></div><div  style="clear:both;">2. [A] + [B] + parallel gc</div><div  style="clear:both;"><a  href="http://cr.openjdk.java.net/~xlinzheng/rvc-size/performance-specjbb2015/parallel.1.jpg" target="_blank">http://cr.openjdk.java.net/~xlinzheng/rvc-size/performance-specjbb2015/parallel.1.jpg</a></div><div  style="clear:both;">There seems to be a 2.62% gain; the T.TEST result shows a confidence level of 98.6%, so this one seems okay.</div><div  style="clear:both;"><br ></div><div  style="clear:both;">3. [A] + g1</div><div  style="clear:both;"><a  href="http://cr.openjdk.java.net/~xlinzheng/rvc-size/performance-specjbb2015/g1.2.jpg" target="_blank">http://cr.openjdk.java.net/~xlinzheng/rvc-size/performance-specjbb2015/g1.2.jpg</a></div><div  style="clear:both;">This looks like a 3.64% gain? Excel shows a confidence level of 99.1%, but I doubt it a bit because I have never observed such data before. The last few max-JOPS values are too high to be considered normal. I guess my board was behaving unusually at that time, so I have reservations about this result.</div><div  style="clear:both;"><br ></div><div  style="clear:both;">4. [A] + parallel gc</div><div  style="clear:both;"><a  href="http://cr.openjdk.java.net/~xlinzheng/rvc-size/performance-specjbb2015/parallel.2.jpg" target="_blank">http://cr.openjdk.java.net/~xlinzheng/rvc-size/performance-specjbb2015/parallel.2.jpg</a></div><div  style="clear:both;">I only started this run yesterday, so there are not enough samples yet. 
Accordingly, I did not drop the lowest/highest results this time.</div><div  style="clear:both;">It shows a 1.5% gain, though the confidence level is only 70% (maybe the sample size is not big enough).</div><div  style="clear:both;"><br ></div><div  style="clear:both;">All of this was evaluated on a stock HiFive Unmatched board, which (seemingly) we all have, so I guess the results should be reproducible.</div><div  style="clear:both;"><br ></div><div  style="clear:both;">Although I believe that, in theory, the generated code should perform better thanks to the potential "I-cache enlargement" that comes from RVC's code size reduction, I think that for now having no regression is enough for this feature. Any performance gain from RVC is a bonus to me (or to us), so this post just shows some evaluations to follow up on the performance aspect mentioned weeks ago.</div><div  style="clear:both;"><br ></div><div  style="clear:both;">Accordingly, I am going to submit a PR for the remaining part of RVC (to implement the "blacklist mode").</div><div  style="clear:both;"><br ></div><div  style="clear:both;">Thanks,</div><div  style="clear:both;">Xiaolin</div><div  style="clear:both;"><br ></div><div  style="clear:both;">[0] <a  href="http://cr.openjdk.java.net/~xlinzheng/rvc-size/performance-specjbb2015/" target="_blank">http://cr.openjdk.java.net/~xlinzheng/rvc-size/performance-specjbb2015/</a></div><div  style="clear:both;">[1] <a  href="https://mail.openjdk.org/pipermail/riscv-port-dev/2022-September/000629.html" target="_blank">https://mail.openjdk.org/pipermail/riscv-port-dev/2022-September/000629.html</a></div><div  style="clear:both;">[2] <a  href="https://github.com/zhengxiaolinX/jdk/commits/REBASE-rvc-beautify-histogram" target="_blank">https://github.com/zhengxiaolinX/jdk/commits/REBASE-rvc-beautify-histogram</a></div><span >[3] <a  href="https://github.com/zhengxiaolinX/jdk/commit/f9e28e72ce1ac51b3da1a501e8ea33eaf076c343" 
target="_blank">https://github.com/zhengxiaolinX/jdk/commit/f9e28e72ce1ac51b3da1a501e8ea33eaf076c343</a></span></div><div  style="clear:both;"><br /></div><blockquote  style="margin-right:0;margin-top:0;margin-bottom:0;font-family:Tahoma,Arial,STHeiti,SimSun;font-size:14.0px;color:#000000;"><div  style="clear:both;">------------------------------------------------------------------</div><div  style="clear:both;">From:yangfei <yangfei@iscas.ac.cn></div><div  style="clear:both;">Send Time:2022-09-17 (Saturday) 21:12</div><div  style="clear:both;">To:郑孝林(云矅) <yunyao.zxl@alibaba-inc.com></div><div  style="clear:both;">Cc:riscv-port-dev <riscv-port-dev@openjdk.org></div><div  style="clear:both;">Subject:Re: Discuss the RVC implementation</div><div  style="clear:both;"><br /></div><p >
        Hi Xiaolin,
</p>
<p >
        <br >
</p>
<p >
        Your new proposal for supporting the RVC extension looks interesting.
</p>
<p >
        <br >
</p>
<p >
        May I ask if you have measured any performance data, including code size?
</p>
<p >
        <br >
</p>
<p >
        It would also be appreciated if you could share more details about the issue with MachBranch nodes.
</p>
<p >
        <br >
</p>
<p >
        Thanks,
</p>
<p >
        Fei
</p>
<br >

        -----Original Messages-----<br >
<b >From:</b><span  id="rc_from">"Xiaolin Zheng" <yunyao.zxl@alibaba-inc.com></span><br >
<b >Sent Time:</b><span  id="rc_senttime">2022-09-15 10:52:59 (Thursday)</span><br >
<b >To:</b> riscv-port-dev <riscv-port-dev@openjdk.org><br >
<b >Cc:</b> <br >
<b >Subject:</b> Discuss the RVC implementation<br >
<br >
        
                <div  class=" __aliyun_node_has_color" style="line-height:1.7;font-family:Tahoma,Arial,STHeiti,SimSun;font-size:14.0px;color:#000000;">
                        <div  style="clear:both;">
                                Hi team,
                        </div>
                        <div  style="clear:both;">
                                <br >
                        </div>
                        <div  style="clear:both;">
                                I am going to describe a different implementation of RVC for our backend.
                        </div>
                        <div  style="clear:both;">
                                <br >
                        </div>
                        <div  style="clear:both;">
                                <br >
                        </div>
                        <div  style="clear:both;">
                                ## Background<br >
                        </div>
                        <div  style="clear:both;">
                                <br >
                        </div>
                        <div  style="clear:both;">
                                The RISC-V C extension, also known as RVC, allows eligible 4-byte instructions to be replaced by 2-byte counterparts <span  class=" __aliyun_node_has_color __aliyun_node_has_bgcolor" style="color:#000000;font-family:Tahoma,Arial,STHeiti,SimSun;font-size:14.0px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:.0px;text-transform:none;white-space:normal;word-spacing:.0px;background-color:#ffffff;text-decoration-thickness:initial;text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline;">(for example, according to the manual, one common requirement is that Rd/Rs of the instruction fall in the range [x8, x15])</span>.<br >
                        </div>
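                        <div  style="clear:both;">
                                To make the eligibility rule concrete, here is a rough sketch for one instruction. The helper names are hypothetical and are not the HotSpot assembler API; it only models the case where "and rd, rs1, rs2" overwrites its first source, and ignores the commutative case where the operands could be swapped.
                        </div>
<pre  style="clear:both;">
```cpp
// Rough sketch; helper names are hypothetical, not the HotSpot assembler API.

// Registers x8..x15 are the ones reachable through the 3-bit register
// fields used by many RVC encodings.
constexpr bool is_compressed_reg(unsigned reg) {
  return reg >= 8 && reg < 16;
}

// "and rd, rs1, rs2" has the 2-byte form "c.and rd, rs2" only when it
// overwrites its first source and both registers are in x8..x15.
constexpr bool can_compress_and(unsigned rd, unsigned rs1, unsigned rs2) {
  return rd == rs1 && is_compressed_reg(rd) && is_compressed_reg(rs2);
}

// Size in bytes that an assembler following this rule would emit.
constexpr unsigned and_size_in_bytes(unsigned rd, unsigned rs1, unsigned rs2) {
  return can_compress_and(rd, rs1, rs2) ? 2 : 4;
}
```
</pre>
                        <div  style="clear:both;">
                                With this rule, "and x10, x10, x11" shrinks to 2 bytes, while "and x5, x5, x11" stays at 4 bytes because x5 is outside [x8, x15].
                        </div>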
                        <div  style="clear:both;">
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        ## The current implementation in HotSpot
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        The current implementation[0] is a transitional one. It introduces a "CompressibleRegion" based on RAII[1] to indicate that instructions inside such a region can safely be replaced by their RVC counterparts when convertible. In other words, it works in a, say, "whitelist mode": safe regions are marked out manually with "CompressibleRegion", and the instructions in them are compressed in batch where possible. However, on closer inspection, the current "whitelist mode" has several shortcomings:
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        ## Shortcomings of the current implementation
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        1. Coverage:
                                </div>
                                <div  style="clear:both;">
                                        The current implementation only covers some of the C2 match rules and a small part of the stub code, so there is obviously much more room to reduce the total code size. In my observations, some RISC-V instruction sequences generally occupy a bit more space than their AArch64 equivalents[2]. With the new implementation, we could reach a code size similar to that of AArch64's generated code: in my simple observations, some cases are better and some are still worse than AArch64.
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        2. Maintainability: though safe, I'd say the current approach is quite hard to maintain. The background is that most patchable instructions cannot easily be transformed into their shorter counterparts[3], so they must be prevented from being compressed. Hence the problem: we must make sure that no patchable relocation lies inside the range of a "CompressibleRegion". For example, the string comparison intrinsic[4] looks very tempting: transforming it and its siblings may yield a yummy compression rate. But programmers might have to check many of its callees to find out whether a single patchable relocation hidden inside makes the whole intrinsic incompressible. This puts an extra burden on programmers, so I bet no one would like to add a "CompressibleRegion" to their code :-)
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        3. Performance:
                                </div>
                                <div  style="clear:both;">
                                        Better performance of the generated code is a nice side effect of this extension: the smaller the code footprint in the I$, the better the performance. Please see Andrew Waterman's paper[5] for more details. In any case, it looks like a higher overall compression rate is better for performance.
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        The main issue here is that the granularity of "CompressibleRegion" is a bit coarse. "Why not exclude the incompressible parts instead?" is a question that comes up naturally. After some digging, we find that we only need to exclude a limited, enumerable set of places that will be patched later (mostly relocations), plus several fixed-length code slices whose size is calculated, such as "emit_static_call_stub". All remaining instructions can safely be transformed into their RVC counterparts if eligible. So maybe, say, a "blacklist mode"?
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        ## The new implementation
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        To implement the "blacklist mode" in the backend, we need two things:
                                </div>
                                <div  style="clear:both;">
                                        1. an "IncompressibleRegion", indicating instructions inside it should remain in their normal 4-byte form no matter what happens.
                                </div>
                                <div  style="clear:both;">
                                        2. a simple strategy to exclude patchable instructions, mainly for relocations. As we will see, the new strategy is tightly bound to the positions of relocations:
                                </div>
                                <div  style="clear:both;">
                                        We all know that "relocate()" in the HotSpot VM is a mark that has only an explicit "start point" and no end point, and that some relocations can be patched later. Therefore, we can use a simple strategy: introduce a lambda as an additional argument that gives "end point" semantics to the relocation, which meets our requirements without extra cost. For example:
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        Originally:
                                </div>
                                <div  style="clear:both;">
                                        ```
                                </div>
                                <div  style="clear:both;">
                                        __ relocate(safepoint_pc.rspec());
                                </div>
                                <div  style="clear:both;">
                                        __ la(t0, safepoint_pc.target());
                                </div>
                                <div  style="clear:both;">
                                        __ sd(t0, Address(xthread, JavaThread::saved_exception_pc_offset()));
                                </div>
                                <div  style="clear:both;">
                                        ```
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        After introducing a simple lambda as an extra argument:
                                </div>
                                <div  style="clear:both;">
                                        ```
                                </div>
                                <div  style="clear:both;">
                                        __ relocate(safepoint_pc.rspec(), [&] {   // The relocate() hides an "IncompressibleRegion" in it
                                </div>
                                <div  style="clear:both;">
                                          __ la(t0, safepoint_pc.target());       // This patchable instruction sequence is incompressible
                                </div>
                                <div  style="clear:both;">
                                        });
                                </div>
                                <div  style="clear:both;">
                                        __ sd(t0, Address(xthread, JavaThread::saved_exception_pc_offset()));
                                </div>
                                <div  style="clear:both;">
                                        ```
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        Well, simple but effective. Once these enumerable, dynamically patchable places are excluded and all relocations are unified, all other instructions can be safely transformed without messing up the current code style. Programmers can just keep following the same style; most of the time they do not need to care whether RVC exists, and things get converted automatically. The proposed sample code is, again, here[6].
                                </div>
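                                <div  style="clear:both;">
                                        The idea above can be sketched with a toy model. This is not the code in the branch at [6]: "ToyAssembler", its fields and "emit_safepoint_sequence" are hypothetical, each emit() stands in for a whole instruction sequence, and a plain function pointer stands in for the lambda parameter to keep the sketch short.
                                </div>
<pre  style="clear:both;">
```cpp
// Toy model only: every name below is hypothetical, not HotSpot code.
struct ToyAssembler {
  bool in_incompressible_region = false;
  unsigned code_size = 0;    // bytes emitted so far
  unsigned relocations = 0;  // relocation marks recorded so far

  // Emits one instruction; the 2-byte form is used only when it exists
  // and we are not inside an incompressible region.
  void emit(bool has_rvc_form) {
    code_size += (has_rvc_form && !in_incompressible_region) ? 2 : 4;
  }
};

// RAII guard: instructions emitted while it is alive keep the 4-byte form.
class IncompressibleRegion {
  ToyAssembler& masm_;
  bool saved_;
 public:
  explicit IncompressibleRegion(ToyAssembler& masm)
      : masm_(masm), saved_(masm.in_incompressible_region) {
    masm_.in_incompressible_region = true;
  }
  ~IncompressibleRegion() { masm_.in_incompressible_region = saved_; }
};

// relocate() with an "end point": the callback emits the patchable
// sequence, and the guard ends exactly when the callback returns.
void relocate(ToyAssembler& masm, void (*emit_body)(ToyAssembler&)) {
  masm.relocations++;
  IncompressibleRegion guard(masm);
  emit_body(masm);
}

unsigned emit_safepoint_sequence(ToyAssembler& masm) {
  relocate(masm, [](ToyAssembler& m) {
    m.emit(true);   // stands in for the patchable "la": kept at 4 bytes
  });
  masm.emit(true);  // stands in for the following "sd": may shrink to 2 bytes
  return masm.code_size;
}
```
</pre>
                                <div  style="clear:both;">
                                        In this model the sequence takes 6 bytes (4 + 2) instead of 8, and the caller never has to name the region explicitly: it begins and ends with the callback passed to relocate().
                                </div>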
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        ## Other things worth noting
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        1. Instruction patching issues<br >
                                </div>
                                <div  style="clear:both;">
                                        With the C extension, the backend emits a mix of 2-byte and 4-byte instructions, so it becomes a little CISC-like. We know that HotSpot patches instructions while code is running at full speed, for example call instructions and the nops used for deoptimization (the nops at entry points, and the post-call nops after Loom). Instruction patching is delicate, so we must handle such places carefully to keep these 4-byte instructions from spanning cache lines. Although they remain in the normal 4-byte form even with RVC, they might end up at an address that is only 2-byte aligned. This must not happen, because patching an instruction that spans cache lines would lose atomicity. In short, we must align them properly, as in [7][8]. This problem can exist with RVC in either "whitelist mode" or "blacklist mode"; it is a general problem of instruction patching. In future patches I will add stronger assertions at the relevant places (trampoline_call might be a very good spot, for patchable "static_call", "opt_virtual" and "virtual" relocations) to check alignment.
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        2. MachBranch Nodes
                                </div>
                                <div  style="clear:both;">
                                        MachBranch nodes are not easy to tame, because the "fake label"[9] in PhaseOutput::scratch_emit_size() cannot tell us the real distance to the label. But we can leave them out of this discussion, since there will be patches to handle them afterward.
                                </div>
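                                <div  style="clear:both;">
                                        To illustrate the alignment concern in item 1, here is a small sketch of the arithmetic. The names and the 64-byte cache line size are assumptions for illustration, not HotSpot code, and addresses are assumed to be at least 2-byte aligned, as they are with RVC.
                                </div>
<pre  style="clear:both;">
```cpp
// Toy arithmetic only; names and the 64-byte line size are assumptions.
constexpr unsigned long kCacheLineSize = 64;
constexpr unsigned long kPatchableInsnSize = 4;

// True if the bytes [addr, addr + size) touch more than one cache line.
constexpr bool spans_cache_lines(unsigned long addr, unsigned long size) {
  return addr / kCacheLineSize != (addr + size - 1) / kCacheLineSize;
}

// Number of 2-byte nops to emit before a patchable 4-byte instruction so
// that it starts at a 4-byte aligned address (addr is 2-byte aligned).
constexpr unsigned long nops_needed_before(unsigned long addr) {
  return (addr % kPatchableInsnSize) / 2;
}
```
</pre>
                                <div  style="clear:both;">
                                        A 4-byte instruction at offset 62 would span two 64-byte lines, while the same instruction after one 2-byte nop starts at offset 64 and cannot, because a 4-byte aligned 4-byte instruction never straddles a line whose size is a multiple of 4.
                                </div>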
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        That's nearly all. Thanks for reading this far despite the verbosity. Any suggestions would be very welcome.
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        Best,
                                </div>
                                <div  style="clear:both;">
                                        Xiaolin
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        <br >
                                </div>
                                <div  style="clear:both;">
                                        [0] Original patch: <a  href="https://github.com/openjdk/riscv-port/pull/34" target="_blank">https://github.com/openjdk/riscv-port/pull/34</a> 
                                </div>
                                <div  style="clear:both;">
                                        [1] Of course, the "CompressibleRegion" is good and I like it; the idea is not my own.
                                </div>
                                <div  style="clear:both;">
                                        [2] As a simple example, the very commonly used fixed-length movptr() takes six 4-byte instructions (lui+addi+slli+addi+slli+addi, MIPS-like arithmetic with sign extension, but not anyone's fault :-) ), while the AArch64 counterpart takes only three 4-byte instructions (movz+movk+movk). Both load a 48-bit immediate. Accumulated over all the generated code, the size difference is considerable.
                                </div>
                                <div  style="clear:both;">
                                        [3] 2-byte instructions have fewer bits, so their immediate encodings etc. are shorter than those of the 4-byte counterparts. If we transform patchable instructions (the ones at the marks of patchable relocations, etc.) into 2-byte ones and they are later patched to a larger value or a farther distance, the shorter instructions may be unable to encode the newly patched value. So we need to exclude patchable instructions (at relocation marks etc.) from being compressed.
                                </div>
                                <div  style="clear:both;">
                                        [4] <a  href="https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/riscv.ad#L10032-L10035" target="_blank">https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/riscv.ad#L10032-L10035</a> 
                                </div>
                                <div  style="clear:both;">
                                        [5] <a  href="https://digitalassets.lib.berkeley.edu/etd/ucb/text/Waterman_berkeley_0028E_15908.pdf" target="_blank">https://digitalassets.lib.berkeley.edu/etd/ucb/text/Waterman_berkeley_0028E_15908.pdf</a>, Page 64: "5.4 The RVC Extension, Performance Implications"
                                </div>
                                <div  style="clear:both;">
                                        [6] <a  href="https://github.com/zhengxiaolinX/jdk/tree/REBASE-rvc-beautify" target="_blank">https://github.com/zhengxiaolinX/jdk/tree/REBASE-rvc-beautify</a> 
                                </div>
                                <div  style="clear:both;">
                                        [7] <a  href="https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/riscv.ad#L9873" target="_blank">https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/riscv.ad#L9873</a> 
                                </div>
                                <div  style="clear:both;">
                                        [8] <a  href="https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp#L1348-L1353" target="_blank">https://github.com/openjdk/jdk/blob/7f3250d71c4866a64eb73f52140c669fe90f122f/src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp#L1348-L1353</a> 
                                </div>
<span >[9] <a  href="https://github.com/openjdk/jdk/blob/211fab8d361822bbd1a34a88626853bf4a029af5/src/hotspot/share/opto/output.cpp#L3331-L3340" target="_blank">https://github.com/openjdk/jdk/blob/211fab8d361822bbd1a34a88626853bf4a029af5/src/hotspot/share/opto/output.cpp#L3331-L3340</a></span> 
                        </div>
                </div>
        
</blockquote></div></div>