[riscv-port] RFR: 8278322: riscv: Support RVC: compressed instructions

Thu Dec 9 03:57:31 UTC 2021

On Tue, 7 Dec 2021 04:46:10 GMT, Xiaolin Zheng <xlinzheng at openjdk.org> wrote:

> Hi team,
>
> This patch support RISC-V RVC extension. It can introduce:
> * 21% code size reduction in template interpreter generated code
> * 20%~25% code size reduction in C1 generated code, evaluated by a common SpringBoot program
> * 15%~20% code size reduction in C2 generated code, evaluated by a common SpringBoot program
>
> In my observation, the code size footprint could be reduced to nearly a level of the AArch64 back-end. About the performance, there seems a stable ~0.7% performance improvement on SPECjbb2015 on one HiFive Unleashed board, considering the code density increase. I think the performance aspect might be a speculative behavior on different hardware implementations because C910's performance might be better than that, but HiFive Unleashed may be more general.
>
> Things about this patch:
> * If an instruction is compressible, then we will implicitly emit a 16-bit compressed instruction instead of the 32-bit instruction in Assembler.
> * About the `_nc` postfix of some of Assembler instructions: we know a bunch of places should be reserved for patching, where we cannot change them into compressed instructions. `_nc` is short for `not compressible` - with this, those instructions should keep their origin 4-byte form and remain uncompressed.
> * There are things not easy to compress like MachBranchNodes. Please see the comments in the code - currently this patch does not support this. We will support this improvement in patches coming afterward.
>
> The macros after their expansion might be like:
>
> void andr(Register Rd, Register Rs1, Register Rs2) {
> +  {
> +    Register src = noreg;
> +    if (UseRVC && Rs1->is_compressed_valid() && Rs2->is_compressed_valid() &&
> +        ((src = Rs1, Rs2 == Rd) || (src = Rs2, Rs1 == Rd))) {
> +      and_c(Rd, src);
> +      return;
> +    }
> +  }
>   unsigned insn = 0;
>   patch((address)&insn, 6, 0, 0b0110011);
>   patch((address)&insn, 14, 12, 0b111);
>   patch((address)&insn, 31, 25, 0b0000000);
>   patch_reg((address)&insn, 7, Rd);
>   patch_reg((address)&insn, 15, Rs1);
>   patch_reg((address)&insn, 20, Rs2);
>   emit(insn);
> };
>
>
> For further information, please see comments in `assembler_riscv_cext.hpp`.
>
> --
>
> This patch may need some time to acquire a review. I have been polishing this patch for quite a long time and it might seem stable under a bunch of full-tier tests and some tier1 jdk/hotspot jtreg tests. But I think we might mark it `experimental` at first, though it is turned on by default. [The original patch](https://github.com/riscv-collab/riscv-openjdk/pull/7), just for reference, might be quite different from the current one.
>
> I am pleased to receive any suggestion.
>
> Thanks,
> Xiaolin

Thanks for your reviews, Felix and Yadong, and also thanks for your advice -- also quite sorry for the late reply.

I think I totally understand your concerns.

Firstly please let me talk about my opinion about this.
Most compressible instructions could be safely compressed, say, ~25% of total RISCV instructions, except only several places like
1. patchable code, for instance: `la_patchable` related code;
2. and several places requiring a fixed length like `BoxLockNode`.
We have found these places, excluded them in this patch, and wrapped them for developers into `ld_patchable`, `jalr_patchable` in this patch. If you feel unsafe for misuse from developers, maybe we can further carry out some strong guarantees to prevent this from happening. We can disable `UseRVC` by default for now, the jvm shoud be same behavior as before. And after enough testing, we can have more confidence to enable it.

I think your suggestions are indeed reasonable in your aspects, for safety.

-----

### About our first scheme:

We may mark some instructions as compressible by adding function arguments. But compressed instructions can be used in almost every place in the code cache, like C1, C2, interpreter, and stub code. The range is quite big -- for instance, after some time we may want to compress some intrinsics, say, `C2_MacroAssembler::string_indexof_char_short()`.

After `addi(Rd, Rs, imm)` becomes `addi(Rd, Rs, imm, bool compressible)`, it will be

bind(MATCH1);
addi(index, index, 1);   ->    addi(index, index, 1, true);
j(MATCH);                ->    j(MATCH, true);

bind(MATCH2);
addi(index, index, 2);   ->    addi(index, index, 2, true);
j(MATCH);                ->    j(MATCH, true)

bind(MATCH3);
addi(index, index, 3);   ->   ...
j(MATCH);

bind(MATCH4);
addi(index, index, 4);
j(MATCH);

bind(MATCH5);
addi(index, index, 5);
j(MATCH);

We might not add extra explicit arguments to them since this may make maintenance quite hard. :-)

-----
### About our second scheme:

Considering the `CompressibleRegion` solution, this could indeed, largely expand the range of code included. But we may face some situations:

As code function and features begin to complete, we may continue to make code compress further denser afterward. If we want to use this region to include a quite large range of function, of which we think is safe to compress.

#define __ _masm->
intrinsic A()
{
CompressibleRegion cr;                 // The CompressibleRegion

.. balabala ...
.. balabala ...
__ macroassembler_function();          // Here may include a tiny 'la_patchable'
.. balabala ...
.. balabala ...
}

Inside the `macroassembler_function()`, a `la_patchable` might be hidden somewhere...
What should we do? It seems that we need another `UncompressibleRegion ur` to exclude such a smaller region.

#define __ _masm->
intrinsic A()
{
CompressibleRegion cr;

.. balabala ...
.. balabala ...
{
UncompressibleRegion ur;             // Add one UncompressibleRegion
__ macroassembler_function();
}
.. balabala ...
.. balabala ...
}

As the inherent laws of things, the number of `CompressibleRegion` will inevitably become more and more; the same pattern emerges one by one; hence, the `CompressibleRegion` may fly to every part of the world; hmm, we may think naturally: why don't we remove the `CompressibleRegion` and remain the `UncompressibleRegion`? Because it seems most of the code could get compressed, and only several places could not be compressed. Why not maintain a blacklist for them? ...

So comes

#define __ _masm->
intrinsic A()
{
.. balabala ... (automatically compressed)
.. balabala ... (automatically compressed)
{
__ macroassembler_function_nc();               // nc things :-)
}
.. balabala ... (automatically compressed)
.. balabala ... (automatically compressed)
}

I think finally it may become the solution of this patch (i.e. the blacklist)... :-)

I hope this analysis is reasonable and it could bring you some smiles.

In that the blacklist must be the core of exclusions, I might recommend focusing on the maintenance of the blacklist, because anyway, we might face the blacklist in the future. To relieve your concerns, maybe making `UseRVC` false as default for now is a better plan because it is an one-click switch.

But, to step back, I totally understand your points as well. If you still consider the whitelist scheme might be a good scheme for you, I will choose to use your scheme for now, and add the `CompressibleRegion` and `UncompressibleRegion` gradually. But I assume maybe turning `UseRVC` as false could better solve this?

-----
### About splitting the patch:

Let's come back to our topic. I have considered from your angle and I begin to realize this design might be quite counterintuitive for guys who always use the normal 4-byte instructions to work; and also because of this 'blacklist' mode, which may cause a burden for reviewers to review. But I believe the primary part (the definitions of C-Ext, like `c.j/c.beqz/c.add`... stuff) could be trivial and this could be separated as a primitive patch -- it may relieve the burden for the reviewers.

-----
### About an ideal but specious solution:

In fact, I think the ideal solution might be to use another brandy-new pass to handle this compression logic, at the end of C1's `Compilation::emit_code_body`, C2's `PhaseOutput`, and other `BufferBlobs`' end of emission, to compress instructions in batches after all instructions have emitted to `CodeBlob`s. I think it will prevent all the above-mentioned issues: no hard-to-understand `_nc` things, no black and white lists, and programmers have no need to care about whether there is C-Ext or not. But it might be very difficult and specious:
* First, we need to re-do all relocations, check all paddings like `CodeEntryAlignment`, `OptoLoopAlignment`... also self-written `__ align()`s. If developers want to add ten `nops` to align addresses, I think we lack knowledge about these things just before code installation and make the code not 'what they see is what they get'.
* Second, we need to re-do all `OopMap`s and other related stuff. These delicate data structures are precisely calculated by C2's `BuildOopMaps()`. If we insert a phase at the end of all phases to compress code, it will be no less than doing these stuff once again. Also, for modifying those maps and debuginfos, maybe we need to add APIs to change the data inside them. This will further cause invasions to the public codebase just because RISC-V has this extension -- maybe other guys will not allow us to do such a big change. Besides, I speculate that it should be limited to our back-end only.

I consider this strategy has other limits except for those above-mentioned ones. So I did not choose this plan.

-----

It might be so energy-consuming for me to write so many words in English to explain my reasoning and write analysis in detail. If there are some typos or errors, please don't mind. I wonder if is there a need to arrange a small meeting or not -- we may need to split the patch, discuss friendly about solutions and precautions, and gradually reach a consensus. I think for this patch, we might finally reach that place.

Thanks again for your reviews and advice.

Sincerely,
Xiaolin

-------------

PR: https://git.openjdk.java.net/riscv-port/pull/24