RFR: 8320069: RISC-V: Add Zcb instructions

Mon Jan 8 07:27:23 UTC 2024

On Tue, 19 Dec 2023 14:25:27 GMT, Vladimir Kempik <vkempik at openjdk.org> wrote:

>>> > > We already have macros for loads and stores in macroAssembler_riscv.hpp. What's the reason to make the compression decision in assembler_riscv.hpp instead (not saying it's wrong)?
>>> > > https://github.com/openjdk/jdk/blob/38d94725a1a85156e30b72b325886b0e25d4db03/src/hotspot/cpu/riscv/macroAssembler_riscv.hpp#L880
>>> > 
>>> > 
>>> > No, you are correct; I also think this is not optimal. I don't know the background, but it seems like this is the easiest way to add compressed instructions transparently. But to fully utilize the C instructions we should favor x8-x15; we often don't get a C instruction because, e.g., BCP is in x22. I think that to better utilize C we can't keep it so transparent.
>>> > So here I just try to follow the current code; see how lw is changed to c_lw.
>>> 
>>> Not exactly related to this PR, but I also saw strange behaviour from MacroAssembler's lwu: it was generating lw + and (a kind of lwu emulation) instead of lwu.
>>> 
>>> An example:
>>> 
>>> ```
>>>    0.44%  ?  0x0000003fa46a86c8:   slli    t3,t3,0x20
>>>    0.48%  ?  0x0000003fa46a86ca:   addi    t3,t3,-1
>>>    ....
>>>    3.11%  ?  0x0000003fa46a86dc:   lw      a0,0(t1)
>>>    5.34%  ?  0x0000003fa46a86e0:   and     a0,a0,t3
>>> ```
>>> 
>>> Using Assembler::lwu directly resulted in a correctly generated lwu.
>> 
>> Interesting. This does not seem to match the code of `MacroAssembler::lwu`. I wonder how that could happen.
>
> If you take this PR https://github.com/openjdk/jdk/pull/17046/files#diff-7a5c3ed05b6f3f06ed1c59f5fc2a14ec566a6a5bd1d09606115767daa99115bdR3717 and change the explicit Assembler::lwu() to lwu(), then you are likely to see this issue.

Thank you @VladimirKempik @RealFYang
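
For anyone following the lw + and observation above, here is a minimal standalone sketch (plain C++, not HotSpot code) of why that sequence acts as an lwu emulation. The `model_*` helpers and `is_compressible_reg` are hypothetical names made up for illustration. It also assumes t3 held 1 before the slli, which the quoted listing does not show. The last helper reflects the x8-x15 point: c.lw encodes its registers in 3-bit fields, so a base register such as x22 cannot be compressed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Models RV64 `lw`: load 32 bits and sign-extend them to 64 bits.
inline uint64_t model_lw(const void* addr) {
  int32_t v;
  std::memcpy(&v, addr, sizeof v);
  return static_cast<uint64_t>(static_cast<int64_t>(v));
}

// Models RV64 `lwu`: load 32 bits and zero-extend them to 64 bits.
inline uint64_t model_lwu(const void* addr) {
  uint32_t v;
  std::memcpy(&v, addr, sizeof v);
  return v;
}

// Models the sequence from the profile: build a 32-bit mask, then lw + and.
inline uint64_t model_lw_and_mask(const void* addr) {
  uint64_t t3 = 1;               // ASSUMPTION: not visible in the listing
  t3 <<= 32;                     // slli t3,t3,0x20
  t3 -= 1;                       // addi t3,t3,-1  -> 0x00000000ffffffff
  uint64_t a0 = model_lw(addr);  // lw   a0,0(t1)  (sign-extended)
  return a0 & t3;                // and  a0,a0,t3  (upper 32 bits cleared)
}

// Hypothetical helper: c.lw/c.sw can only name x8..x15 in their 3-bit fields.
inline bool is_compressible_reg(unsigned xreg) {
  return xreg >= 8 && xreg <= 15;
}

// Checks that the emulation agrees with lwu on a few boundary values.
inline void check_examples() {
  const uint32_t words[] = {0u, 1u, 0x7fffffffu, 0x80000000u, 0xffffffffu};
  for (uint32_t w : words) {
    assert(model_lw_and_mask(&w) == model_lwu(&w));
  }
}
```

Both forms produce the same value, so the emulation is correct but costs extra instructions and a temporary register compared with a single lwu.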

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17122#issuecomment-1880494200