RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" [v3]

Thu Jun 6 15:41:58 UTC 2024

On Wed, 29 May 2024 08:46:51 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> On LP64 systems, if the heap can be moved into low virtual address space (below 4GB) and the heap size is smaller than the interesting threshold of 4 GB, we can use unscaled decoding pattern for narrow klass decoding. It means that a generic field reference can be decoded by:
>> 
>> cast<64> (32-bit compressed reference) + field_offset
>> 
>> 
>> When the `field_offset` is an immediate, on aarch64 platform, the unscaled decoding pattern can match perfectly with a direct addressing mode, i.e., `base_plus_offset`, supported by `LDR/STR` instructions. But for certain data width, not all immediates can be encoded in the instruction field of `LDR/STR` [[1]](https://github.com/openjdk/jdk/blob/8db7bad992a0f31de9c7e00c2657c18670539102/src/hotspot/cpu/aarch64/assembler_aarch64.inline.hpp#L33). The ranges are different as data widths vary.
>> 
>> For example, when we try to load a value of long type at offset of `1030`, the address expression is `(AddP (DecodeN base) 1030)`. Before the patch, the expression was matching with `operand indOffIN()`. But, for 64-bit `LDR/STR`, signed immediate byte offset must be in the range -256 to 255 or positive immediate byte offset must be a multiple of 8 in the range 0 to 32760 [[2]](https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDR--immediate---Load-Register--immediate--?lang=en). `1030` can't be encoded in the instruction field. So, after matching, when we do checking for instruction encoding, the assertion would fail.
>> 
>> In this patch, we're going to filter out invalid immediates when deciding if current addressing mode can be matched as `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data type separately in the patch. E.g., for `memory4`, we remove the generic `indOffIN/indOffLN`, which matches wrong unscaled immediate range, and replace them with `indOffIN4/indOffLN4` instead.
>> 
>> Since 8-bit and 16-bit `LDR/STR` instructions also support the unscaled decoding pattern, we add the addressing mode in the lists of `memory1` and `memory2` by introducing `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`.
>> 
>> We also remove unused operands `indOffI/indOffl/indOffIN/indOffLN` to avoid misuse.
>> 
>> Tier 1-3 passed on aarch64.
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision:
> 
>  - Add the assertion back and merge matchrules with a better predicate
>  - Merge branch 'master' into fg8319690
>  - Remove unused immIOffset/immLOffset
>  - Merge branch 'master' into fg8319690
>  - 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug"
>    
>    On LP64 systems, if the heap can be moved into low virtual
>    address space (below 4GB) and the heap size is smaller than the
>    interesting threshold of 4 GB, we can use unscaled decoding
>    pattern for narrow klass decoding. It means that a generic field
>    reference can be decoded by:
>    ```
>    cast<64> (32-bit compressed reference) + field_offset
>    ```
>    
>    When the `field_offset` is an immediate, on aarch64 platform, the
>    unscaled decoding pattern can match perfectly with a direct
>    addressing mode, i.e., `base_plus_offset`, supported by LDR/STR
>    instructions. But for certain data width, not all immediates can
>    be encoded in the instruction field of LDR/STR[1]. The ranges are
>    different as data widths vary.
>    
>    For example, when we try to load a value of long type at offset of
>    `1030`, the address expression is `(AddP (DecodeN base) 1030)`.
>    Before the patch, the expression was matching with
>    `operand indOffIN()`. But, for 64-bit LDR/STR, signed immediate
>    byte offset must be in the range -256 to 255 or positive immediate
>    byte offset must be a multiple of 8 in the range 0 to 32760[2].
>    `1030` can't be encoded in the instruction field. So, after
>    matching, when we do checking for instruction encoding, the
>    assertion would fail.
>    
>    In this patch, we're going to filter out invalid immediates
>    when deciding if current addressing mode can be matched as
>    `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and
>    `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data
>    type separately in the patch. E.g., for `memory4`, we remove
>    the generic `indOffIN/indOffLN`, which matches wrong unscaled
>    immediate range, and replace them with `indOffIN4/indOffLN4`
>    instead.
>    
>    Since 8-bit and 16-bit LDR/STR instructions also support the
>    unscaled decoding pattern, we add the addressing mode in the
>    lists of `memory1` and `memory2` by introducing
>    `indOffIN1/indOffLN1` and `indOffIN2/...

On 6/6/24 13:42, Fei Gao wrote:
> Sorry, did you mean loading from base plus offset, like `ldr x0, [x6,
> #8]` or `ldr x0, [x6, x7]`, takes one more cycle than loading from base
> register only, like `ldr x0, [x6]`?  Does the address addition take one
> cycle? 

We know that, on many Arm cores, Store μOPs are split into address and
data μOPs which are executed separately. That doesn't usually cause
any additional delay, because cores execute many operations in
parallel, so an address generation μOP for base+offset very probably
will execute in parallel with some previous instructions, meaning that
the target address is ready before it is needed. This split of address
generation must happen regardless of whether a store (or a load) is a
single instruction

`str x0, [x1, #80]`

or a pair of instructions

`add r8, x1, #80; str x0, [x8]`.

Of course, a pair of instructions occupies twice as much icache space,
and you can run out of instruction decode bandwidth. However, in the
case of Unsafe operations, I don't believe that an occasional
unnecessary two-instruction operation will result in a performance
regression.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-2152841468