RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" [v4]

Tue Jul 2 10:19:27 UTC 2024

On Mon, 24 Jun 2024 13:51:34 GMT, Fei Gao <fgao at openjdk.org> wrote:

>> On LP64 systems, if the heap can be moved into low virtual address space (below 4GB) and the heap size is smaller than the interesting threshold of 4 GB, we can use unscaled decoding pattern for narrow klass decoding. It means that a generic field reference can be decoded by:
>> 
>> cast<64> (32-bit compressed reference) + field_offset
>> 
>> 
>> When the `field_offset` is an immediate, on aarch64 platform, the unscaled decoding pattern can match perfectly with a direct addressing mode, i.e., `base_plus_offset`, supported by `LDR/STR` instructions. But for certain data width, not all immediates can be encoded in the instruction field of `LDR/STR` [[1]](https://github.com/openjdk/jdk/blob/8db7bad992a0f31de9c7e00c2657c18670539102/src/hotspot/cpu/aarch64/assembler_aarch64.inline.hpp#L33). The ranges are different as data widths vary.
>> 
>> For example, when we try to load a value of long type at offset of `1030`, the address expression is `(AddP (DecodeN base) 1030)`. Before the patch, the expression was matching with `operand indOffIN()`. But, for 64-bit `LDR/STR`, signed immediate byte offset must be in the range -256 to 255 or positive immediate byte offset must be a multiple of 8 in the range 0 to 32760 [[2]](https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDR--immediate---Load-Register--immediate--?lang=en). `1030` can't be encoded in the instruction field. So, after matching, when we do checking for instruction encoding, the assertion would fail.
>> 
>> In this patch, we're going to filter out invalid immediates when deciding if current addressing mode can be matched as `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data type separately in the patch. E.g., for `memory4`, we remove the generic `indOffIN/indOffLN`, which matches wrong unscaled immediate range, and replace them with `indOffIN4/indOffLN4` instead.
>> 
>> Since 8-bit and 16-bit `LDR/STR` instructions also support the unscaled decoding pattern, we add the addressing mode in the lists of `memory1` and `memory2` by introducing `indOffIN1/indOffLN1` and `indOffIN2/indOffLN2`.
>> 
>> We also remove unused operands `indOffI/indOffl/indOffIN/indOffLN` to avoid misuse.
>> 
>> Tier 1-3 passed on aarch64.
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision:
> 
>  - Discard IndOffXX style and let legitimize_address() fix any out-of-range immediate offsets
>  - Merge branch 'master' into fg8319690
>  - Add the assertion back and merge matchrules with a better predicate
>  - Merge branch 'master' into fg8319690
>  - Remove unused immIOffset/immLOffset
>  - Merge branch 'master' into fg8319690
>  - 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug"
>    
>    On LP64 systems, if the heap can be moved into low virtual
>    address space (below 4GB) and the heap size is smaller than the
>    interesting threshold of 4 GB, we can use unscaled decoding
>    pattern for narrow klass decoding. It means that a generic field
>    reference can be decoded by:
>    ```
>    cast<64> (32-bit compressed reference) + field_offset
>    ```
>    
>    When the `field_offset` is an immediate, on aarch64 platform, the
>    unscaled decoding pattern can match perfectly with a direct
>    addressing mode, i.e., `base_plus_offset`, supported by LDR/STR
>    instructions. But for certain data width, not all immediates can
>    be encoded in the instruction field of LDR/STR[1]. The ranges are
>    different as data widths vary.
>    
>    For example, when we try to load a value of long type at offset of
>    `1030`, the address expression is `(AddP (DecodeN base) 1030)`.
>    Before the patch, the expression was matching with
>    `operand indOffIN()`. But, for 64-bit LDR/STR, signed immediate
>    byte offset must be in the range -256 to 255 or positive immediate
>    byte offset must be a multiple of 8 in the range 0 to 32760[2].
>    `1030` can't be encoded in the instruction field. So, after
>    matching, when we do checking for instruction encoding, the
>    assertion would fail.
>    
>    In this patch, we're going to filter out invalid immediates
>    when deciding if current addressing mode can be matched as
>    `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and
>    `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data
>    type separately in the patch. E.g., for `memory4`, we remove
>    the generic `indOffIN/indOffLN`, which matches wrong unscaled
>    immediate range, and replace them with `indOffIN4/indOffLN4`
>    instead.
>    
>    Since 8-bit and 16-bit LDR/STR instructions also support the
>    unscaled decoding...

I think this fixes it. `Unsafe` intrinsifies direct memory access using a `long` as the base address, and it seems C2 needs to be told that the conversion from long->ptr is a nop. Having said that, I'm not sure why this PR causes the extra register copy.

diff --git a/src/hotspot/cpu/aarch64/aarch64.ad b/src/hotspot/cpu/aarch64/aarch64.ad
index 985784c70fa..ed8cf0c698b 100644
--- a/src/hotspot/cpu/aarch64/aarch64.ad
+++ b/src/hotspot/cpu/aarch64/aarch64.ad
@@ -5134,6 +5134,20 @@ operand indOffL(iRegP reg, immLOffset off)
   %}
 %}
 
+operand indOffX2P(iRegL reg, immLOffset off)
+%{
+  constraint(ALLOC_IN_RC(ptr_reg));
+  match(AddP (CastX2P reg) off);
+  op_cost(0);
+  format %{ "[$reg, $off]" %}
+  interface(MEMORY_INTER) %{
+    base($reg);
+    index(0xffffffff);
+    scale(0x0);
+    disp($off);
+  %}
+%}
+
 operand indirectN(iRegN reg)
 %{
   predicate(CompressedOops::shift() == 0);
@@ -5469,7 +5483,7 @@ opclass vmem(indirect, indIndex, indOffI, indOffL, indOffIN, indOffLN);
 // memory is used to define read/write location for load/store
 // instruction defs. we can turn a memory op into an Address
 
-opclass memory(indirect, indIndexScaled, indIndexScaledI2L, indIndexI2L, indIndex, indOffI, indOffL,
+opclass memory(indirect, indIndexScaled, indIndexScaledI2L, indIndexI2L, indIndex, indOffI, indOffL, indOffX2P,
                indirectN, indIndexScaledN, indIndexScaledI2LN, indIndexI2LN, indIndexN, indOffIN, indOffLN);

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16991#issuecomment-2202656142