RFR: 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug" [v3]

Fri May 31 15:17:13 UTC 2024

On Wed, 29 May 2024 18:41:40 GMT, Dean Long <dlong at openjdk.org> wrote:

>> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision:
>> 
>>  - Add the assertion back and merge matchrules with a better predicate
>>  - Merge branch 'master' into fg8319690
>>  - Remove unused immIOffset/immLOffset
>>  - Merge branch 'master' into fg8319690
>>  - 8319690: [AArch64] C2 compilation hits offset_ok_for_immed: assert "c2 compiler bug"
>>    
>>    On LP64 systems, if the heap can be moved into low virtual
>>    address space (below 4GB) and the heap size is smaller than the
>>    interesting threshold of 4 GB, we can use unscaled decoding
>>    pattern for narrow klass decoding. It means that a generic field
>>    reference can be decoded by:
>>    ```
>>    cast<64> (32-bit compressed reference) + field_offset
>>    ```
>>    
>>    When the `field_offset` is an immediate, on aarch64 platform, the
>>    unscaled decoding pattern can match perfectly with a direct
>>    addressing mode, i.e., `base_plus_offset`, supported by LDR/STR
>>    instructions. But for certain data width, not all immediates can
>>    be encoded in the instruction field of LDR/STR[1]. The ranges are
>>    different as data widths vary.
>>    
>>    For example, when we try to load a value of long type at offset of
>>    `1030`, the address expression is `(AddP (DecodeN base) 1030)`.
>>    Before the patch, the expression was matching with
>>    `operand indOffIN()`. But, for 64-bit LDR/STR, signed immediate
>>    byte offset must be in the range -256 to 255 or positive immediate
>>    byte offset must be a multiple of 8 in the range 0 to 32760[2].
>>    `1030` can't be encoded in the instruction field. So, after
>>    matching, when we do checking for instruction encoding, the
>>    assertion would fail.
>>    
>>    In this patch, we're going to filter out invalid immediates
>>    when deciding if current addressing mode can be matched as
>>    `base_plus_offset`. We introduce `indOffIN4/indOffLN4` and
>>    `indOffIN8/indOffLN8` for 32-bit data type and 64-bit data
>>    type separately in the patch. E.g., for `memory4`, we remove
>>    the generic `indOffIN/indOffLN`, which matches wrong unscaled
>>    immediate range, and replace them with `indOffIN4/indOffLN4`
>>    instead.
>>    
>>    Since 8-bit and 16-bit LDR/STR instructions also support the
>>    unscaled decoding pattern, we add the addressing mode in the
>>...
>
> src/hotspot/cpu/aarch64/aarch64.ad line 5193:
> 
>> 5191:   constraint(ALLOC_IN_RC(ptr_reg));
>> 5192:   match(AddP reg off);
>> 5193:   match(AddP (DecodeN regn) off);
> 
> I'm surprised this works.  If we match on "DecodeN regn", is it really safe to use $reg instead?

Thanks for your review, @dean-long .

Yes, based on the current implementation of our ADL compiler, even if we match on "DecodeN regn", using `$reg` is safe and perhaps even must.

When ADLC is parsing operand interface from `indOffIX`, it always fetches useful information from the **first** match rule `match(AddP reg off)` and does not care about others, even though we have multiple match rules.

See https://github.com/openjdk/jdk/blob/1e04ee6d57d5fe84e1d202b16e8d13dc13c002ff/src/hotspot/share/adlc/formssel.cpp#L2461 and https://github.com/openjdk/jdk/blob/1e04ee6d57d5fe84e1d202b16e8d13dc13c002ff/src/hotspot/share/adlc/output_c.cpp#L3025. 

It searches `reg` in `match(AddP reg off);` and finds that `reg` is the `first` one in all components, which is like `regn` is the `first` in `match(AddP (DecodeN regn) off);`. Then it concludes that the **first** operand starting from `oper_input_base()` is the base address input. In the stage of `emit()`, the node structure has been reduced into like:

Load === ctrl mem reg  val 
Load === ctrl mem regn val 

`off` is saved on Operand field.

The final JVM code will be shown as:

void loadLNode::emit(C2_MacroAssembler* masm, PhaseRegAlloc* ra_) const {
  // Start at oper_input_base() and count operands
  unsigned idx0 = 2;
  unsigned idx1 = 2; 	// mem
  {

#line 2914 "/home/feigao02/chelsea/jdk_src/src/hotspot/cpu/aarch64/aarch64.ad"

    Register dst_reg = as_Register(opnd_array(0)->reg(ra_,this)/* dst */);
    loadStore(masm, &MacroAssembler::ldr, dst_reg, opnd_array(1)->opcode(),
               as_Register(opnd_array(1)->base(ra_,this,idx1)), opnd_array(1)->index(ra_,this,idx1), opnd_array(1)->scale(), opnd_array(1)->disp(ra_,this,idx1), 8);

#line 999999
  }
}

  virtual int base(PhaseRegAlloc *ra_, const Node *node, int idx) const { 
    // Replacement variable: reg
    return (int)ra_->get_encode(node->in(idx));
  }

To be honest, `$reg` here is a little confusing but, IMO, it may represent a relative index. WDYT? Thanks.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16991#discussion_r1622527758