RFR: 8020282: Generated code quality: redundant LEAs in the chained dereferences [v2]

Thu Jun 12 13:06:42 UTC 2025

On Wed, 4 Jun 2025 11:13:35 GMT, Manuel Hässig <mhaessig at openjdk.org> wrote:

>> ## Summary
>> 
>> On x86, chained dereferences of narrow oops at a constant offset from the base oop can use a `lea` instruction to perform the address computation in one go using the `leaP8Narrow`, `leaP32Narrow`, and `leaPCompressedOopOffset` matching rules. However, the generated code contains an additional `lea` with an unused result:
>> 
>> ; OptoAssembly
>> 03d     decode_heap_oop_not_null R8,R10
>> 041     leaq    R10, [R12 + R10 << 3 + #12] (compressed oop addressing) ; ptr compressedoopoff32
>> 
>> ; x86
>> 0x00007f1f210625bd:   lea    (%r12,%r10,8),%r8        ; result is unused
>> 0x00007f1f210625c1:   lea    0xc(%r12,%r10,8),%r10    ; the same computation as decode, but with offset
>> 
>> 
>> This PR adds a peephole optimization to remove such redundant `lea`s.
>> 
>> ## The Issue in Detail
>> 
>> The ideal subgraph producing redundant `lea`s, or rather redundant `decodeHeapOop_not_null`s, is `LoadN -> DecodeN -> AddP`, where both the address and base edge of the `AddP` originate from the `DecodeN`. After matching, this becomes
>> 
>> LoadN -> decodeHeapOop_not_null -> leaP*
>>     ______________________________Î
>> 
>> where `leaP*` is either of `leaP8Narrow`, `leaP32Narrow`, or `leaPCompressedOopOffset` (depending on the heap location and size). Here, the base input of `leaP*` comes from the decode. Looking at the matching code path, we find that the `leaP*` rules match both the `AddP` and the `DecodeN`, since x86 can fold this, but the following code adds the decode back as the base input to `leaP*`:
>> 
>> https://github.com/openjdk/jdk/blob/c29537740efb04e061732a700582d43b1956cff4/src/hotspot/share/opto/matcher.cpp#L1894-L1897
>> 
>> On its face, this is completely unnecessary if we matched a `leaP*`, since it already computes the result of the decode,  so adding the `LoadN` node as base seems like the logical choice. However, if the derived oop computed by the `leaP*` gets added to an oop map, this `DecodeN` is needed as the base for the derived oop. Because as of now, derived oops in oop maps cannot have narrow base pointers.
>> 
>> This leaves us with a handful of possible solutions:
>>  1. implement narrow bases for derived oops in oop maps,
>>  2. perform some dead code elimination after we know which oops are part of oop maps,
>>  3. add a peephole optimization to simply remove unused `lea`s.
>> 
>> Option 1 would have been ideal in the sense, that it is the earliest possible point to remove the decode, which would simplify the graph and reduce pressure on the regi...
>
> Manuel Hässig has updated the pull request incrementally with three additional commits since the last revision:
> 
>  - Add comment to benchmark as to why we fix the heap size
>  - Add missing null chec
>  - Fix typos

Thank you for your thorough work here Manuel, in particular for carefully exploring and discussing alternative solutions! The peephole approach looks good to me, I just have a few coding suggestions and test questions.

src/hotspot/cpu/x86/peephole_x86_64.cpp line 240:

> 238: 
> 239: // This function removes redundant lea instructions that result from chained dereferences that
> 240: // match to leaPCompressedOopOffset, leaP8Narow, or leaP32Narrow. This happens for ideal graphs

Suggestion:

// match to leaPCompressedOopOffset, leaP8Narrow, or leaP32Narrow. This happens for ideal graphs

src/hotspot/cpu/x86/peephole_x86_64.cpp line 267:

> 265: //           |   /              \
> 266: //           leaP*          MachProj (leaf)
> 267: // In this case where the common parent of the leaP* and the decode is one MemToRegSpill Copy

Suggestion:

// In this case where the common parent of the leaP* and the decode is one MemToRegSpillCopy

src/hotspot/cpu/x86/peephole_x86_64.cpp line 288:

> 286:   bool is_spill = lea_derived_oop->in(AddPNode::Address) != decode->in(1) &&
> 287:                   lea_derived_oop->in(AddPNode::Address)->is_SpillCopy() &&
> 288:                   decode->in(1)->is_SpillCopy();

The logic around `is_spill` could be simplified by declaring pointers to the lea and decode address inputs and setting these either to their direct inputs or one level up in case of spilling. Something like this:

  Node* lea_address = lea_derived_oop->in(AddPNode::Address);
  Node* decode_address = decode->in(1);

  bool is_spill = lea_address != decode_address &&
                  lea_address->is_SpillCopy() &&
                  decode_address->is_SpillCopy();

  if (is_spill) {
    decode_address = decode_address->in(1);
    lea_address = lea_address->in(1);
  }

  // The leaP* and the decode must have the same parent. If we have a spill, they must have
  // the same grandparent.
  if (lea_address != decode_address) {
    return false;
  }

  (...)

This creates more opportunities for simplification below, e.g. rewiring the base address of the affected leas directly to `lea_address` (or `decode_address`).

src/hotspot/cpu/x86/peephole_x86_64.cpp line 297:

> 295:   }
> 296: 
> 297:   // Ensure the decode only has the leaP*s with the same (grand)parent and a MachProj leaf as children.

For unambiguity:

Suggestion:

  // Ensure the decode only has the leaP*s (with the same (grand)parent) and a MachProj leaf as children.

src/hotspot/cpu/x86/peephole_x86_64.cpp line 325:

> 323:   // Ensure the MachProj is in the same block as the decode and the lea.
> 324:   if (proj == nullptr || !block->contains(proj)) {
> 325:     return false;

The only scenario in which I think this may be possible is when we stress scheduling, otherwise `RFLAGS` projections should always be scheduled immediately after their input. Consider adding an assertion like `assert(StressGCM, "should be scheduled contiguously otherwise");` or similar here.

src/hotspot/cpu/x86/peephole_x86_64.cpp line 349:

> 347:   // Remove spill for the decode if it does not have any other uses.
> 348:   if (is_spill && decode->in(1)->is_Mach() && decode->in(1)->outcnt() == 1 && block->contains(decode->in(1))) {
> 349:     MachNode* decode_spill = decode->in(1)->as_Mach();

This could be simplified if `decode_spill` was defined already within the `if (is_spill)` statement proposed above. Furthermore, `decode->in(1)->is_Mach()` is implied by `is_spill`, so not needed.

src/hotspot/cpu/x86/x86_64.ad line 6378:

> 6376: %}
> 6377: 
> 6378: instruct cmovI_rReg_rReg_memUCF_ndd(rRegI dst, cmpOpUCF cop, rFlagsRegUCF cr, rRegI src1, memory src2) 

Please do not touch unrelated lines. We could remove trailing whitespaces in a separate cleanup RFE.

src/hotspot/cpu/x86/x86_64.ad line 6761:

> 6759: %}
> 6760: 
> 6761: instruct cmovL_regUCF_ndd(rRegL dst, cmpOpUCF cop, rFlagsRegUCF cr, rRegL src1, rRegL src2) 

Same here.

src/hotspot/cpu/x86/x86_64.ad line 6868:

> 6866: %}
> 6867: 
> 6868: instruct cmovL_rReg_rReg_memUCF_ndd(rRegL dst, cmpOpUCF cop, rFlagsRegUCF cr, rRegL src1, memory src2) 

Same here.

test/hotspot/jtreg/compiler/codegen/TestRedundantLea.java line 1:

> 1: /*

The new test cases are very thorough and well-documented, but the IR conditions checked are very specific, which might become a maintainability burden as library code, intrinsics, and platform definitions change. Is there anything that could be relaxed while still checking that the optimization is somehow applied (or not applied)?
Two things that come to mind are 1) extracting standalone Java programs that trigger the same scenarios as the current library calls and 2) relaxing the IRNode definitions and the corresponding IR check preconditions (e.g. defining and matching a single `leaP.*` node instead of specialized versions for different heap configurations, or defining and matching a single `decodeHeapOop.*` node).

test/hotspot/jtreg/compiler/codegen/TestRedundantLea.java line 97:

> 95: // The following tests ensure that we do not generate a redundant lea instruction on x86.
> 96: // These get generated on chained dereferences for the rules leaPCompressedOopOffset,
> 97: // leaP8Narow, and leaP32Narrow and stem from a decodeHeapOopNotNull that is not needed

Suggestion:

// leaP8Narrow, and leaP32Narrow and stem from a decodeHeapOopNotNull that is not needed

test/hotspot/jtreg/compiler/codegen/TestRedundantLea.java line 141:

> 139:         for (boolean negativeTest : new boolean[] {false, true}) {
> 140:             for (boolean compressedTest : new boolean[] {false, true}) {
> 141:                 //                                             leaPComperssedOopOffset  leaP(8|32)Narrow

Suggestion:

                //                                             leaPCompressedOopOffset  leaP(8|32)Narrow

-------------

Changes requested by rcastanedalo (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/25471#pullrequestreview-2920901718
PR Review Comment: https://git.openjdk.org/jdk/pull/25471#discussion_r2142587282
PR Review Comment: https://git.openjdk.org/jdk/pull/25471#discussion_r2142588194
PR Review Comment: https://git.openjdk.org/jdk/pull/25471#discussion_r2142604783
PR Review Comment: https://git.openjdk.org/jdk/pull/25471#discussion_r2142611102
PR Review Comment: https://git.openjdk.org/jdk/pull/25471#discussion_r2142620842
PR Review Comment: https://git.openjdk.org/jdk/pull/25471#discussion_r2142632453
PR Review Comment: https://git.openjdk.org/jdk/pull/25471#discussion_r2142642381
PR Review Comment: https://git.openjdk.org/jdk/pull/25471#discussion_r2142642593
PR Review Comment: https://git.openjdk.org/jdk/pull/25471#discussion_r2142642844
PR Review Comment: https://git.openjdk.org/jdk/pull/25471#discussion_r2142677949
PR Review Comment: https://git.openjdk.org/jdk/pull/25471#discussion_r2142637298
PR Review Comment: https://git.openjdk.org/jdk/pull/25471#discussion_r2142638189