RFR: 8020282: Generated code quality: redundant LEAs in the chained dereferences
Manuel Hässig
mhaessig at openjdk.org
Tue May 27 17:32:13 UTC 2025
## Summary
On x86, chained dereferences of narrow oops at a constant offset from the base oop can use a `lea` instruction to perform the address computation in one go using the `leaP8Narrow`, `leaP32Narrow`, and `leaPCompressedOopOffset` matching rules. However, the generated code contains an additional `lea` with an unused result:
```
; OptoAssembly
03d decode_heap_oop_not_null R8,R10
041 leaq R10, [R12 + R10 << 3 + #12] (compressed oop addressing) ; ptr compressedoopoff32

; x86
0x00007f1f210625bd: lea (%r12,%r10,8),%r8     ; result is unused
0x00007f1f210625c1: lea 0xc(%r12,%r10,8),%r10 ; the same computation as decode, but with offset
```
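For illustration, here is a minimal Java shape of such a chained dereference; the class and field names are hypothetical:

```java
class Outer {
    static class Inner { int x; }

    Inner inner;

    // Reading outer.inner.x loads the narrow oop `inner` from `outer`
    // (LoadN) and then dereferences it at the constant offset of field
    // `x`. With compressed oops on x86, C2 can fold the decode and the
    // offset addition into a single lea via the leaP* rules.
    static int chase(Outer outer) {
        return outer.inner.x;
    }
}
```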
This PR adds a peephole optimization to remove such redundant `lea`s.
## The Issue in Detail
The ideal subgraph producing redundant `lea`s, or rather redundant `decodeHeapOop_not_null`s, is `LoadN -> DecodeN -> AddP`, where both the address and the base edge of the `AddP` originate from the `DecodeN`. After matching, this becomes
```
LoadN -> decodeHeapOop_not_null -> leaP*
    \______________________________↑
```
where `leaP*` is one of `leaP8Narrow`, `leaP32Narrow`, or `leaPCompressedOopOffset` (depending on the heap location and size), and the base input of the `leaP*` comes from the decode. Looking at the matching code path, we find that the `leaP*` rules match both the `AddP` and the `DecodeN`, since x86 can fold this computation, but the following code adds the decode back as the base input of the `leaP*`:
https://github.com/openjdk/jdk/blob/c29537740efb04e061732a700582d43b1956cff4/src/hotspot/share/opto/matcher.cpp#L1894-L1897
At first glance, this is unnecessary if we matched a `leaP*`, since the `leaP*` already computes the result of the decode, so adding the `LoadN` node as base seems like the logical choice. However, if the derived oop computed by the `leaP*` gets added to an oop map, the `DecodeN` is needed as the base for the derived oop, because, as of now, derived oops in oop maps cannot have narrow base pointers.
This leaves us with a handful of possible solutions:
1. implement narrow bases for derived oops in oop maps,
2. perform some dead code elimination after we know which oops are part of oop maps,
3. add a peephole optimization to simply remove unused `lea`s.
Option 1 would have been ideal in the sense that it is the earliest possible point to remove the decode, which would simplify the graph and reduce pressure on the register allocator. However, rewriting the oop map machinery to remove a single `lea` is a bit of an overkill. Since the contents of oop maps are not definitive until after global code motion and register allocation, and since this issue only affects x86, a peephole is the simpler option compared to performing more DCE. So this PR introduces that peephole.
## Changes
This PR
- adds an x86 peephole optimization to remove `decodeHeapOop_not_null`s with unused results,
- adds a regression IR test with positive and negative tests from all reproducers for this issue (a sketch of its shape follows this list), and
- adds a microbenchmark to see the effect of the peephole.
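As a rough illustration only, a positive test might take the following shape with the IR test framework; the class name, the checked pattern, and the compile phase are assumptions on my part, and the actual test in the PR may well differ:

```java
import compiler.lib.ir_framework.*;

public class TestRedundantLeaSketch {
    static class Inner { int x; }
    static class Outer { Inner inner = new Inner(); }

    static Outer outer = new Outer();

    public static void main(String[] args) {
        TestFramework.run();
    }

    // Hypothetical constraint: with the peephole in place, no
    // decode_heap_oop_not_null with an unused result should remain in
    // the final code for this chained dereference.
    @Test
    @IR(failOn = {"decode_heap_oop_not_null"},
        phase = CompilePhase.PRINT_OPTO_ASSEMBLY)
    static int chainedLoad() {
        return outer.inner.x;
    }
}
```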
The peephole is a bit more powerful than just removing a decode with an unused result preceding a single `leaP*`: if multiple `leaP*`s have the decode as their base but its result is otherwise unused, the decode can still be removed. Further, if the removal of a decode leaves a redundant `MemToRegSpillCopy` behind, that spill copy is removed as well.
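The core condition can be sketched as follows; the `Node` record and its accessors are hypothetical stand-ins for illustration, not HotSpot classes, and the real peephole works on matched machine nodes and register assignments after allocation:

```java
import java.util.List;

// Hypothetical stand-in for a matched machine node (not a HotSpot class).
record Node(String rule, List<Node> uses, Node base) {

    // A decode is removable when every use is a leaP* that consumes it
    // only as the oop-map base edge; any other use means the decoded
    // oop is genuinely live and the decode must stay.
    static boolean decodeIsRemovable(Node decode) {
        for (Node use : decode.uses()) {
            boolean leaBaseOnly =
                use.rule().startsWith("leaP") && use.base() == decode;
            if (!leaBaseOnly) {
                return false;
            }
        }
        return true;
    }
}
```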
## Microbenchmark Results
| Benchmark | Mode | Cnt | Baseline | Peephole | Speedup | Units |
|---|---|---|---|---|---|---|
| RedundantLeaPeephole.benchStoreNNoAllocParallel | avgt | 30 | 1.471 ± 0.146 | 1.374 ± 0.056 | 7.06% | ns/op |
| RedundantLeaPeephole.benchStoreNNoAllocSerial | avgt | 30 | 1.454 ± 0.059 | 1.345 ± 0.046 | 8.10% | ns/op |
| RedundantLeaPeephole.benchStoreNRemoveSpillParallel | avgt | 30 | 10.789 ± 0.307 | 10.537 ± 0.302 | 2.39% | ns/op |
| RedundantLeaPeephole.benchStoreNRemoveSpillSerial | avgt | 30 | 11.364 ± 0.240 | 11.206 ± 0.165 | 1.41% | ns/op |
| RedundantLeaPeephole.benchStringEquals | avgt | 30 | 1.355 ± 0.054 | 1.23 ± 0.033 | 10.16% | ns/op |
<details>
<summary>Discussion of microbenchmark results</summary>
The `benchStringEquals` and `benchStoreNNoAlloc*` benchmarks each remove two `lea` instructions and thus exhibit similar speedups. The `benchStoreNRemoveSpill*` benchmarks remove one `lea` and one `mov` from a `MemToRegSpillCopy`, so one would expect a higher speedup. This holds in absolute numbers, but the relative improvement is smaller because these benchmarks are dominated by allocations. The allocations would also explain the larger errors for the `RemoveSpill` benchmarks.
</details>
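For context, a minimal JMH shape in the spirit of `benchStringEquals`; this is a simplified assumption of mine, not the PR's benchmark verbatim:

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class RedundantLeaSketchBench {
    String a = "abcd";
    String b = "abcd";

    @Benchmark
    public boolean stringEquals() {
        // String.equals loads the value array of each String, a chained
        // dereference through a narrow oop that used to leave behind a
        // redundant lea on x86.
        return a.equals(b);
    }
}
```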
## Testing
- [x] [Github Actions](https://github.com/mhaessig/jdk/actions/runs/15281434572)
- [ ] tier1 through tier2 plus Oracle internal testing for all Oracle supported platforms and OSs
- [ ] tier3 through tier5 plus Oracle internal testing for x86 on all supported OSs
## Acknowledgements
My thanks go out to @robcasloz for introducing me to the backend, answering my questions, and discussing this issue with me.
-------------
Commit messages:
- Add microbenchmark
- Add peephole to remove redundant leas
- Add regression test
- Remove trailing spaces
Changes: https://git.openjdk.org/jdk/pull/25471/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=25471&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8020282
Stats: 723 lines in 6 files changed: 720 ins; 0 del; 3 mod
Patch: https://git.openjdk.org/jdk/pull/25471.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/25471/head:pull/25471
PR: https://git.openjdk.org/jdk/pull/25471