RFR: 8020282: Generated code quality: redundant LEAs in the chained dereferences

Wed May 28 12:29:55 UTC 2025

On Wed, 28 May 2025 11:33:13 GMT, Galder Zamarreño <galder at openjdk.org> wrote:

>> ## Summary
>> 
>> On x86, chained dereferences of narrow oops at a constant offset from the base oop can use a `lea` instruction to perform the address computation in one go using the `leaP8Narrow`, `leaP32Narrow`, and `leaPCompressedOopOffset` matching rules. However, the generated code contains an additional `lea` with an unused result:
>> 
>> ; OptoAssembly
>> 03d     decode_heap_oop_not_null R8,R10
>> 041     leaq    R10, [R12 + R10 << 3 + #12] (compressed oop addressing) ; ptr compressedoopoff32
>> 
>> ; x86
>> 0x00007f1f210625bd:   lea    (%r12,%r10,8),%r8        ; result is unused
>> 0x00007f1f210625c1:   lea    0xc(%r12,%r10,8),%r10    ; the same computation as decode, but with offset
>> 
>> 
>> This PR adds a peephole optimization to remove such redundant `lea`s.
>> 
>> ## The Issue in Detail
>> 
>> The ideal subgraph producing redundant `lea`s, or rather redundant `decodeHeapOop_not_null`s, is `LoadN -> DecodeN -> AddP`, where both the address and base edge of the `AddP` originate from the `DecodeN`. After matching, this becomes
>> 
>> LoadN -> decodeHeapOop_not_null -> leaP*
>>     ______________________________^
>> 
>> where `leaP*` is either of `leaP8Narrow`, `leaP32Narrow`, or `leaPCompressedOopOffset` (depending on the heap location and size). Here, the base input of `leaP*` comes from the decode. Looking at the matching code path, we find that the `leaP*` rules match both the `AddP` and the `DecodeN`, since x86 can fold this, but the following code adds the decode back as the base input to `leaP*`:
>> 
>> https://github.com/openjdk/jdk/blob/c29537740efb04e061732a700582d43b1956cff4/src/hotspot/share/opto/matcher.cpp#L1894-L1897
>> 
>> On its face, this is completely unnecessary if we matched a `leaP*`, since it already computes the result of the decode, so adding the `LoadN` node as base seems like the logical choice. However, if the derived oop computed by the `leaP*` gets added to an oop map, this `DecodeN` is needed as the base for the derived oop, because as of now derived oops in oop maps cannot have narrow base pointers.
>> 
>> This leaves us with a handful of possible solutions:
>>  1. implement narrow bases for derived oops in oop maps,
>>  2. perform some dead code elimination after we know which oops are part of oop maps,
>>  3. add a peephole optimization to simply remove unused `lea`s.
>> 
>> Option 1 would have been ideal in the sense that it is the earliest possible point to remove the decode, which would simplify the graph and reduce pressure on the regi...
>
> test/hotspot/jtreg/compiler/codegen/TestRedundantLea.java line 287:
> 
>> 285:         phase = {CompilePhase.FINAL_CODE},
>> 286:         applyIfAnd = {"MaxHeapSize", "<1073741824", "UseAVX", "=3"},
>> 287:         applyIfPlatform = {"mac", "false"})
> 
> Doesn't `UseAVX=3` already imply that `mac=false`?

Almost, but not quite. The 2020 models of the MacBook Air and the 13-inch MacBook Pro feature 10th-generation Intel CPUs that support AVX-512 ([source](https://blog.reyem.dev/post/which-consumer-computers-support-avx-512/)).

Also, the two conditions serve different purposes here. `mac=false` is set because on macOS we cannot guarantee which `leaP*` variant will be generated, since ASLR causes variations in the heap layout. `UseAVX=3` is there because the test only works in that case.
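
To make the independence of the two conditions concrete, here is a rough sketch of how the rule's gating could be modelled. The class `IrRuleGate` and the method `applies` are made up for illustration and are not part of the IR framework; the sketch only mirrors the three conditions quoted above and assumes they are simply and-ed together.

```java
// Hypothetical model of the conditions on the quoted @IR rule.
// This is NOT the IR framework's API, only an illustration of the logic.
final class IrRuleGate {
    // 1073741824 bytes = 1 GiB, the MaxHeapSize bound used in the rule.
    static final long ONE_GIB = 1L << 30;

    static boolean applies(long maxHeapSize, int useAVX, boolean isMac) {
        // applyIfAnd = {"MaxHeapSize", "<1073741824", "UseAVX", "=3"}
        boolean flagsMatch = maxHeapSize < ONE_GIB && useAVX == 3;
        // applyIfPlatform = {"mac", "false"}
        boolean platformMatches = !isMac;
        return flagsMatch && platformMatches;
    }

    public static void main(String[] args) {
        long heap = 512L << 20; // 512 MiB, below the 1 GiB bound

        // AVX-512 machine that is not a Mac: the rule is checked.
        System.out.println(applies(heap, 3, false));
        // Intel Mac with AVX-512: the flags match, only the platform check excludes it.
        System.out.println(applies(heap, 3, true));
        // Machine running with UseAVX=2: the flag check alone excludes it.
        System.out.println(applies(heap, 2, false));
    }
}
```

The second call is the case in question: without the platform condition, an AVX-512 capable Intel Mac would still pass the flag checks.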

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/25471#discussion_r2111726320