RFR: 8289943: Simplify some object allocation merges [v13]
Cesar Soares Lucas
cslucas at openjdk.org
Sat Oct 22 00:34:50 UTC 2022
On Thu, 6 Oct 2022 16:50:28 GMT, Cesar Soares Lucas <cslucas at openjdk.org> wrote:
>> Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge?
>>
>> The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of:
>> 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node).
>> 2) Scalar Replace the incoming allocations to the RAM node.
>> 3) Scalar Replace the RAM node itself.
>>
>> There are a few conditions for replacing the Phi with a RAM node, though I plan to work on removing them in subsequent PRs:
>>
>> - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN.
>>
>> These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints:
>>
>> - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields.
>> - The way I check if there is an incoming Allocate node to the original Phi node.
>> - The way I check if there is no store to the merged objects after they are merged.
>>
>> Testing:
>> - Windows/Linux/MAC fastdebug/release
>> - hotspot_all
>> - tier1
>> - Renaissance
>> - dacapo
>> - new IR-based tests
>
> Cesar Soares Lucas has updated the pull request incrementally with one additional commit since the last revision:
>
> Fix x86 tests.
Hi Vladimir, first of all, thank you for reviewing the proposed patch. Sorry for
the delay in answering your questions; I'm currently on paternity leave from work
and my time in front of a computer is quite limited.
> As of now, I don't fully grasp what's the purpose and motivation to introduce
> ReducedAllocationMerge. I would be grateful for additional information about how
> you ended up with the current design. (I went through the email thread, but it
> didn't help me.)
My first implementation to deal with the problem simply replaced Phis merging
object allocations with sets of Phis merging field loads from each different
base. That works fine in some cases. However, one challenge that I faced with
that approach, IIRC, was that the new Phi nodes (merging loads of fields) were
being factored away (eliminated) as part of IGVN optimizations, and consequently
the graph ended up again with Phis merging object allocations. I considered
inserting Opaque nodes in some branches of the graph to prevent these unintended
optimizations, but in the end I found the approach below more "robust".
The proposed idea of using a new type of node to represent the merges has
these pros, IMO:
- By replacing Phis merging allocations with a new type of node (i.e., RAM), the
existing code (split_unique_types) will be able to assign `instance_id`s to the
scalar-replaceable inputs of the merging Phi. That means that those inputs will
be scalar replaced using existing code in C2. Additional code will be necessary
only to scalar replace the new type of node.
- As a side effect of the point described above, we'll also be able to use existing
code in C2 to find the values last stored to fields (e.g., find_inst_mem,
value_from_mem, etc.).
- Existing optimizations will not interfere with the scalar replacement (as they
did in the previous approach outlined) because they don't know how to handle the
new type of node. In essence, the new type of node will be Opaque to existing
optimizations.
- We can safely, AFAIU, run igvn.optimize() after replacing the allocation merge
Phis with the new type of node.
- Lastly, an additional benefit of using a new type of node to store
information, _instead_ of storing it in the ConnectionGraph, for instance, is
that by using graph edges to capture the "information" that the node needs, the
value is never in an "outdated" state. For instance, the current patch uses
input slots of the RAM node to store references to the required memory edges.
Any time a transformation in the graph affects one of those memory edges, the
RAM node will be "automatically" updated using existing code in C2. If, instead,
the memory edge were just stored as part of the internal data of some class, then
we'd need to handle those updates manually, AFAIU.
Some arguable cons of the current approach:
- It seems complex at first glance.
- A new (non-functional) node in the IR. The node is quite similar to a
PhiNode in the sense that it's there just to represent a state (or some set of
information).
- The new node is a Macro node. Failure to remove it from the IR graph can cause
a compilation failure.
> In particular, I still don't understand how it interacts with existing scalar
> replacement logic when it comes to unique (per-allocation) memory slices.
The RAM node itself doesn't interfere with alias index / memory slice creation
at all. RAM nodes are created before "adjust_scalar_replaceable_state" is
executed, and because of that the allocations participating in merges _may_ be
marked as ScalarReplaceable. Later, "split_unique_types" will run and be able to
assign instance_ids to the allocations participating in the merge, because at
this point there is "virtually" no merge anymore.
During the macro node elimination phase, the allocation nodes will be visited
and potentially scalar replaced _before_ any RAM node is visited. When an
allocation node being scalar replaced is consumed by a RAM node, some
information is "registered" in the RAM node. That information will later be used
when the time comes to scalar replace the RAM node itself.
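To illustrate what this buys us, here is a conceptual "as if" sketch (again my own illustration, not the actual IR rewrite C2 performs): once the two allocations and the RAM node from the example above are all scalar replaced, the method behaves as if only the field values flowed through the merge and no Point is ever allocated.

```java
class AllocationMergeScalarizedExample {
    // Hypothetical "as if" form of AllocationMergeExample.test() after scalar
    // replacement: only the field values are merged, no Point object is allocated.
    static int test(boolean cond) {
        int px, py;
        if (cond) {
            px = 1; py = 2;
        } else {
            px = 3; py = 4;
        }
        return px + py;
    }
}
```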
I hope I have answered your question. If not, please let me know and I'll be
happy to give more details.
> How hard would it be to extend the test with cases which demonstrate existing
> limitations?
I'll try to create some test cases that trigger those limitations.
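For example, a pattern along the following lines (a hypothetical sketch of mine, reusing the Point class from the earlier example) should exercise the "no stores to the merged object after the merge" restriction, since the store is not among the supported users (AddP->Load, SafePoints/Traps, DecodeN) listed above:

```java
class StoreAfterMergeExample {
    static int test(boolean cond) {
        // The ternary produces a Phi merging the two allocations.
        Point p = cond ? new Point(1, 2) : new Point(3, 4);
        p.x = 42;   // store through the merged reference after the merge
        return p.x + p.y;
    }
}
```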
> Also, I believe you face some ideal graph inconsistencies because you capture
> information too early (before split_unique_types and following IGVN pass; and
> previous allocation eliminations during eliminate_macro_nodes() may contribute
> to that).
Can you please elaborate on that?
> Following up on my earlier question about interactions with
> split_unique_types(), I'm worried that you remove corresponding LocalVars from
> the ConnectionGraph and introduce with unique memory slices. I'd feel much more
> confident in the correctness if you split slices for unions of interacting
> allocations instead.
Can you please explain a bit more about this idea? Are you proposing to split
the Phi into slices instead of removing it?
-------------
PR: https://git.openjdk.org/jdk/pull/9073